David’s Blog

Cassandra data storage performance statistics with inotify and fincore

Posted in Systems Engineering / Unix Systems Operations by david415 on September 3, 2010

We wanted to know which of the Cassandra data files (index, filter and data files) were accessed (reads and seeks) the most.
I wrote a simple Python program, inotify-access, to print file read access statistics for a given directory over a given time duration. Find it in my github repository:
http://github.com/david415/inotify-access

This program makes use of the Linux kernel’s inotify API.
If you run Debian/Ugabuga (I mean Ubuntu) then you’ll need to install the pyinotify library: apt-get install python-pyinotify

To determine page cache usage of these files you can use fincore. I have forked linux-ftools’s fincore to make its output easier to parse; find the fork in my github repository here:
http://github.com/david415/linux-ftools

Additionally I’ve written cassandra_pagecache_usage, a Python program that uses the mincore system call to report page cache usage for Cassandra data sets. It used to parse the output of linux-ftools’s fincore, but I have since switched to the Python C extension fincore_ratio, which returns a 2-tuple (cached pages, total pages). python-ftools is a port of linux-ftools to Python C extensions; find it in my github repository here:
http://github.com/david415/python-ftools
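
For readers who want to see what a mincore-based check looks like, here is a rough sketch in pure Python using ctypes. It is not the python-ftools implementation (that one is a C extension); the function name page_cache_ratio is hypothetical, and the code assumes Linux and Python 3. It maps the file read-only, asks mincore which pages are resident, and returns the same kind of (cached pages, total pages) 2-tuple.

```python
import ctypes
import mmap
import os

_libc = ctypes.CDLL(None, use_errno=True)
_libc.mmap.restype = ctypes.c_void_p
_libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                       ctypes.c_int, ctypes.c_int, ctypes.c_long]
_libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]
_MAP_FAILED = ctypes.c_void_p(-1).value  # (void *) -1


def page_cache_ratio(path):
    """Return (cached_pages, total_pages) for the file at path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return (0, 0)  # an empty file cannot be mapped
        total = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
        # Mapping the file does not read it in, so residency is unchanged.
        addr = _libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        if addr == _MAP_FAILED:
            raise OSError(ctypes.get_errno(), "mmap failed")
        try:
            vec = (ctypes.c_ubyte * total)()
            if _libc.mincore(addr, size, vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            # The low bit of each byte says whether that page is resident.
            cached = sum(b & 1 for b in vec)
        finally:
            _libc.munmap(addr, size)
        return (cached, total)
    finally:
        os.close(fd)
```

Multiplying the first element by mmap.PAGESIZE gives the number of bytes of that file currently held in the page cache.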

I’ve also written a Python version of fadvise for the command line.
fadvise example usage:

Perhaps your Cassandra node has been rebooted. You could “warm up” certain Column Families like this:

./fadvise -m willneed /mnt/var/cassandra/data/BunnyFufu/ForestActivity*
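
The same kind of warm-up can be sketched directly in Python 3 with the standard library's os.posix_fadvise. This is only an illustration under my own assumptions, not the source of the fadvise tool above; the function name warm_up is hypothetical.

```python
import glob
import os


def warm_up(pattern):
    """Advise the kernel that every file matching pattern will be needed soon.

    Returns the list of paths that were advised.
    """
    advised = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        fd = os.open(path, os.O_RDONLY)
        try:
            # offset 0 and length 0 mean "the whole file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)
        advised.append(path)
    return advised
```

For example, warm_up("/mnt/var/cassandra/data/BunnyFufu/ForestActivity*") would cover the same files as the command above. WILLNEED is only a hint, so the kernel is free to read in less than the whole file.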

Find cassandra_pagecache_usage at my github repository here:
http://github.com/david415/cassandra-pagecache-usage

Usage: cassandra_pagecache_usage [options] <cassandra-data-directory>

Options:
  -h, --help            show this help message and exit
  -c, --columnfamily-summarize
                        Summarize cached Cassandra data on a per Column Family
                        basis.
  --exclude-filter      Exclude statistics for Cassandra Filter files.
  --exclude-index       Exclude statistics for Cassandra Index files.
  --exclude-data        Exclude statistics for Cassandra Data files.
 

Example output:

my-cassandra-node:~/bin# PYTHONPATH=~/lib ./cassandra_pagecache_usage -c /mnt/var/cassandra/data/BunnyFufu/
Column Family    Bytes in FS page-cache
ForestActivity   3712839680
Indexes          2902822912
AnimalIndex      2369015808
AnimalCounts     1470619648
Items            786214912
Activity         264978432
Animals          133816320
Hops             127442944
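
A per Column Family summary like the one above can be sketched as a simple aggregation over per-file results. This is a guess at the approach, not the actual cassandra_pagecache_usage code: it assumes data files are named ColumnFamily-generation-Type.db (for example ForestActivity-12-Data.db), assumes a 4096-byte page size, and the function name summarize is hypothetical.

```python
from collections import defaultdict


def summarize(cached_pages_by_file, page_size=4096):
    """Sum cached bytes per column family, largest first.

    cached_pages_by_file maps a file name to its number of cached pages.
    The column family is assumed to be the part of the name before the
    first "-" (an assumption about the file naming scheme).
    """
    totals = defaultdict(int)
    for filename, pages in cached_pages_by_file.items():
        column_family = filename.split("-", 1)[0]
        totals[column_family] += pages * page_size
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The input dictionary would come from running a page cache check on each file in the data directory and keeping the cached-pages count.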

2 Responses


  1. Stu Hood said, on October 13, 2010 at 10:58 am

    Would love to see some example output… this looks like a great toolset.

  2. david415 said, on October 15, 2010 at 9:01 am

    i’ve updated this blog post with example output for cassandra_pagecache_usage

