Lately I’ve been asked to solve NP-complete programming problems during interviews. This is interesting! It led me to find Yizhan Sun’s excellent dissertation, “Complexity of System Configuration Management”. This paper blew my mind; even though I cannot follow the formal mathematical proofs, the concepts are easy enough to grok… It also inspired me to sign up for a Coursera class that covers mathematical proofs.
The paper illustrates many insights into the theory behind systems administration from a computational theory perspective. Read it!
This is a Python program that I should have written during a phone interview (in a shared online editing session); instead I wrote it afterwards. It demonstrates an efficient functional pipeline over data without large intermediate storage. It’s a simple program: read in a file of host names and print the subset of hosts that can be connected to via HTTP.
#!/usr/bin/python
import itertools
import urllib2

def try_http_host(host):
    # Catch URLError, not just HTTPError: URLError also covers
    # DNS failures and refused connections (HTTPError subclasses it).
    try:
        urllib2.urlopen("http://%s/" % host)
        return True
    except urllib2.URLError:
        return False

def main():
    hosts = itertools.imap(lambda line: line.strip(), open('hosts'))
    for host in itertools.ifilter(try_http_host, hosts):
        print host

if __name__ == '__main__':
    main()
Simple interview programming tasks like this are meant to start a conversation; we can talk about what it would take to make it more efficient and what problems each solution presents.
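One direction that conversation might take is concurrency: checking hosts one at a time serializes network latency. Here is a minimal Python 3 sketch of that idea, with the checker predicate injected so it’s easy to test; the names are mine, not from the original program:

```python
from concurrent.futures import ThreadPoolExecutor

def alive_hosts(hosts, check, workers=8):
    """Return the hosts for which check(host) is truthy, preserving order.

    check is any predicate, e.g. a function that attempts an HTTP
    connection and returns True/False (like try_http_host above).
    """
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order even though checks run concurrently
        flags = list(pool.map(check, hosts))
    return [h for h, ok in zip(hosts, flags) if ok]
```

Because pool.map preserves input order, the output is deterministic regardless of which checks finish first.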
Briefly, I will describe how I automate DNS updates to PowerDNS (with the MySQL backend) using Bcfg2. This pattern of using Bcfg2 to verify the contents of, and automate updates to, a MySQL database could be applied to many different applications; I use it for PowerDNS.
The generic pattern:
- write a TGenshi template which generates a correct mysql dump
- write a Bcfg2 probe to dump the mysql data in place of the aforementioned templated dump
- to test the data: run bcfg2 in dry-run mode; this tests whether the mysql dump generated by the probe differs from the TGenshi-templated mysql dump
- to update the mysql database: run bcfg2 in non-dry-run mode to update the mysql database via a BoundAction in the Bcfg2 Bundle
Here is what our PowerDNS Bundle looks like:
<Bundle name='pdns-server' version='2.0'>
  <BoundPackage name='pdns-server' version='2.9.22-1~bpo50+1' type='deb'/>
  <BoundPackage name='pdns-backend-mysql' version='2.9.22-1~bpo50+1' type='deb'/>
  <BoundService name="pdns" status="on" type="deb"/>
  <Path name='/etc/powerdns/pdns.conf'/>
  <Path name='/etc/powerdns/pdns.d/pdns.local'/>
  <Path name='/etc/powerdns/pdns.d/master.conf'/>
  <Path name='/etc/powerdns/powerdns_reload.sh'/>
  <BoundAction name="reload_powerdns" timing="post" when="modified"
               command="bash /etc/powerdns/powerdns_reload.sh" status="ignore" />
  <Path name='/etc/powerdns/powerdns.sql'/>
</Bundle>
And here we have the reload_powerdns action script:
#!/bin/bash
mysql powerdns < /etc/powerdns/powerdns.sql
echo 'DROP TABLE IF EXISTS `domains_old`;
DROP TABLE IF EXISTS `records_old`;
RENAME TABLE domains TO domains_old, records TO records_old,
             domains_new TO domains, records_new TO records;' | mysql powerdns
As you can see from this reload script, our TGenshi template (/etc/powerdns/powerdns.sql), which generates the PowerDNS mysqldump, creates two tables: domains_new and records_new… which this reload script renames to domains and records. Simple!
Now let’s take a look at our PowerDNS probe:
#!/bin/bash
if [ -d /var/lib/mysql/powerdns ]; then
    mysqldump powerdns domains records \
        | grep -v 'Server version' \
        | grep -v '^-- Dump completed on' \
        | sed -e 's/`domains`/`domains_new`/g' \
        | sed -e 's/`records`/`records_new`/g' \
        | sed -e 's/),(/),\n(/g' \
        > /etc/powerdns/powerdns.sql
fi
As you can see from this probe, I in fact lied about the TGenshi template creating an exact mysqldump. The template adds a newline character at the end of each tuple (matching the probe’s final sed), so that when we run the bcfg2 client in interactive mode (e.g. bcfg2 -qedI) we can see a readable diff showing us which line in the dump has changed…
And finally, our TGenshi template which generates the mysqldump loads most of its data from several XML properties files as well as Bcfg2 metadata…
We use MyISAM tables on our archive MySQL database servers in an append-only manner. This means we’ve got data files that are continuously growing in size… In each MySQL archive shard there is a “hot” replica that is responsible for caching (in the Linux file system page cache) the last 4 hours’ worth of data. We call this latest 4 hours of data at the end of the file the “hot window”. I’ve written a couple of tools to help us detect whether the hot window is in page cache: I’ve added report_filechanges and pages_by_time to my python-ftools project on GitHub.
Usage: report_filechanges [options]
  -h, --help            show this help message and exit
Positional arguments: historical log directory, directory to monitor
report_filechanges is called from cron every five minutes; it produces a log entry for each file in the MySQL MyISAM data directory. Each log entry contains the filename, the time (in Unix epoch seconds) and the file size. We later use these historical file sizes as offsets into the file, which gives us the ability to determine the boundary of the hot region.
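The logging half of this can be sketched in a few lines. This is my own minimal reconstruction of the behavior described above (one log line per file: path, epoch seconds, size), not the actual report_filechanges code:

```python
import os
import time

def log_file_sizes(data_dir, log_dir):
    """Append one '<path> <epoch-seconds> <size>' line per data file."""
    now = int(time.time())
    log_path = os.path.join(log_dir, "filesizes.log")
    with open(log_path, "a") as log:
        for name in sorted(os.listdir(data_dir)):
            path = os.path.join(data_dir, name)
            if os.path.isfile(path):
                log.write("%s %d %d\n" % (path, now, os.path.getsize(path)))
    return log_path
```

Run from cron every five minutes, the growing log becomes a time series of file sizes; looking up the size recorded four hours ago yields the byte offset where the hot window begins.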
Usage: pages_by_time [options]
  -h, --help            show this help message and exit
  --evict               Evict historical region of files
  --summary             Summarize page cache usage of file regions
Positional argument: historical log directory
pages_by_time uses these log files to summarize page cache usage (via the mincore system call) or to evict pages from the file system page cache (via the fadvise system call).
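The eviction side can be sketched with Python 3’s os.posix_fadvise. The real pages_by_time lives in python-ftools and may differ; the function name here is mine:

```python
import os

def evict_historical(path, hot_offset):
    """Ask the kernel to drop cached pages for the 'historical' region
    [0, hot_offset), leaving the hot window at the file's tail cached."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # POSIX_FADV_DONTNEED hints that this byte range won't be needed
        os.posix_fadvise(fd, 0, hot_offset, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Here hot_offset would come from the historical size log: the file’s recorded size as of four hours ago.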
We were interested in knowing which files were accessed (reads and seeks) the most among the Cassandra data files (index, filter and data files)… I wrote a simple Python program, inotify-access, to print these file read-access statistics for a given time duration and directory. Find it in my GitHub repository:
This program makes use of the Linux kernel’s inotify system call.
If you run Debian/Ugabuga (I mean Ubuntu) then you’ll need to install the pyinotify library: apt-get install python-pyinotify
To determine the page cache usage of these files you can use fincore. I have forked linux-ftools’s fincore to make the output more easily parsable; find it at my GitHub repository here:
Additionally I’ve written cassandra_pagecache_usage, a Python program that uses the mincore system call to report page cache usage for Cassandra data sets. Previously this program parsed the output of linux-ftools’s fincore; I have since switched to the Python C extension fincore_ratio, which returns a 2-tuple of (cached pages, total pages). python-ftools is a port of linux-ftools to Python C extensions; find it in my GitHub repository here:
I’ve written a Python version of fadvise for the commandline…
fadvise example usage:
Perhaps your Cassandra node has been rebooted. You could “warm up” certain Column Families like this:
./fadvise -m willneed /mnt/var/cassandra/data/BunnyFufu/ForestActivity*
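Under the hood, such a warm-up amounts to a single fadvise call per file. Here is a sketch of the idea using Python 3’s os.posix_fadvise; the actual fadvise tool lives in python-ftools and its internals may differ:

```python
import os

def warm_up(paths):
    """Hint the kernel to pre-read whole files into the page cache,
    roughly what 'fadvise -m willneed <files>' does above."""
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            # POSIX_FADV_WILLNEED starts readahead for the given byte range
            os.posix_fadvise(fd, 0, os.path.getsize(path),
                             os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)
```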
Find cassandra_pagecache_usage at my github repository here:
Usage: cassandra_pagecache_usage [options] <cassandra-data-directory>

Options:
  -h, --help            show this help message and exit
  -c, --columnfamily-summarize
                        Summarize cached Cassandra data on a per Column Family basis.
  --exclude-filter      Exclude statistics for Cassandra Filter files.
  --exclude-index       Exclude statistics for Cassandra Index files.
  --exclude-data        Exclude statistics for Cassandra Data files.
my-cassandra-node:~/bin# PYTHONPATH=~/lib ./cassandra_pagecache_usage -c /mnt/var/cassandra/data/BunnyFufu/
Column Family    Bytes in FS page-cache
ForestActivity   3712839680
Indexes          2902822912
AnimalIndex      2369015808
AnimalCounts     1470619648
Items             786214912
Activity          264978432
Animals           133816320
Hops              127442944
The MySQL database model I am about to describe could be applied to a distributed database built on top of MySQL; in that sense I’m describing a single shard in a distributed database. The shard consists of one master replicating to two slaves.
When I talk about database replicas (a master and slaves) crashing, I’m referring to any catastrophic failure that forces us to restore the entire dataset. MySQL replication has a number of race conditions that can result in data corruption when a MySQL node is rebooted, segfaults, kernel panics or otherwise crashes. Fiery explosions and semi-automatic gunfire obviously also cause hardware failure and data loss. In the case of hardware failures we use one of the spare servers…
In this model we are prepared to handle any of these catastrophic failures of a single node in the shard. Either the master or one of the two slaves crashes. If a slave crashes, we use the other slave as our data source for recovery. If the master crashes, we promote a slave to master; the old master becomes a slave, and we recover its data from the other slave. This procedure assumes all the slaves are at the same replication binary log position.
I imagine it would be nice to use Google’s Global Transaction ID patch (read about Global Transaction IDs here); that way, if the slaves were behind in replication, the most up-to-date slave could be promoted to master and the other slave could easily resume replication with the new master.
Without Google’s Global Transaction ID patch to MySQL we’d have to be more clever to restore the other slave from the newly promoted master, since the new master will have a different set of transaction IDs than the old master. This shouldn’t be too hard with log_slave_updates enabled on all replicas and some simple subtraction of binary log positions.
Normally, client queries would be directed only to the master database. If we were to scale reads to the slaves (this is a waste of cache!) then we may not have enough I/O capacity available to both restore the crashed node and service database queries within the SLA (service level agreement)… Therefore, if we are going to scale reads to all three nodes, we must first do load testing to determine the maximum queries per second we can safely send each node during normal operation and during recovery…
Without the benefit of load testing, we’d have to send read queries only to the master, so that we can guarantee the I/O capacity of one replica is reserved for recovering from outages, rolling upgrades, etc.
I do not ever want my laptop or any of the servers I maintain to swap.
For all my applications it is never a good idea to swap. RAM is cheap.
It is true that I don’t want the OOM killer to kill the wrong process… but I don’t worry about that, because I know how to use oom_adj to prevent MySQL or sshd from getting killed.
According to a knowledgeable Linux kernel developer I spoke with, if you are going to disable swap (as in swapoff -a) it is beneficial to recompile the kernel with the config option CONFIG_SWAP=n. This avoids a performance problem: it’s been reported that kswapd will otherwise use lots of CPU. While you’re configuring your new kernel you may also want to disable tmpfs…
# swapon -a
swapon: /dev/mapper/longhaul-swap_1: swapon failed: Function not implemented
Please do not swap ever again!
Lately, for Spinn3r, I’ve been writing programs in Python to automate database operations. This hotcopy script utilizes LVM; previously we didn’t use LVM, so the script used to do an InnoDB freeze. The simple idea here is to restore a MySQL shard replica’s data from another replica…
Look at the code here:
mysql cluster tools @ bitbucket.org
The cool feature this hotcopy script has is an undo/restore mechanism that kicks in when an exception is caught. Basically, I’ve implemented a stack of nullary functions: with each state change, I push the reverse (undo/restore) operation onto the stack. When an exception is caught, the rollback() method repeatedly pops and executes the most recent nullary until none are left.
What I’m calling a nullary function is merely a function call wrapped in a zero-argument lambda (which is in fact a closure: the lambda captures the wrapped call’s arguments from the enclosing scope).
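The mechanism can be sketched in a few lines; the names here are mine, and the real script at bitbucket.org is more involved:

```python
class UndoStack:
    """Collect zero-argument undo callables; run them LIFO on rollback."""
    def __init__(self):
        self._undos = []

    def push(self, undo):
        self._undos.append(undo)

    def rollback(self):
        while self._undos:
            self._undos.pop()()  # run the most recent undo first

# Usage sketch: each state change pushes its reverse operation.
state = {"replication": "running"}
undos = UndoStack()

state["replication"] = "stopped"                          # state change...
undos.push(lambda: state.update(replication="running"))   # ...and its undo

try:
    raise RuntimeError("LVM snapshot failed")             # something breaks
except RuntimeError:
    undos.rollback()                                      # replication resumes
```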
Python’s excellent exception handling makes this rollback() feature really useful, because there are lots of moving parts at work that could break and throw an exception. The hotcopy process causes several state changes on the source replica, such as: setting single-user mode, stopping replication (if the source is a slave), taking an LVM snapshot, mounting the snapshot, etc…
I don’t like LVM snapshots lying around so we can COW forever! Nor would I want a database server to remain in “single user mode”…
Three ways to extend this project:
Check it out at bitbucket.org, my MySQL InnoDB Hotcopy program.
This is not even a release candidate… but if anyone wants to look at the code, feel free.
Please leave your thoughts and comments.
At first, Ganglia seems like an excellent tool, with an efficient, reliable and highly available design. In practice, however, Ganglia is quite brittle and suffers from numerous bugs and design flaws.
Ganglia seems to be mostly self-configuring… except for a few modules out there that need configuration parameters. If I need to use these modules, I’ll modify them to be self-configuring, as I did with multidisk.py… Anyone who maintains a fair-sized cluster knows that, with a little code, it pays to be lazy.
Ganglia seems way better than Cacti and the rest. Initially I was disappointed that the standard set of modules didn’t allow me to monitor throughput on two different ethernet interfaces. Perhaps the ganglia-developers don’t run multi-homed servers, or maybe they just don’t care how much throughput they use. I wrote a Python module to graph usage for an internal and an external interface, because this will help us project how much we’d pay a different data-center facility to host our cluster.
Gilad Raphaelli wrote a really cool embedded Python module for monitoring MySQL metrics… It monitors 100 metrics, including various InnoDB buffers. Perfect! It also comes with one custom report for queries:
I haven’t had time to look at it yet, but Silvan Mühlemann wrote a custom report PHP script for MySQL.
Admittedly, I don’t know how best to use gmond’s Python module interface, even though I’ve written several modules already… I think it’s supposed to make it easy to write metric monitors with some embedded Python code. But I like using gmetric more: it seems to take less code to produce the equivalent metric monitoring than the embedded Python interface does.
Also, I’ve noticed people trying to use the Python interface in interesting ways, e.g. spawning metric-collector threads that populate a cache, etc. I was hoping to be able to write Ganglia metric modules without having to think about threads, caches and race conditions. Couldn’t a module harness take care of these details?
I’m working on gmetric-daemon, a simple Python forking daemon with a modular interface that calls gmetric (via system() or popen())… It’s not very memory efficient.
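For the gmetric call itself, building the argument vector separately from executing it keeps the daemon testable. A sketch, with the long-form flag names as documented for Ganglia’s gmetric CLI (the wrapper names are mine, not from gmetric-daemon):

```python
import subprocess

def gmetric_argv(name, value, type_="uint32", units=""):
    """Build the gmetric command line for one metric sample."""
    return ["gmetric", "--name", name, "--value", str(value),
            "--type", type_, "--units", units]

def send_metric(name, value, **kw):
    # Shell out to gmetric; raises CalledProcessError if it fails.
    subprocess.check_call(gmetric_argv(name, value, **kw))
```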
Perhaps using gmetric in this way is a silly approach… But right now I think I like it.
Here’s the GitHub repository for the work in progress I call
I got Postfix e-mail submissions to Bugzilla (3.0.5) working properly.
Perhaps these notes of mine could save someone some trouble when attempting this.
Certainly Postfix could accomplish e-mail submissions via a custom transport using a pipe. I, however, decided to use a pipe in the /etc/aliases file; mine contains this important line:
bug-submit: "|/var/www/bugz/email_in.pl -vvv 2>/tmp/emailin.log"
Note that for troubleshooting I can take a look at email_in.pl’s STDERR in /tmp/emailin.log.
Log in via the Bugzilla admin account and go to the Email section of the Parameters page.
Change the mailfrom to match the above e-mail alias so that Bugzilla users can add a comment to a bug by replying to Bugzilla’s e-mails.
I’m using SPF to verify sender e-mail addresses.
Here’s part of my /etc/postfix/main.cf containing some configuration for SPF:
alias_maps = hash:/etc/aliases
policyd-spf_time_limit = 3600
and part of my /etc/postfix/master.cf:
policy-spf unix - n n - - spawn
Next I got DomainKeys working.
SPF and DomainKeys are especially important for this setup because Bugzilla will not be doing any e-mail spam filtering. All a spammer would have to do to submit annoying bugs into our Bugzilla system would be to forge an e-mail from a Bugzilla user’s e-mail address and send it to email@example.com… This is why I want SPF and DomainKeys fully operational; that way, many forgery attempts will be rejected.
The DKIM filters for inbound and outbound mail are started like this:
/usr/local/dkimproxy/bin/dkimproxy.in --listen=127.0.0.1:10025 --relay=127.0.0.1:10026 \
    --user=dkim --group=dkim --daemonize --pidfile=/var/run/dkimproxy.in

/usr/local/dkimproxy/bin/dkimproxy.out --listen=127.0.0.1:10027 --relay=127.0.0.1:10028 \
    --keyfile=/usr/local/dkimproxy/etc/private.key --selector=selector1 --domain=bugzilla.spinn3r.com \
    --user=dkim --group=dkim --signature=dkim --daemonize --pidfile=/var/run/dkimproxy.out
For filtering inbound mail via DKIM, edit master.cf with something like this:
# Before-filter SMTP server. Receive mail from the network and
# pass it to the content filter on localhost port 10025.
smtp inet n - n - - smtpd
# After-filter SMTP server. Receive mail from the content filter on
# localhost port 10026.
127.0.0.1:10026 inet n - n - - smtpd
For outgoing DKIM, edit master.cf like this:
## outgoing dkim
submission inet n - n - - smtpd
dksign unix - - n - 10 smtp
127.0.0.1:10028 inet n - n - 10 smtpd
Finally, do a postfix reload.