David’s Blog

np-complete sysadmin

Posted in Uncategorized by david415 on January 29, 2013

Lately I’ve been asked to solve np-complete programming problems during interviews. This is interesting! It lead me to find Yizhan Sun’s excellent dissertation paper called “Complexity of System Configuration Management“. This paper blew my mind even though I cannot understand the formal mathematical proofs; the concepts are easy enough to grok… This inspired me to sign up for a Coursera class that covers mathematical proofs.

The paper illustrates many insights into the theory behind systems administration from a computational theory perspective. Read it!

Sysadmin interview programming questions

Posted in Uncategorized by david415 on January 10, 2013

This is a Python program that I should have written during a phone interview (shared online edit session); instead I wrote it afterwards. This program demonstrates an efficient functional pipeline of data without large intermediate storage. It’s a simple program. Read in a file of host names and return a subset of the hosts that can be connected to via HTTP.

#!/usr/bin/python
import urllib2
import itertools

def try_http_host(host):
  try:
      urllib2.urlopen("http://%s/" % host)
      return True
  except urllib2.HTTPError:
      return False

def main():

  for x in itertools.ifilter(try_http_host, itertools.imap(lambda x: x.strip(), open('hosts'))):
      print x

if __name__ == '__main__':
  main()

Simple interview programming tasks like this are meant to start a conversation; we can talk about what it would take to make it more efficient and what problems these solutions present us.

automating powerdns mysql with bcfg2

Posted in Systems Engineering / Unix Systems Operations by david415 on May 7, 2011

Briefly I will describe how I automate DNS updates to PowerDNS (with mysql backend) with bcfg2;
This pattern of using bcfg2 to verify contents of and automate updates to a mysql database could be applied to many different applications. I use it for PowerDNS.

the generic pattern

  • write a TGenshi template which generates a correct mysql dump
  • write a Bcfg2 probe to dump the mysql data in place of the aforementioned templated dump
  • to test data: run bcfg2 in dry-run mode; tests if the mysql dump generated by the probe differs from the TGenshi templated mysql dump
  • to update the mysql database: run bcfg2 in non-dry-run mode to update the mysql database via a BoundAction in the Bcfg2 Bundle

Here is what our Power DNS Bundle looks like:

<Bundle name='pdns-server' version='2.0'>

    <BoundPackage name='pdns-server' version='2.9.22-1~bpo50+1' type='deb'/>
    <BoundPackage name='pdns-backend-mysql' version='2.9.22-1~bpo50+1' type='deb'/>
    <BoundService name="pdns" status="on" type="deb"/>

    <Path name='/etc/powerdns/pdns.conf'/>
    <Path name='/etc/powerdns/pdns.d/pdns.local'/>
    <Path name='/etc/powerdns/pdns.d/master.conf'/>

    <Path name='/etc/powerdns/powerdns_reload.sh'/>
    <BoundAction name="reload_powerdns" timing="post" when="modified" command="bash /etc/powerdns/powerdns_reload.sh" status="ignore" />
    <Path name='/etc/powerdns/powerdns.sql'/>

</Bundle>

And here we have the reload_powerdns action script:

#!/bin/bash
mysql powerdns < /etc/powerdns/powerdns.sql
echo 'DROP TABLE IF EXISTS `domains_old`; DROP TABLE IF EXISTS `records_old`; RENAME TABLE domains TO domains_old, records TO records_old, domains_new TO domains, records_new TO records;'|mysql powerdns

As you can see from this reload script, our TGenshi template (/etc/powerdns/powerdns.sql) which generates the PowerDNS mysqldump has two tables; domains_new and records_new… which this reload script renames to doamins and records. Simple!

Now let’s take a look at our Powerdns probe:

#!/bin/bash
if [ -d /var/lib/mysql/powerdns ]; then
    mysqldump powerdns domains records | grep -v 'Server version' | grep -v '^-- Dump completed on' \
        | sed -e 's/`domains`/`domains_new`/g' | sed -e 's/`records`/`records_new`/g' | sed -e 's/),(/),\n(/g' > /etc/powerdns/powerdns.sql
fi

As you can see from this probe that I in fact lied about the TGenshi template creating an exact mysqldump. It does so with a newline character at the end of each tuple so that when we run the bcfg2 client in interactive mode (e.g. bcfg2 -qedI) we can see a readable diff showing us which line in the dump has changed…

And finally our TGenshi template which generates the mysqldump loads most of it’s data from several XML properties files as well as bcfg2 metadata…

using mincore to detect page cache efficiency of Mysql’s MyIsam

Posted in Uncategorized by david415 on November 21, 2010

We use MyIsam tables on our archive mysql database servers in an append only manner. This means that we’ve got data files that are continuously growing in size… In each mysql database archive shard there is a “hot” replica that is responsible for caching (in the Linux file system page cache) the last 4 hours worth of data. This latest 4 hours worth of data at the end of the file we call the “hot-window”. I’ve written a couple tools to help us detect if the hot-window is in page cache. I’ve added report_filechanges and pages_by_time to my python-ftools project on github.

Usage: report_filechanges [options]

Options:
-h, --help show this help message and exit
--hist-log-dir=HIST_LOG_DIR
historical log directory
--dir-to-monitor=MONITOR_DIR
director to monitor

report_filechanges is called from cron every five minutes; it produces a log file entry for each file in the Mysql MyIsam data directory. Each log entry contains the filename, the time (in Unix epoch seconds) and the file size. We later use these historical file sizes as offsets in the file. This gives us the ability to determine the boundary of the hot region.

Usage: pages_by_time [options]

Options:
-h, --help show this help message and exit
--evict Evict historical region of files
--summary Summarize page cache usage of file regions
--hist-log-dir=HIST_LOG_DIR
historical log directory
--evict-time=EVICT_TIME
eviction time

pages_by_time uses these logs files to summarize page cache usage (via the mincore system call) or to evict pages from file system page cache (via the fadvise system call).

Cassandra data storage performance statistics with inotify and fincore

Posted in Systems Engineering / Unix Systems Operations by david415 on September 3, 2010

We were interested in knowing which files were accessed (reads and seeks)
the most among the Cassandra data files (index, filter and data files)…
I wrote this simple Python program, inotify-access, to print these file read access statistics for a given time duration and directory. Find it in my github repository:
http://github.com/david415/inotify-access

This program makes use of the Linux kernel’s inotify system call.
If you run Debian/Ugabuga (I mean Ubuntu) then you’ll need to install the pynotify library: apt-get install python-pyinotify

To determine page cache usage of these files you can use fincore. I have forked the linux-ftools’s fincore to make the output more easily parsable at my github repository here:
http://github.com/david415/linux-ftools

Additionally I’ve written cassandra_pagecache_usage, a Python program that uses the mincore system call to report page cache usage for Cassandra data sets. Previously this program used to parse the output of linux-ftools’s fincore. However I have since switched to using the
the Python C extension fincore_ratio which returns a 2 tuples (cached pages, total pages). python-ftools is a linux-ftools port to Python C extensions; find it in my github repository here :
http://github.com/david415/python-ftools

I’ve written a Python version of fadvise for the commandline…
fadvise example usage:

Perhaps your cassandra node has been rebooted. You could “warm up” certain Column Families like this :

./fadvise -m willneed /mnt/var/cassandra/data/BunnyFufu/ForestActivity*

Find cassandra_pagecache_usage at my github repository here:
http://github.com/david415/cassandra-pagecache-usage

Usage: cassandra_pagecache_usage [options] <cassandra-data-directory>
 
Options:
  -h, --help            show this help message and exit
  -c, --columnfamily-summarize
                        Summarize cached Cassandra data on a per Column Family
                        basis.
  --exclude-filter      Exclude statistics for Cassandra Filter files.
  --exclude-index       Exclude statistics for Cassandra Index files.
  --exclude-data        Exclude statistics for Cassandra Data files.
 

example output:

my-cassandra-node:~/bin# PYTHONPATH=~/lib ./cassandra_pagecache_usage -c /mnt/var/cassandra/data/BunnyFufu/
Column Family    Bytes in FS page-cache    
ForestActivity   3712839680                
Indexes          2902822912                
AnimalIndex      2369015808                
AnimalCounts     1470619648                
Items            786214912                 
Activity         264978432                 
Animals          133816320                 
Hops             127442944                 

highly available mysql database topology

Posted in Systems Engineering / Unix Systems Operations by david415 on January 16, 2010

This Mysql database model I am about to describe could be applied to a distributed database built on top of Mysql.
In that sense I’m describing a single shard in a distributed database.
The shard consists of one master replicating to two slaves.

When I talk about database replicas (a master and slaves) crashing, I’m referring to any catastrophic failure that causes us to have to restore the entire dataset. Mysql replication has a number of race conditions that can result in data corruption when the Mysql node is rebooted, seg faulted, kernel paniced or otherwise crashed. Fiery explosions and semi automatic gun fire obviously also cause hardware failure and data loss. In the case of hardware failures we use one of the spare servers…

In this model we are prepared to handle any of these catastrophic failures to a single node in the shard.
Either the master or one of the two slaves crash. If a slave crashes we use the other slave as our data source to recover from.

If the master database crashes then we promote a slave to be master.
The old master becomes a slave; we recover it’s data from the other slave.
This procedure assumes all the slaves are at the same replication binary log position.
I imagine it would be nice to use Google’s Global Transaction ID patch (read about Global Transactions IDs here); That way if the slaves were behind in replication, the most up to date slave could be promoted
to be the master and the other slave could easily resume replication with the new master.

Without Google’s Global Transaction ID patch to Mysql we’d have to be more clever to be able to restore the slave from the newly promoted master since the new master will have a different set of transaction IDs than the old master. This shouldn’t be too hard with log_slave_updates enabled on all replicas and some simple subtraction of binary log positions.

Normally client queries would only be directed to the master database. If we were to scale reads to the slaves (this is a waste of cache!) then we may not have enough io capacity available to both restore the crashed node and service database queries within the SLA (service level agreement)… Therefore if we are going to scale reads to all three nodes we must first do load testing to determine the max queries per second we can safely send each node during normal operation and during recovering…

Without the benefit of load testing we’d have to send the read queries to only the master so that we can guarantee the io capacity of one replica for operational capacity for recovering from outages, rolling upgrades etc.

running linux with no swap

Posted in Systems Engineering / Unix Systems Operations by david415 on November 21, 2009

I do not ever want my laptop or any of the servers I maintain to swap.
For all my applications it is never a good idea to swap. RAM is cheap.
It is true that I don’t want the OOM killer to kill the wrong process… But I don’t worry about
that because I know how to use the oom_adj to prevent Mysql or sshd from getting killed.

According to a knowledgeable Linux kernel developer I spoke with if you are going to disable swap (as in swapoff -a) it is beneficial to recompile with the kernel config option SWAP=n. This avoids a performance problem; it’s been reported that kswapd will otherwise use lots of CPU. While your configuring your new kernel you may also want to disable tmpfs

# swapon -a
swapon: /dev/mapper/longhaul-swap_1: swapon failed: Function not implemented

Dear Linux,
Please do not swap ever again!
kthx bye

Mysql Innodb Hotcopy in Python

Posted in Systems Engineering / Unix Systems Operations by david415 on February 1, 2009

Lately for Spinn3r I’ve been writing programs in Python to automate database operations. This hotcopy script utilizes LVM. Previously we didn’t use LVM and so this script used to do an Innodb Freeze. The simple idea here is to restore a mysql shard replica’s data from another replica…

Look at the code here:
mysql cluster tools @ bitbucket.org

The cool feature this hotcopy script has is an undo/restore mechanism that
kicks in when an exception is caught. Basically I’ve implemented a nullary function queue.
With each state change, I push the reverse (or undo/restore operation) onto the queue. When an exception is caught the rollback() method repeatedly pops and executes the last nullary off the queue until there are none left.

What I’m calling a nullary function is merely a function call wrapped in a closure (although maybe that’s not correct because they say Python doesn’t have true closures) in the form of a lambda with zero arguments.

Python’s excellent exception handling makes this rollback() feature really useful because there are lots of moving parts at work that could break and throw an exception. The hotcopy process causes several state changes on the source replica such as : set single user mode, stop replication (if the source is a slave), take LVM snapshot, mount snapshot etc…

I don’t like LVM snapshots laying around so we can COW forever! Nor would I want a database server to remain in “single user mode”…

Three ways to extend this project:

  • Patch MySQL for faster InnoDB crash recovery.
  • A distributed/highly available persistent storage mechanism for the restore queue to allow a rollback even after the server running the hotcopy program, crashes.
  • A mechanism to invoke this program/API to fully automate crash recovery. A centralized design involving a voting protocol…
  • I’m sure many like Spinn3r have similar infrastructure goals. The above InnoDB modification is one of several Desirable Innodb Features that Spinn3r will probably throw down cash for…

    Check it out at bitbucket.org, my Mysql Innodb Hotcopy program.
    This is not even a release candidate… but if anyone wants to look at the code… feel free.

    Please leave your thoughts and comments.

    Ganglia rules… sort of.

    Posted in Systems Engineering / Unix Systems Operations by david415 on February 1, 2009

    At first Ganglia seems like an excellent tool.
    It seems to have an excellently efficient and reliable/highly available design.
    But actually Ganglia is very brittle and suffers numerous bugs and design flaws.

    Ganglia seems to be mostly self configuring… except for a few modules out there that need configuration parameters. If I need to use these modules I’ll modify them to be self configuring like I did with multidisk.py… Anyone who maintains a fair sized cluster knows with a little code it pays to be lazy.

    Ganglia seems to be way better than Cacti and the rest. Initially I was disappointed with the standard set of modules which didn’t allow me to monitor the throughput on two different ethernet interfaces. Perhaps the ganglia-developers don’t run multi-homed servers. Or maybe they just don’t care about how much throughput they use. I wrote a python module to graph usage for an internal and external interface because this will help us project how much we’d be paying a different data-center facility to host our cluster.

    Gilad Raphaelli wrote a really cool embedded python module for monitoring mysql metrics… It monitors 100 metrics including various Innodb buffers. Perfect! It also comes with one custom report for queries :


    mysql query report

    innodb transactionsl
    mysql threads

    I haven’t had time to look at it yet but Silvan Mühlemann wrote
    a custom report php script for mysql

    Admittedly I don’t know how best to use gmond’s python module interface even though I’ve written several modules already… I think its supposed to make it easy to write metric monitors with some embedded python code. But I like using gmetric more. It seems to be less code to produce the equivalent monitoring of metrics than the embedded python interface.

    Also I’ve noticed people trying to use the python interface in interesting ways… e.g. spawning metric collector threads that populate a cache etc. I was hoping to be able to write ganglia metric modules without having to think about threads, cache and race conditions. Couldn’t a module harness take care of these details?

    I’m working on gmetric-daemon which is a simple python forking daemon with a modular interface that calls gmetric (via system() or popen())… It’s not very memory efficient.
    Perhaps using gmetric in this way is a silly approach… But right now I think I like it.

    Here’s the gitHub repository for the work in progress I call

    Bugzilla Postfix e-mail integration

    Posted in Systems Engineering / Unix Systems Operations by david415 on November 25, 2008

    I got Postfix e-mail submissions to Bugzilla (3.0.5) working properly.
    Perhaps these notes of mine could save someone some trouble when attempting this.

    Certainly postfix could accomplish e-mail submissions via a custom transport using a pipe.
    I however decided to use a pipe in the /etc/aliases file; mine contains this important line :

    bug-submit: "|/var/www/bugz/email_in.pl -vvv 2>/tmp/emailin.log"

    Note that for troubleshooting I can take a look at email_in.pl’s STDERR in /tmp/emailin.log;

    Log in via the Bugzilla admin account and go to the Email section of the Parameters page.
    Change the mailfrom to match the above e-mail alias so that Bugzilla users can add a comment to a bug by replying to Bugzilla’s e-mails.


    I’m using SPF to verify sender e-mail addresses.
    Here’s part of my /etc/postfix/main.cf containing some configuration for SPF :

    alias_maps = hash:/etc/aliases

    smtpd_recipient_restrictions =
    permit_mynetworks
    reject_unauth_destination
    check_policy_service unix:private/policy-spf

    policyd-spf_time_limit = 3600

    and part of my /etc/postfix/master.cf :

    policy-spf unix - n n - - spawn
    user=nobody argv=/usr/bin/policyd-spf

    Next I got DomainKeys working.
    SPF and DomainKeys are especially important for this setup because
    Bugzilla will not be doing and e-mail spam filtering.
    All a spammer would have to do to submit annoying bugs into our Bugzilla
    system would be to forge an e-mail from a Bugzilla user’s e-mail address
    and send it to bug-submit@xxx.xxx… This is why I want SPF and DomainKeys fully
    operational… that way many forgery attempts will be rejected.

    The DKIM filters for inbound and outbound mail are started like this :

    
    /usr/local/dkimproxy/bin/dkimproxy.in --listen=127.0.0.1:10025 --relay=127.0.0.1:10026 \
    
    --user=dkim --group=dkim --daemonize --pidfile=/var/run/dkimproxy.in
    
    /usr/local/dkimproxy/bin/dkimproxy.out --listen=127.0.0.1:10027 --relay=127.0.0.1:10028 \
    
    --keyfile=/usr/local/dkimproxy/etc/private.key --selector=selector1 --domain=bugzilla.spinn3r.com \
    
    --user=dkim --group=dkim --signature=dkim --daemonize --pidfile=/var/run/dkimproxy.out

    For filtering inbound mail via DKIM edit the master.cf with something like this :

    # Before-filter SMTP server. Receive mail from the network and
    # pass it to the content filter on localhost port 10025.
    #
    smtp inet n - n - - smtpd
    -o smtpd_proxy_filter=127.0.0.1:10025
    -o smtpd_client_connection_count_limit=10
    # DKIM
    # After-filter SMTP server. Receive mail from the content filter on
    # localhost port 10026.
    127.0.0.1:10026 inet n - n - - smtpd
    -o smtpd_authorized_xforward_hosts=127.0.0.0/8
    -o smtpd_client_restrictions=
    -o smtpd_helo_restrictions=
    -o smtpd_sender_restrictions=
    -o smtpd_recipient_restrictions=permit_mynetworks,reject
    -o smtpd_data_restrictions=
    -o mynetworks=127.0.0.0/8
    -o receive_override_options=no_unknown_recipient_checks

    for outgoing DKIM edit the master.cf like this:


    ## outgoing dkim
    submission inet n - n - - smtpd
    -o smtpd_etrn_restrictions=reject
    -o content_filter=dksign:[127.0.0.1]:10027
    -o receive_override_options=no_address_mappings
    -o smtpd_recipient_restrictions=permit_mynetworks,reject
    dksign unix - - n - 10 smtp
    -o smtp_send_xforward_command=yes
    -o smtp_discard_ehlo_keywords=8bitmime
    127.0.0.1:10028 inet n - n - 10 smtpd
    -o content_filter=
    -o receive_override_options=no_unknown_recipient_checks,no_header_body_checks
    -o smtpd_helo_restrictions=
    -o smtpd_client_restrictions=
    -o smtpd_sender_restrictions=
    -o smtpd_recipient_restrictions=permit_mynetworks,reject
    -o mynetworks=127.0.0.0/8
    -o smtpd_authorized_xforward_hosts=127.0.0.0/8

    Do a postfix reload

    Follow

    Get every new post delivered to your Inbox.