sprocket i/o

thomas stromberg on technology, nature, and motorcycles

sprocket i/o header image 2

SystemTap: It’s like dtrace for linux, yo.

November 28th, 2007 · Comments

Spurred by Matty’s blog entry on Monitoring system utilization on Linux hosts, I decided to take a fresh look at SystemTap. SystemTap is a way for users to tap into the Linux kernel to see what’s going on at any given point, and resembles DTrace. One way to think about SystemTap is as a global strace with a built-in scripting language.

Installing SystemTap

Thankfully, SystemTap is now easy to load into most Linux distributions. For Ubuntu, run:

sudo apt-get install systemtap
sudo apt-get install linux-image-debug-generic
sudo ln -s /boot/vmlinux-debug-$(uname -r) /lib/modules/$(uname -r)/vmlinux

If you’re a Fedora user, try:

yum install systemtap kernel-devel
yum --enablerepo=fedora-debuginfo --enablerepo=updates-debuginfo install kernel-debuginfo

Who’s using my disk I/O?

One of my favorite (now retired) interview questions for UNIX administrators is “iostat shows a lot of write activity to a device. How do you determine what program is responsible for writing the most data to this device?”. Before SystemTap, the only way to answer this under Linux was to install kernel counter patches, or try to parse block_dump. Using the disktop script from the SystemTap Scripts & Tools page, it’s trivial to find this kind of information out:

% stap disktop.stp
[sudo] password for jars:

Wed Nov 28 03:08:51 2007 , Average: 204Kb/sec, Read:     919Kb, Write:    100Kb

     UID      PID     PPID                       CMD   DEVICE    T        BYTES
    1000    18668    18664               firefox-bin     sda1    R       633140
    1000     6900     6849                  metacity     sda1    R       229200
    1000    18668    18664               firefox-bin     sda1    W       103405
    1000    18664    18652            run-mozilla.sh     sda1    R        11196
    1000    18659    18658                        ls     sda1    R        10272
    1000     6903     6849               gnome-panel     sda1    R         9482
    1000    18652        1                   firefox     sda1    R         5960
    1000    18653    18652                   dirname     sda1    R         3290
    1000    18654    18652                  basename     sda1    R         3290
    1000    18655    18652                       pwd     sda1    R         3290

The disktop.stp script is about 60 lines long. This is a bit hefty to start with as your first script, but imagine how many lines of C it would take to get the same data!

gettimeofday() tracking

If you look at the most common system calls executed on your system (trivial to do in SystemTap, btw), you’ll note that gettimeofday() is called an awful lot. Someone a few weeks ago was complaining to me that a recent Linux kernel upgrade made a database drastically slower due to a change in how gettimeofday() was handled. Evidently this application called gettimeofday() hundreds of times a second. So, I decided to checkout what applications were calling gettimeofday() on my Ubuntu instance. Using the SystemTap Documentation, I was able to whip something up in a few minutes:

% stap -e 'probe kernel.function("sys_gettimeofday").return { \
printf("%d\t%s\n",pid(),execname()); }'

Running this shows the following output:

...
6959    gnome-screensav
6959    gnome-screensav
6959    gnome-screensav
6959    gnome-screensav
6609    Xorg
6609    Xorg
...

As you can see, I set a trap for when any gettimeofday() system call returns, and output the name and pid of the program. What I really wanted was a count of how many times an application called gettimeofday() over a short period of time. I only care about programs that call it more than once a second. So, what I did was turn it into a real script:

global progs

probe kernel.function("sys_gettimeofday").return {
  progs[execname()]++
}

probe timer.ms(5000) {
  exit()
}

probe end {
  log("PROGRAM\t\tCOUNT\tCOUNT/SEC")
  foreach (prog+ in progs) {
    rate = progs[prog] / 5
    if (rate > 1) {
      printf("%s\t%d\t%d\n", prog, progs[prog], rate)
    }
  }
}

This time things are a little more complex. We define a global associative array named “progs”, which is keyed off of the name of the program. Each time gettimeofday() returns, we increment progs[execname]. After 5 seconds, we exit, and display the list of programs and their associated counts that exceeded 1 call per second. This is what it looks like when executed:

% stap timeofday.stp
PROGRAM         COUNT   COUNT/SEC
Xorg    17      3
firefox-bin     21      4
gnome-panel     15      3
gnome-screensav 38      7
gnome-terminal  27      5
vmware-user     428     85

While vmware-user might call it quite a bit, it doesn’t look like anything on this system is abusing it. Lets move on.

Tracking bad stat() calls

If you’ve ever strace’d a python program, you’ll often notice dozen’s of stat() calls for library files which don’t exist. While stat() is cheap on local disks, it can be a real drag when NFS mounted resources are in use. For kicks, I decided to code something up that reported on stat() calls which returned an error. For whatever reason, getting return code’s for system calls is not very well documented under SystemTap. Here goes my bad-stats.stp script:

global last_stat

probe syscall.*stat {
  last_stat[pid()]=argstr
}
probe syscall.*stat.return {
  if (returnstr(1) != "0") {
    printf ("%s(%d) %s(%s) -> %s\n", execname(), pid(),
             probefunc(), last_stat[pid()], returnstr(1))
  }
}

probe timer.ms(10000) {
  exit()
}

If the code smells bad, it’s because it is. The reason I use *stat is because stat did not work (lstat? stat64?). The only time I can get the arguments to a system call appears to be at it’s inception. The only time I can get the error code to a system call appears to be at it’s return. So, I have an associative array that stores the last arguments passed to *stat() by the pid. Though unlikely, this assumption of the return corresponding to the last stat opened could possibly entertain a race condition.

I also could not figure out how to get an integer for the return code, which is why I compare returnstr(1) to “0″ instead of an integer comparison. In any case, it seems to work:

% stap bad-stats.stp
lsb_release(19298) sys_stat64("/usr/lib/python2.5/lib-tk/_struct", 0xbf9d84c4)
                            -> -2 (ENOENT)
lsb_release(19298) sys_stat64("/usr/lib/python2.5/lib-dynload/_struct", 0xbf9d84c4)
                            -> -2 (ENOENT)
lsb_release(19298) sys_stat64("/usr/bin/commands", 0xbf9e2154) -> -2 (ENOENT)
lsb_release(19298) sys_stat64("/usr/lib/python2.5/commands", 0xbf9e2154)
                            -> -2 (ENOENT)
sh(19299) sys_stat64("/sbin/apt-cache", 0xbfc2a78c) -> -2 (ENOENT)
sh(19299) sys_stat64("/bin/apt-cache", 0xbfc2a78c) -> -2 (ENOENT)
sh(19299) sys_stat64("/usr/sbin/apt-cache", 0xbfc2a78c) -> -2 (ENOENT)

I’ve only scratched the surface with SystemTap. For some real serious fun, checkout the histogram generation functions (@hist_linear, @hist_log) in the language reference.

Tags: technology

 

Trackbacks

(Trackback URL)

close Reblog this comment
blog comments powered by Disqus