sprocket i/o

thomas stromberg on technology, nature, and motorcycles

sprocket i/o header image 2

Clever Assumptions in Python

February 26th, 2009 · Comments

Costumes at the Aalst Carnival Parade

Today I had one of those experiences that shows yet again that clever code can sometimes make the worst assumptions. I’ve got this project at work, which is a web application that generates pretty graphs and tables based on SQL queries to a handful of databases. The users have a fair amount of control over the queries, mostly to handle different geographic locations, date ranges, and how to group the results.

This is a long geeky post, so I’ve hid the meat just behind the cut…

Caching, Take 1 (memcached)

The fastest queries ran on a data warehousing system, but the slowest often directed to an overloaded set of MySQL replicas. Depending on what the user selects on a given page, certain queries can be take as long as a minute to execute! So, like any basic web application, we cache the results using memcached. memcached is a wicked-fast in-memory object caching system that was designed for Livejournal, and used by Facebook, Wikipedia, and dozens of other big names. The fast SQL queries had a TTL (time-to-live) of 2 hours, the slower queries had a TTL of up to 24 hours. This insured that if the same slow query was generated twice in a day by different users, it would serve the second user instantly.

Caching, Take 2 (memcached + crontab)

The next obvious step was to make a simple background task that made sure to hit the default page views a few times a day, to make sure the cache was fresh. This helped, but there was still a race condition: If the data for a certain query expires at 8:01am, and your background task to refresh the data runs on the hour, at 8:00am it will see that the query is in cache, and for the next 59 minutes, the user will have to wait for the page to generate. We ran the refresh step once every 2 hours, which made it even worse.

Caching, Take 3 (adding refresh)

Often when I think about a complex systems problem, I think about how well-known internet software or protocols handle it. When I wrote a message queueing engine, I based it’s structure and design on Sendmail and Postfix. When I think about caching data, I think about HTTP and DNS. I recalled that DNS zones have two different SOA fields: refresh, and expiry. The expiry in my case was the TTL. In DNS, the refresh is used only by the slaves to refresh from the master in the background. In my case, I was going to implement refresh for the end-user: deliver stale data, but refresh it in the background.

For example: There is a known slow query who has a refresh setting 8 hours, and a TTL of 24 hours. If another user, or the background refresh task hits this page at 8:05am, it will show the user data that is 8 hours old. In the background, it will force a refresh of the requested data, because it may be used again soon.

This method eliminated the race condition, and effectively caused it so only new and unique queries had a cache miss.

Backround-Refresh Pseudo-code

In order to implement the background refresh, I used Python’s ‘threading’ library. It’s one of the nicest threading implementations I have used actually. My pseudo-code looked like this:

MAX_THREAD_COUNT = 8

data = self._GetCachedData(key)
if data:
  age = datetime.datetime.now() - data.generated_date
  if age > refresh and threading.activeCount() < MAX_THREAD_COUNT:
    log_info("Refreshing data for %s" % obj)
    RefreshDataInBackgroundThread().start()
  else:
    self.GenerateAndCacheData(obj)

What went wrong?

The background refresh threads worked great for six months. Then users began to complain that the site was slow. I first assumed that they were running unusual queries that were always delivered uncached. Eventually, I capitulated and looked at the logs. There was no line that said "Refreshing data". The module tests ran fine, why weren't we doing any refreshes? My first thought was that we weren't writing info messages to the log file, but we were. The next I did was add some better error messages:

  if age > refresh:
   if threading.activeCount() < MAX_THREAD_COUNT:
      log_info("Refreshing data for %s (threads=%s)" % (obj, threading.activeCount())
      RefreshDataInBackgroundThread().start()
  else:
      log_info("Could not refresh %s (%s of %s threads in use)" %
                      (obj, threading.activeCount(), MAX_THREAD_COUNT))

When I ran this code in my development environment, it soon became apparent that something fishy was going on. threading.activeCount() was reporting that I had 9 threads running on the first hit, more than the 0 I was expecting, and more than the MAX_THREAD_COUNT I had implemented to prevent a denial of service attack. This code used to work just fine -- what happened? What the heck were all of those threads? I checked the python documentation:

>>> import threading
>>> help(threading.activeCount)
Help on function activeCount in module threading:

activeCount()

That wasn’t very useful, so I googled for “python threading”, and got the Python 2.6.1 threading documentation, which said:

activeCount():: Return the number of Thread objects currently alive. The returned count is equal to the length of the list returned by enumerate().

That sounded right to me. Why did I have 9 threads when I did not actually start any?

Enter dummyThreads

After reading the reference to threading.enumerate(), I added it to the log_info invocation. What I saw were mostly dummyThreads, but also a main thread. I wasn’t using threading anywhere else in the web application. Looking at the documentation for enumerate, I realized that all was not as it seemed:

enumerate(): Return a list of all Thread objects currently alive. The list includes daemonic threads, dummy thread objects created by current_thread(), and the main thread. It excludes terminated threads and threads that have not yet been started.

A little bit more googling revealed the truth of these dummy threads, and I soon groked it’s goodness:

There is the possibility that “dummy thread objects” are created. These are thread objects corresponding to “alien threads”. These are threads of control started outside the threading module, such as directly from C code. Dummy thread objects have limited functionality; they are always considered alive, active, and daemonic, and cannot be join()ed. They are never deleted, since it is impossible to detect the termination of alien threads.

The light bulb went off in my head: these dummy threads were actually webserver threads. This now explained why this problem had suddenly come up: The web application was originally configured to only use 5 threads, but as it got more popular, we increased it to 16 threads. With 5 dummy threads and one MainThread, there was room still for two refresh threads, which was honestly more than enough. Why hadn’t we seen this earlier?

The fix is in..

So, our logic in counting the number of refresh threads was apparently wrong. Reading up on the documentation, I discovered that you could name threads, and with that, using enumerate() became useful. I added a function:

def refreshThreadCount():
  thread_names = [ x.getName() for x in threading.enumerate() ]
  return thread_names.count('RefreshThread')

I changed all occurances of threading.activeCount() to use this function instead, wrote some comprehensive tests, and all was well. I felt good, but a bit stupid for not having caught this earlier.

The future

The periodic task that hits the default page views is still a bad hack. Instead, I should be looking at what queries are the most popular and time consuming, and make sure those are the most fresh. This in combined with moving more of the slow queries to make use of the data warehouses, should make things run a bit more smoothly.

Tags: other

 

Trackbacks

(Trackback URL)

close Reblog this comment
blog comments powered by Disqus