2007-01-09 22:50 in /meta
I run my blosxom installation in a hybrid mode, attempting to strike a balance between dynamicism and performance. Since I want comments to show up immediately, individual entries are fully dynamic. The index pages are static files, though. Every 15 minutes, I pull from Subversion and do a partial regeneration of the static pages, which recreates the pages that are obviously related to any new entries: the front page, and the appropriate category and date pages. However, this leaves things in a slightly inconsistent state across the site, since things like the displayed numbers of comments and numbers of postings in each category will not be completely up to date. Also, the picture of the day is not always the same on all pages. To clean this up, once a day I run a complete regeneration of all pages.
Recently, I started getting cron emails saying that the nightly full-rebuild process was being killed. I tried running it manually and the same thing happened. The return code indicated a ‘kill -9’ so I started to suspect that Dreamhost is running some automated process reaper, despite the fact that they claim to have no hard CPU usage limits and that they will contact you if something you are running is causing a problem. Furthermore, trying out different levels of nicing the process suggested a particularly lame scheme, where any process that uses more than 1 minute of CPU time gets killed.
A support request confirmed that there is a CPU monitoring process, but claimed that it’s more sophisticated than my conjecture and based on “sustained high CPU usage”. Of course, even a niced process can use lots of CPU if there’s nothing else trying to run at a higher priority. So, given that I was still getting reaped, their suggestion was to throw in some sleeps, which is lame.
At this point, I started firing up the profiler, while the support representative escalated the request to tier 2. What I found was fairly interesting. Running a request for a single index page under DProf came up with this:
%Time ExclSec CumulS #Calls sec/call Csec/c Name 22.1 0.115 0.307 1 0.1147 0.3068 File::Find::_find_dir 21.9 0.114 0.192 3531 0.0000 0.0001 entries_index::__ANON__(fb) 11.5 0.060 0.099 10 0.0060 0.0099 blosxom::BEGIN 9.64 0.050 0.085 3282 0.0000 0.0000 File::stat::stat 6.75 0.035 0.035 3282 0.0000 0.0000 File::stat::new
What this implies is that the entries subroutine was examining 3531 files when it crawled my data directory. Since I only have 361 blog posts, this was a little odd. On investigation, it turned out that Subversion was to blame. While CVS creates 3 book-keeping files per directory, SVN has many more. Adding a preprocessing callback to File::Find that pruned .svn directories yielded a vast improvement in the number of files considered. But, sadly, not enough improvement to let the big rebuild finish. Which is not really that surprising, since generating many pages is likely to have different bottlenecks than generating just one.
Fortunately, DProf dumps profiling information as it runs, so even though I was getting a ‘kill -9’, I could still gather data on as much of the process as completed. Thus, I found the big killer, my own damn fault:
%Time ExclSec CumulS #Calls sec/call Csec/c Name 38.1 36.51 48.720 3198 0.0114 0.0152 readinglist::filter
When I first wrote readinglist, I didn’t bother to do any caching. And, at the time that was okay because filter routines were only run once per execution of blosxom. However, we changed that in 2.0.2 and started running filter for each page that gets generated so that certain plugins would work properly in static mode. Unfortunately, a consequence of that is that inefficiencies in filter subroutines are now much more of a concern, as seen above. Fortunately, implementing a simple cache was, well, simple, and I quickly lopped off that 40%. (I’ve been intending to release a new version of readinglist with some other improvements; all the more reason now.)
At this point, I was still getting killed by the watcher process, though; when I got an email from the tier 2 engineer. He said he’d looked at the logs and looked at the source code, and discovered that in the latest version, the code that granted leniency to niced processes had been removed inadvertantly. It’s been fixed on my hosting machine, and should get rolled out to all the machines shortly.
Although my immediate problem was thus resolved, I may continue these investigations at some point. The current profile looks like this:
%Time ExclSec CumulS #Calls sec/call Csec/c Name 26.9 16.48 62.890 3198 0.0052 0.0197 blosxom::generate 19.0 11.66 11.660 3199 0.0036 0.0036 Storable::pretrieve 9.24 5.650 8.350 419544 0.0000 0.0000 blosxom::nice_date 5.69 3.480 3.480 3198 0.0011 0.0011 readinglist::cache_valid 4.48 2.740 2.740 168676 0.0000 0.0000 IO::File::open 4.41 2.700 2.700 419544 0.0000 0.0000 Time::localtime::ctime 3.70 2.260 14.320 3198 0.0007 0.0045 categories::prime_cache 3.43 2.100 4.330 71198 0.0000 0.0001 flavourdir::__ANON__(102) 3.07 1.880 1.880 1754 0.0011 0.0011 blosxom::__ANON__(11c) 1.23 0.750 0.820 4626 0.0002 0.0002 seemore::story 1.21 0.740 0.740 3198 0.0002 0.0002 potd::cache_valid 1.06 0.650 0.650 375042 0.0000 0.0000 UNIVERSAL::can 1.01 0.620 4.130 3198 0.0002 0.0013 readinglist::filter 0.64 0.390 8.190 5262 0.0001 0.0016 readinglist::display 0.47 0.290 0.000 3199 0.0001 0.0000 Storable::_retrieve
Clearly, there’s some low hanging fruit here. We’re spending 15% of our time formatting dates. Recall that I only have 361 posts, so for some reason we’re doing this about a thousand times more than we need to. The calls to Storable also seem excessive. I think the categories plugin is to blame for that; although the same issue with 2.0.2 and filter subroutines may apply here. I’m also a little uncertain why readinglist::cache_valid is taking so much longer than potd::cache_valid when the code is basically identical. But, these are all questions for another day.