2009-05-02 22:00 in /meta
I took a bit of a blogging hiatus. Not that I ran out of things to say; rather, there were technical difficulties. The short version, because I’m sure nobody cares, is this: last fall I retired my last non-Intel Mac. My blogging workflow involves Emacs PSGML mode, and makes use of a couple of obscure corners of that package. Somehow, on Intel Macs, one of those corners was busted. Don’t ask me why. I vaguely intended to try to debug the problem, but never got around to it. Last week, I decided I had something valuable enough to say that I would — the horrors — undertake to write valid XHTML without editor support. And, lo and behold, everything worked perfectly again. I haven’t updated Emacs. I’m speculating that some software update fixed whatever was the root cause of this rather obscure problem. But, yeah, I really don’t know. Anyway, I’m back!
2007-08-11 19:10 in /meta
I’d like to apologize for the spurious, empty entries which have been showing up in my RSS feed. I’ve been noticing this for some time, and made some attempts to understand why but came up empty. Today it finally bugged me enough to dig deeper and I found the problem.
A few months ago, I was doing some performance profiling and optimization of blosxom. I found a few major inefficiencies and made some changes to address them. Unfortunately, optimization can be dangerous: my caching implementation for readinglist had a race condition, which meant that sometimes when a new book was added, the code that keeps it from being rendered if it has no body was skipped. I committed two sins that kept me from figuring out the problem sooner: I was working directly on the live site, and I didn’t check my changes into Subversion. So, when I tried to reproduce the problem in my test environment, I couldn’t. The indirect cause was another sin: over-optimization. I knew I wanted a cache to keep the same work from being done over and over during static page generation. The appropriate level of optimization would have been an in-memory cache for the life of the process. Instead, I created a persistent cache on the filesystem, leading to a race when a file is created but the stale cache from an old run is used.
For the moment, I’ve disabled the cache entirely. I think I can restore the more limited caching easily, but this time I’m going to try it on my test environment first!
2007-01-12 19:40 in /meta
The switch to feedback allowed me to easily close comments on articles after a couple of months. This is the single most effective anti-spam measure I’ve tried. Before this change, I was getting about 300 spam comments a day, with Akismet achieving 98-99% effectiveness. That still left me with a handful of spams to clean up manually each day.
With feedback preemptively rejecting any comment submissions on old posts, I now only get 1-2 spams a day that are submitted to Akismet, and it has caught all of them so far.
As an aside, this seems to suggest a shortcoming in Akismet’s learning algorithms. The Akismet query API includes the article permalink, and almost all of my spam was on 3 or 4 articles. Since I was submitting feedback to Akismet on each of the spams it missed, it really should have learned that comments on those articles were spam with extremely high probability. In practice, though, I never observed a decline in the false negative rate on those articles.
2007-01-09 22:50 in /meta
I run my blosxom installation in a hybrid mode, attempting to strike a balance between dynamism and performance. Since I want comments to show up immediately, individual entries are fully dynamic. The index pages are static files, though. Every 15 minutes, I pull from Subversion and do a partial regeneration of the static pages, recreating the pages that are obviously related to any new entries: the front page, and the appropriate category and date pages. However, this leaves things in a slightly inconsistent state across the site, since things like the displayed numbers of comments and numbers of postings in each category will not be completely up to date. Also, the picture of the day is not always the same on all pages. To clean this up, once a day I run a complete regeneration of all pages.
Recently, I started getting cron emails saying that the nightly full-rebuild process was being killed. I tried running it manually and the same thing happened. The return code indicated a ‘kill -9’ so I started to suspect that Dreamhost is running some automated process reaper, despite the fact that they claim to have no hard CPU usage limits and that they will contact you if something you are running is causing a problem. Furthermore, trying out different levels of nicing the process suggested a particularly lame scheme, where any process that uses more than 1 minute of CPU time gets killed.
A support request confirmed that there is a CPU monitoring process, but claimed that it’s more sophisticated than my conjecture and based on “sustained high CPU usage”. Of course, even a niced process can use lots of CPU if there’s nothing else trying to run at a higher priority. So, given that I was still getting reaped, their suggestion was to throw in some sleeps, which is lame.
At this point, I started firing up the profiler, while the support representative escalated the request to tier 2. What I found was fairly interesting. Running a request for a single index page under DProf came up with this:
%Time ExclSec  CumulS #Calls sec/call Csec/c Name
 22.1   0.115   0.307      1   0.1147 0.3068 File::Find::_find_dir
 21.9   0.114   0.192   3531   0.0000 0.0001 entries_index::__ANON__(fb)
 11.5   0.060   0.099     10   0.0060 0.0099 blosxom::BEGIN
 9.64   0.050   0.085   3282   0.0000 0.0000 File::stat::stat
 6.75   0.035   0.035   3282   0.0000 0.0000 File::stat::new
What this implies is that the entries subroutine was examining 3531 files when it crawled my data directory. Since I only have 361 blog posts, this was a little odd. On investigation, it turned out that Subversion was to blame. While CVS creates 3 book-keeping files per directory, SVN has many more. Adding a preprocessing callback to File::Find that pruned .svn directories yielded a vast improvement in the number of files considered. But, sadly, not enough improvement to let the big rebuild finish. Which is not really that surprising, since generating many pages is likely to have different bottlenecks than generating just one.
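That pruning callback is only a few lines. A minimal sketch, with find_entries() as an illustrative wrapper rather than blosxom’s actual entries code:

```perl
use strict;
use warnings;
use File::Find;

# Sketch of the fix: give File::Find a preprocess callback that prunes
# .svn directories, so their bookkeeping files are never visited or
# stat()ed. find_entries() is an illustrative name, not blosxom's code.
sub find_entries {
    my ($datadir) = @_;
    my @entries;
    find(
        {
            # preprocess gets each directory's listing and returns the
            # subset that File::Find should actually descend into.
            preprocess => sub { grep { $_ ne '.svn' } @_ },
            wanted     => sub { push @entries, $File::Find::name if -f },
        },
        $datadir
    );
    return @entries;
}
```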
Fortunately, DProf dumps profiling information as it runs, so even though I was getting a ‘kill -9’, I could still gather data on as much of the process as completed. Thus, I found the big killer, my own damn fault:
%Time ExclSec  CumulS #Calls sec/call Csec/c Name
 38.1   36.51  48.720   3198   0.0114 0.0152 readinglist::filter
When I first wrote readinglist, I didn’t bother to do any caching. And, at the time that was okay because filter routines were only run once per execution of blosxom. However, we changed that in 2.0.2 and started running filter for each page that gets generated so that certain plugins would work properly in static mode. Unfortunately, a consequence of that is that inefficiencies in filter subroutines are now much more of a concern, as seen above. Fortunately, implementing a simple cache was, well, simple, and I quickly lopped off that 40%. (I’ve been intending to release a new version of readinglist with some other improvements; all the more reason now.)
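The cache itself can be as simple as a package-level variable that survives across the per-page filter calls. A sketch under assumed names — parse_booklist() stands in for readinglist’s real work, and is not the actual plugin code:

```perl
use strict;
use warnings;

# Sketch of the cache that lopped off that 40%: do the expensive work
# once per process and reuse it on every subsequent filter() call.
my $booklist;        # package-scoped; persists across per-page filter calls
my $parse_count = 0; # only here to demonstrate the work happens once

sub parse_booklist {
    $parse_count++;
    # ... read and parse the reading-list data file here ...
    return { books => [] };
}

sub filter {
    # blosxom 2.0.2+ calls filter once per generated page in static mode,
    # but the parse now only happens on the first call.
    $booklist ||= parse_booklist();
    return 1;
}
```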
At this point I was still getting killed by the watcher process, though. Then I got an email from the tier 2 engineer. He said he’d looked at the logs and at the source code, and discovered that in the latest version, the code that granted leniency to niced processes had been inadvertently removed. It’s been fixed on my hosting machine, and should get rolled out to all the machines shortly.
Although my immediate problem was thus resolved, I may continue these investigations at some point. The current profile looks like this:
%Time ExclSec  CumulS #Calls sec/call Csec/c Name
 26.9   16.48  62.890    3198   0.0052 0.0197 blosxom::generate
 19.0   11.66  11.660    3199   0.0036 0.0036 Storable::pretrieve
 9.24   5.650   8.350  419544   0.0000 0.0000 blosxom::nice_date
 5.69   3.480   3.480    3198   0.0011 0.0011 readinglist::cache_valid
 4.48   2.740   2.740  168676   0.0000 0.0000 IO::File::open
 4.41   2.700   2.700  419544   0.0000 0.0000 Time::localtime::ctime
 3.70   2.260  14.320    3198   0.0007 0.0045 categories::prime_cache
 3.43   2.100   4.330   71198   0.0000 0.0001 flavourdir::__ANON__(102)
 3.07   1.880   1.880    1754   0.0011 0.0011 blosxom::__ANON__(11c)
 1.23   0.750   0.820    4626   0.0002 0.0002 seemore::story
 1.21   0.740   0.740    3198   0.0002 0.0002 potd::cache_valid
 1.06   0.650   0.650  375042   0.0000 0.0000 UNIVERSAL::can
 1.01   0.620   4.130    3198   0.0002 0.0013 readinglist::filter
 0.64   0.390   8.190    5262   0.0001 0.0016 readinglist::display
 0.47   0.290   0.000    3199   0.0001 0.0000 Storable::_retrieve
Clearly, there’s some low-hanging fruit here. We’re spending 15% of our time formatting dates. Recall that I only have 361 posts, so for some reason we’re doing this about a thousand times more often than we need to. The calls to Storable also seem excessive; I think the categories plugin is to blame for that, although the same 2.0.2 filter-subroutine issue may apply here too. I’m also a little uncertain why readinglist::cache_valid is taking so much longer than potd::cache_valid when the code is basically identical. But these are all questions for another day.
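On the date-formatting front, nice_date is pure (the same timestamp always yields the same strings), so a memo table would collapse those 419,544 calls to roughly one per distinct timestamp. This sketch mimics the shape of blosxom’s nice_date but is my own illustration, not a patch to it:

```perl
use strict;
use warnings;
use Time::localtime;   # provides the ctime() that blosxom's nice_date uses

my %date_cache;        # unixtime => parsed date parts

sub nice_date_cached {
    my ($unixtime) = @_;
    # Only do the ctime() call and regex match once per distinct timestamp.
    $date_cache{$unixtime} ||= do {
        my $c_time = ctime($unixtime);   # e.g. "Tue Jan  9 22:50:00 2007"
        my ($dw, $mo, $da, $hr, $min, $yr) =
            $c_time =~ /(\w{3}) +(\w{3}) +(\d{1,2}) +(\d{2}):(\d{2}):\d{2} +(\d{4})/;
        [ $dw, $mo, $da, $yr, "$hr:$min" ];
    };
    return @{ $date_cache{$unixtime} };
}
```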
2007-01-08 22:22 in /meta
I’ve switched from using my variant of the writeback plugin to Frank Heckler’s feedback plugin for comments. Overall, I prefer the design (it matches changes I’ve contemplated for writeback) and I like the added features like moderation, auto-closing of comments after a period of time, previewing, and Markdown support (although I haven’t enabled that yet). Readers will no doubt appreciate the fact that their paragraphs no longer get munged together. For the moment, I’m leaving moderation turned on, but I expect to turn it off shortly, once I’ve confirmed that everything is working smoothly, and that the spam concerns are well taken care of.
2006-07-19 23:18 in /meta
Blosxom 2.0.2 went up on Sourceforge a couple days ago. It’s got a couple bug fixes for static mode and more esoteric setups, like running blosxom as an SSI, and some aesthetic improvements. There seems to be some renewed momentum on the developer list, so hopefully we’ll see some more regular updates in the future.
2006-06-30 10:47 in /meta
While wading through the hundreds of comment spams that got through Akismet, I noticed a common pattern that explains why other people don’t seem to be seeing such a severe deluge. The commenter name in all cases was “ ”. At first I thought this was a bug where writebacks didn’t recognize pure whitespace as an empty name, but when I went and looked at the code I found that even though the default templates suggest that some fields are required and others optional, in reality the code doesn’t check to make sure anything is there.
The fix was relatively simple; I hacked it out last night, and I haven’t seen any spam get through since (versus about 350 yesterday). The fixed version is released as writeback_akismet-0.0.2.
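The check itself is tiny: a field should count as present only if it contains at least one non-whitespace character. A sketch, with validate_writeback() as an illustrative name rather than the plugin’s actual code:

```perl
use strict;
use warnings;

# Sketch of the missing validation: reject a writeback unless every
# required field contains real content. A name of " " fails /\S/, which
# is exactly the case the old code let through.
sub validate_writeback {
    my (%field) = @_;
    for my $required (qw(name body)) {
        # defined() alone is not enough: pure whitespace must also fail.
        return 0 unless defined $field{$required} && $field{$required} =~ /\S/;
    }
    return 1;
}
```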
2006-05-06 14:57 in /meta
There was a question recently on the blosxom mailing list about dealing with comment spam, and I asked if anyone had looked into integrating Akismet into any of the commenting plugins. There was no response, but I noticed that there is a Net::Akismet perl module already, so I decided to give it a try.
For the impatient, the result is here: writeback_akismet.
The evolution is a little interesting. I’ve realized for a while that the idea of orthogonal, independent plugins sort of falls apart for commenting. I’ve been using writebackplus and wbnotify, with slight modifications to deal with my use of date-based permalinks, and to incorporate my previous anti-spam technique (what’s my name?). One of the unsavory things about that setup was that I had to do the spam check in both modules. Now that I was going to require a network request to decide spamminess, that seemed unacceptable, so I merged the wbnotify code into the main writebacks module.
Shortly thereafter, I discovered a bug in wbnotify. Specifically, it only sends notifications for comments, but not for trackbacks. I always realized that the old technique didn’t prevent trackback spam, but I put off dealing with that because I thought I wasn’t getting any. I quickly started getting notifications of trackback spams, though, and grovelling over my writebacks directory, it turns out I’ve been getting a fair bit of it, for quite a while.
The integration of Akismet into the plugin was pretty simple. The only wrinkle was that Net::Akismet as distributed on CPAN will only install on Perl 5.8.5 or higher, although it actually works on 5.6 and above. For this reason, I actually copied the whole package into my plugin. The developer of Net::Akismet has indicated to me that he’ll fix the problem, but it hasn’t been updated yet.
On relative performance, my old technique never let through any comment spam (0% false negatives) and had an acceptably low false positive profile. All legitimate human commenters figured out how to post, if not on the first try then on the second. Of course, for trackback spam it had a 100% false negative rate: all spam got through. I have no idea if I’ve ever gotten a non-spam trackback. So far with Akismet I have seen no false positives, although my rate of legitimate comments is low enough that I can’t make a strong statement there. There are definitely some false negatives, though. It seems like perhaps 5% of spam is getting through.
Long term, I would like to switch to using the feedback plugin for commenting. Overall, the design seems better, and I could piggyback on the moderation system to send “spam” and “ham” messages back to Akismet on its mistakes. Currently, I’m trying to write some little command line utilities to do the same thing, but I haven’t had enough time to finish them yet.
2005-11-27 21:56 in /meta
A little earlier today I uploaded version 2.0.1 of Blosxom. This is basically just a couple bug fixes I’ve collected over the years. There should be more substantial improvements to come, but it seemed right to start small and get this stuff out of the way.
You can get it here
2005-08-06 23:24 in /meta
I’ve started getting hammered with comment spam. A small trickle over the last week, and then dozens today. Fortunately, I set up the “simplest thing that could possibly work” protection, requiring people to enter my name, as soon as it started, and that seems to be holding back the flood. The annoying thing was that I kept getting emails from wbnotify on all of them, so I had to keep going to check if things were getting through. I fixed that, but it required some cut-and-paste hackery. I may have to give some more thought to how the writeback plugin ought to really work. It seems pretty clear that it needs its own plugin system to address issues like this.