Haskell Web Spider, Part 3: More HXT
2007-11-01 12:20 in /tech/haskell
In my last post about HXT, I had gotten stuck on a performance problem in HXT that rendered it unusable. Since then, I’ve exchanged a number of emails with Uwe Schmidt, the maintainer. He found where the exponential blowup was happening in the regex engine and fixed that problem. With that fix, my spider ran for a bit longer, but eventually failed after hitting the per-process limit on open file descriptors. I tried adding strictA in a couple of places in the code, but it did not resolve the resource leak. Uwe believes this is a bug in Network.HTTP, and suggested the a_use_curl option to spawn an external curl program to do the fetching. While it sucks to be spawning hundreds of processes for this task, it did fix the resource leak.
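Switching the fetcher over is just another entry in the option list that gets passed to HXT’s readDocument. Here’s a minimal sketch of what I mean, assuming HXT’s usual (name, value) option pairs; it’s illustrative, not my actual spider code:

```haskell
import Text.XML.HXT.Arrow

-- Sketch: read one page with the external-curl fetcher turned on.
-- v_1 / v_0 are HXT's string constants for switching options on and off.
fetchPage :: String -> IOStateArrow s b XmlTree
fetchPage url =
    readDocument [ (a_use_curl,       v_1)  -- shell out to curl instead of Network.HTTP
                 , (a_parse_html,     v_1)  -- parse liberally as HTML
                 , (a_issue_warnings, v_0)  -- keep the output quiet
                 ] url
```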
With those problems out of the way, I was able to focus on some issues in my own program, like trying to validate JPG images as XML, or trying to fetch mailto: links. I’m now reasonably happy with the program, which you can see in the HXT/Practical section of the Haskell wiki.

The major area where this could still be improved is parallelization. Verifying the roughly 700 pages and links on my site takes 45 minutes, during which the program is only doing work for about 8; the rest is spent waiting on the network. It would definitely be a good exercise to learn more about the concurrency capabilities of Haskell, although the hidden system state in HXT makes me nervous about whether it’ll work at all. I’d probably want to do a couple of simpler exercises in concurrent programming first, before attempting to parallelize this one.
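If I do get to it, the shape I have in mind is the standard worker-pool pattern from Control.Concurrent. Nothing in this sketch is HXT-specific, and checkUrl is a stand-in for whatever the per-URL check ends up being:

```haskell
import Control.Concurrent
import Control.Monad (forM_, replicateM_)

-- Run checkUrl over the URL list with at most n threads in flight.
checkMany :: Int -> (String -> IO ()) -> [String] -> IO ()
checkMany n checkUrl urls = do
    work <- newChan
    done <- newChan
    forM_ urls $ writeChan work . Just
    replicateM_ n $ writeChan work Nothing   -- one stop marker per worker
    replicateM_ n $ forkIO $ worker work done
    replicateM_ n $ readChan done            -- block until every worker finishes
  where
    worker work done = do
        job <- readChan work
        case job of
            Nothing  -> writeChan done ()
            Just url -> checkUrl url >> worker work done
```

Since most of those 45 minutes are spent blocked on the network, even a handful of workers should cut the wall-clock time dramatically. The open question is whether HXT’s hidden state tolerates being driven from several threads at once.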
I have a few remaining complaints / suggestions for HXT. One, which I believe Uwe is already thinking about adding, is an option for adapting the parsing based on the content type in the response. Currently, you have to specify HTML or XML parsing in the call to readDocument. This is not terribly useful in an application like this one. It would be much nicer if HXT used XML parsing when the content type is XML, HTML parsing when it’s text/html, and complained about anything else (like image/jpeg).
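The dispatch I’m imagining is simple, something like the following (chooseParser is hypothetical, not part of HXT):

```haskell
import Text.XML.HXT.Arrow

-- Hypothetical: pick parser options from the response's MIME type,
-- refusing anything that isn't markup.
chooseParser :: String -> Maybe Attributes
chooseParser mime
    | mime == "text/html"                           = Just [(a_parse_html, v_1)]
    | mime `elem` ["text/xml", "application/xml"]   = Just [(a_parse_html, v_0)]
    | otherwise                                     = Nothing  -- e.g. image/jpeg: report an error
```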
Another frustration I had was that tracking parsing and validation errors back to their source was very difficult in some cases. A missing end tag frequently doesn’t produce a parse error until much later in the document. The validator would catch this much earlier, but HXT does parsing and validation in two separate passes. One can insert missing end tags at the point of the parse error and then look at the resulting validation error, but the tree that the validator operates on doesn’t carry any line or column information, so you can’t easily track a validation error down to a specific location in the source file. Presumably the nodes in the tree could be augmented with this data fairly simply, but the other shortcomings of the two-pass approach are undoubtedly much more difficult to fix.