-
Haskell Web Spider, Part 2: HXT, or I Was Promised There Would Be No Side Effects
2007-09-30 11:30 in /tech/haskell
Once HaXML proved unsuitable for validating XTHML, I turned my attention to HXT, the Haskell XML Toolkit. While the API for HaXML looked pretty similar to what I might have designed myself, HXT has more of a learning curve. In particular, it is based on the arrow computational structure. Like monads, arrows require learning new syntax and a new conceptual model. Unlike monads, where tutorials are a dime a dozen, there’s little out there to help you learn to use arrows effectively. This is complicated by the fact that HXT extensively extends the base Arrow definition, with little additional documentation.
(My one sentence explanation of arrows is that they model computation as pipeline elements which can be performed in sequence or in parallel.)
Despite the paucity of documentation, I got much further along with HXT. In fact, I have a complete working program, except for a “but” that would satisfy Sir Mix-a-Lot. I’m going to show most of this program, with annotations, then explain how things go wrong.
type MyArrow b c = IOStateArrow (Set.Set String) b c runUrl url = runX (constA url >>> setTraceLevel 1 >>> withOtherUserState Set.empty >>> split >>> checkUrl )
HXT throws attempts at purity and separation of concerns to the wind, and pushes everything it does into an
IOStateArrow
(underneath which are the IO and State monads). The state is separated into a system state and a user state, which is()
by default. Because I’m going to want to track URLs that I’ve crawled, I specify a Set of Strings for my state. This code shoves the seed url into an arrow usingconstA
, enables a low level of tracing, and sets up my initial state. With that setup done, we can start doing some real work. (The split will become clear in a second.)checkUrl :: MyArrow (String,String) String -- (url, base) checkUrl = clearErrStatus >>> ifA (first seenUrl) (fst ^>> traceString 1 ("Skipping (seen already) " ++) >>> none) (first markSeen >>> ifP isLocal checkLocalUrl checkRemoteUrl)
This function checks to see if the URL has been seen before, by checking it against the Set in the user state, and if it has we emit a debugging message and then stop. (
none
is a function on ListArrows that essentially produces an empty list, signifying no more work to be done.) If this is a new URL, we mark it as seen, then branch based on whether it is a local URL or a remote URL. This is where the split above comes in — we’ll be keeping track of the previous URL which this on was linked from, in order to figure out when we are leaving the original website.The most mysterious part of
checkUrl
is the first line. Originally I did not have this, and I observed that the spider would run for a while, but terminate before the whole site was crawled. After adding some additional debugging statements, I discovered something which I am inclined to consider a bug in HXT. After a document is encountered with errors in validation, something gets set in the global error state which causes all further attempts to read in a document to fail silently. So, after the spider found it’s first errors, it would terminate shortly thereafter, as it wasn’t managing to pick up any new URLs to crawl. The addition ofclearErrStatus
before each new fetch prevents this failure.checkLocalUrl :: MyArrow (String, String) String checkLocalUrl = constructLocalUrl >>> split >>> first ( traceString 0 ("Checking " ++) >>> readFromDocument [] >>> selectLinks >>> traceString 1 ("Found link: " ++) ) >>> checkUrl selectLinks :: ArrowXml a => a XmlTree String selectLinks = deep (isElem >>> hasName "a" >>> getAttrValue "href" >>> mkText) >>> getText
checkLocalUrl
expands any relative URLs, then reads in the resulting URL and selects out any hrefs from the document. The result is a list of new URLs to crawl, paired with the URL of this document, which we pass recursively back intocheckUrl
. What’s implicit in this code is thatreadFromDocument
validates the document by default, and in addition to fetching the document itself also fetches the DTD including any subparts, thus avoiding the difficulties I had with HaXML. Somewhat oddly, the library simply prints the validation errors, rather than returning them from the function, but that’s something I can live with in this application. (I think it would be possible to specify an alternate error handler if you wanted to store the errors for later processing.)checkRemoteUrl
is not terribly interesting, and for the purposes of this exposition, you can just consider it to be a synonym fornone
.checkRemoteUrl = none
This code seems to be correct, BUT..... I set it running on my website and it chugs along for a while. Then it hits a particular page (of perfectly valid XHTML, by the way), and starts validation and just never stops. I let it run for about 40 minutes with the CPU pegged before killing the process. Some further investigations with a pared down version of the document showed that it’s not in an infinite loop, but that it’s in some nasty, presumably exponential, blowup in the regular expression engine. The source of this blowup is somehow non-local: removing either half of a list of paragraphs eliminates the problem, removing an enclosing div eliminates the problem, etc. The author of HXT chose to implement his own regex engine based on the concept of “derivatives of regular expressions”, which I take it are interesting academically but, it would seem, perhaps not ideal practically speaking.
At the moment, this is where I’m stuck. I’m pretty comfortable with the program as it stands, but the library is letting me down. Fortunately, this problem seems more tractable than the HaXML problem. The decision is then whether to a) wait for the maintainer to fix it, b) try to fix it myself, c) try to replace the custom regex engine in HXT with one of the standard implementations, or d) rip out the regular expressions from the validation entirely and replace them with standard string functions which would be guaranteed to perform linearly. The biggest obstacle to working on this myself is not actually the code, but the fact that it takes nearly an hour to recompile the library on my computer.
Verdict: HXT has a somewhat steep learning curve, and the API is a little rough around the edges in places, particularly the state handing parts. There is desperate need for a better, more comprehensive, tutorial. This library was written as someone’s masters thesis, and the code has the look of something which has never seen serious code review (e.g., lots of typos, even in function names). I can see no good reason to reimplement regular expressions for this task. Actually, I can see no good reason to use regular expressions at all for this task. This portion of the code should be completely overhauled. On the other hand, it is easy to add debugging statements and to thread state through the computation, and the arrow API has a certain elegance to it.
To come: A resolution? Parallelization?
Leave a comment
Please use plain text only. No HTML tags are allowed.