Haskell Web Spider, Part 1: HaXML
2007-09-29 23:40 in /tech/haskell
I’ve been working on writing a validating web spider in Haskell, mostly because it’s something that would be useful to me: I want to be able to validate the markup on my entire web site, and the external services for (X)HTML validation usually stop crawling after about 100 pages. I also want to learn Haskell better. Of course, I recognize that this type of task is not Haskell’s forte, and that’s exactly why I chose it: I want to push the edges of what the language is good at to understand its weaknesses as a general-purpose language, as well as its strengths. Along the way, I’ve been learning a lot about monads, arrows, error handling, and managing state, as well as about the helpfulness of the community and the general quality of the libraries.
There are three XML libraries for Haskell: HXML, HaXML, and HXT. HXML does not provide any validation functions, so it was immediately out of the running. HaXML seems relatively simple and was the next library I located, so I first attempted to use it for this task. I’ll talk about HXT in the next article in this series. This presentation is not strictly chronological, as I’ve bounced back and forth between HaXML and HXT when I’ve hit problems with one or the other.
HaXML is fairly minimalist. It doesn’t deal with fetching content off the net, and it doesn’t force you into IO or any other monad. On the one hand, this seems like good library design, but it also leads to some shortcomings. A big one is that it’s hard to debug relative to HXT, because you can’t just toss in trace messages at will, and, for added frustration, the data types don’t derive `Show`, so even once you get them to the outer layers of your program and have access to the IO monad, you still can’t easily display them.

Despite this, the naive approach to fetching, parsing, and validating a document seems simple: use the `simpleHTTP` function from `Network.HTTP` to fetch a URL, then call `xmlParse` on the content, and then `validate` on the resulting DOM. (I’m skipping over handling all the error cases which can arise.)

```haskell
checkUrl :: String -> IO [String]
checkUrl url = do
    contents <- getUrl url
    case parse url contents of
        Left msg         -> return [msg]
        Right (dtd, doc) -> return $ validate dtd doc

parse :: String -> String -> Either String (DocTypeDecl, Element)
parse file content =
    case dtd of
        Left msg   -> Left msg
        Right dtd' -> Right (dtd', root)
  where
    doc  = xmlParse file content
    root = getRoot doc
    dtd  = getDtd doc
```
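For reference, here is roughly what the helper functions assumed above might look like. These are my own sketches, not part of HaXml: `getUrl` wraps the HTTP package’s `simpleHTTP`, while `getDtd` and `getRoot` pattern-match on HaXml’s `Document` and `Prolog` types, whose exact constructor layouts vary between HaXml versions, so treat the patterns as illustrative.

```haskell
-- Sketches of the helpers the snippet above assumes; these names are
-- mine, not HaXml's. The constructor layouts below follow one HaXml
-- version and may need adjusting for another.
import Network.HTTP (getRequest, getResponseBody, simpleHTTP)
import Text.XML.HaXml.Types

-- Fetch a page and hand back its body as a String.
getUrl :: String -> IO String
getUrl url = simpleHTTP (getRequest url) >>= getResponseBody

-- The DOCTYPE declaration, if any, lives in the document prolog.
getDtd :: Document -> Either String DocTypeDecl
getDtd (Document (Prolog _ _ mDtd _) _ _ _) =
    case mDtd of
        Just dtd -> Right dtd
        Nothing  -> Left "document has no DOCTYPE declaration"

-- The root element of the parsed document.
getRoot :: Document -> Element
getRoot (Document _ _ root _) = root
```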
Unfortunately, this fails miserably and reports that every single element and attribute in the document is invalid.
I puzzled over this for a while before I realized that while my XML document contained a `DocTypeDecl`, that object was useless for validating: it held only the identifier information for the DTD, none of its contents. Notice that we never actually fetched the DTD from the network in the above sequence. Once I realized this, I added a little more code to get the system identifier out of the DTD declaration, fetch that document, run it through `dtdParse`, and then validate using the resulting `DocTypeDecl`.

```haskell
checkUrl :: String -> IO [String]
checkUrl url = do
    contents <- getUrl url
    case parse url contents of
        Left msg         -> return [msg]
        Right (dtd, doc) -> getDtdAndValidate dtd doc

getDtdAndValidate :: DocTypeDecl -> Element -> IO [String]
getDtdAndValidate (DTD name mID _) doc =
    case mID of
        Nothing  -> return ["No external ID for " ++ name]
        Just eid -> fetchExternal eid
  where
    fetchExternal :: ExternalID -> IO [String]
    fetchExternal eid = do
        dtd <- fetchDtd (systemLiteral eid)
        return $ validate dtd doc

    systemLiteral (SYSTEM (SystemLiteral s))   = s
    systemLiteral (PUBLIC _ (SystemLiteral s)) = s

fetchDtd :: String -> IO DocTypeDecl
fetchDtd dtd = do
    contents <- getUrl dtd
    case dtdParse dtd contents of
        Nothing   -> error "No DTD"
        Just dtd' -> return dtd'
```
This would probably work for many cases, but it fails for XHTML because the DTD is actually split across multiple files. As a result, when the parser encounters an ENTITY declaration pointing at an external file, it fails with:

```
*** Exception: xhtml-lat1.ent: openFile: does not exist (No such file or directory)
```

This appears to be a dead end. Short of parsing the DTD myself, finding the external references, fetching them, and doing the substitution, there seems to be no way around the problem.
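Had I pursued that workaround, the first step — finding the external references in the DTD source — is at least straightforward. Here is a minimal sketch (my own code, nothing from HaXml) that scans DTD text for `SYSTEM` and `PUBLIC` keywords and pulls out the system literal naming the file to fetch; a real implementation would also need to skip comments and quoted text rather than match keywords blindly.

```haskell
import Data.List (isPrefixOf, tails)

-- Extract the system literals of external entity declarations from raw
-- DTD text. Forms handled: SYSTEM "sys" and PUBLIC "pub" "sys".
-- Deliberately naive: it doesn't skip comments or nested quoting.
externalEntityUris :: String -> [String]
externalEntityUris dtdSrc = concat [ pick t | t <- tails dtdSrc ]
  where
    -- Every double-quoted string from this point onward (lazily).
    quoted :: String -> [String]
    quoted s = case dropWhile (/= '"') s of
        ('"':rest) -> takeWhile (/= '"') rest
                      : quoted (drop 1 (dropWhile (/= '"') rest))
        _          -> []

    -- SYSTEM: first quoted string is the system literal.
    -- PUBLIC: the public id comes first, the system literal second.
    pick t
        | "SYSTEM" `isPrefixOf` t = take 1 (quoted t)
        | "PUBLIC" `isPrefixOf` t = take 1 (drop 1 (quoted t))
        | otherwise               = []
```

Run over xhtml1-strict.dtd, this should surface names like xhtml-lat1.ent, xhtml-symbol.ent, and xhtml-special.ent; each would then have to be fetched and textually spliced in before handing the combined result to `dtdParse`.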
Verdict: the API of HaXML looks more or less like you’d expect, although it would be really nice to derive `Show`, at least for the major data types, and the addition of named fields would make things a little easier for users. Unfortunately, what I would consider a common use case seems not to have been considered, which leaves the library limited in utility. I also can’t speak to the quality of the filter portion of the library: since I couldn’t validate my seed document, there was no point in extracting links from it to crawl further.

To come: adventures with HXT...