Haskell Web Spider, Part 1: HaXML
2007-09-29 23:40 in /tech/haskell
I’ve been working on writing a validating web spider in Haskell, mostly because it’s something that would be useful to me: I want to be able to validate the markup on my entire web site, and the external services for (X)HTML validation usually stop crawling after about 100 pages. I also want to learn Haskell better. Of course, I recognize that this type of task is not Haskell’s forte, and that’s exactly why I chose it: I want to push the edges of what the language is good at to understand its weaknesses as a general-purpose language, as well as its strengths. Along the way, I’ve been learning a lot about monads, arrows, error handling, and managing state, as well as about the helpfulness of the community and the general quality of the libraries.
There are three XML libraries for Haskell: HXML, HaXML, and HXT. HXML does not provide any validation functions, so it was immediately out of the running. HaXML seems relatively simple and was the next library I located, so I first attempted to use it for this task. I’ll talk about HXT in the next article in this series. This presentation is not strictly chronological, as I’ve bounced back and forth between HaXML and HXT when I’ve hit problems with one or the other.
HaXML is fairly minimalist. It doesn’t deal with fetching content off the net, and it doesn’t force you into IO or any other monad. On the one hand, this seems like good library design, but it also leads to some shortcomings. A big one is that it’s hard to debug relative to HXT, because you can’t just toss in trace messages at will, and, for added frustration, the data types don’t derive `Show`, so even once you get them to the outer layers of your program and have access to the IO monad, you still can’t easily display them.

Despite this, the naive approach to fetching, parsing, and validating a document seems simple: use the `simpleHTTP` function from `Network.HTTP` to fetch a URL, then call `xmlParse` on the content, and then `validate` on the resulting DOM. (I’m skipping over handling all the error cases which can arise.)

```haskell
checkUrl :: String -> IO [String]
checkUrl url = do
    contents <- getUrl url
    case parse url contents of
        Left msg         -> return [msg]
        Right (dtd, doc) -> return $ validate dtd doc

parse :: String -> String -> Either String (DocTypeDecl, Element)
parse file content =
    case dtd of
        Left msg   -> Left msg
        Right dtd' -> Right (dtd', root)
  where
    doc  = xmlParse file content
    root = getRoot doc
    dtd  = getDtd doc
```
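For reference, here is roughly what the helper functions assumed above might look like. These are my own sketches, not part of HaXml: `getUrl` wraps the HTTP package’s `simpleHTTP`, while `getDtd` and `getRoot` pattern-match on HaXml’s `Document` and `Prolog` types, whose exact constructor layouts vary between HaXml versions, so treat the patterns as illustrative.

```haskell
-- Sketches of the helpers the snippet above assumes; these names are
-- mine, not HaXml's. The constructor layouts below follow one HaXml
-- version and may need adjusting for another.
import Network.HTTP (getRequest, getResponseBody, simpleHTTP)
import Text.XML.HaXml.Types

-- Fetch a page and hand back its body as a String.
getUrl :: String -> IO String
getUrl url = simpleHTTP (getRequest url) >>= getResponseBody

-- The DOCTYPE declaration, if any, lives in the document prolog.
getDtd :: Document -> Either String DocTypeDecl
getDtd (Document (Prolog _ _ mDtd _) _ _ _) =
    case mDtd of
        Just dtd -> Right dtd
        Nothing  -> Left "document has no DOCTYPE declaration"

-- The root element of the parsed document.
getRoot :: Document -> Element
getRoot (Document _ _ root _) = root
```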
Unfortunately, this fails miserably and reports that every single element and attribute in the document is invalid.
I puzzled over this for a while before I realized that while my XML document contained a `DocTypeDecl`, that object was useless for validating: it held only the identifier information for the DTD, none of its contents. Notice that we never actually fetched the DTD from the network in the above sequence. Once I realized this, I added a little more code to get the system identifier out of the DTD declaration, fetch that document, run it through `dtdParse`, and then validate using the resulting `DocTypeDecl`.

```haskell
checkUrl :: String -> IO [String]
checkUrl url = do
    contents <- getUrl url
    case parse url contents of
        Left msg         -> return [msg]
        Right (dtd, doc) -> getDtdAndValidate dtd doc

getDtdAndValidate :: DocTypeDecl -> Element -> IO [String]
getDtdAndValidate (DTD name mID _) doc =
    case mID of
        Nothing  -> return ["No external ID for " ++ name]
        Just eid -> fetchExternal eid
  where
    fetchExternal :: ExternalID -> IO [String]
    fetchExternal eid = do
        dtd <- fetchDtd (systemLiteral eid)
        return $ validate dtd doc

    systemLiteral (SYSTEM (SystemLiteral s))   = s
    systemLiteral (PUBLIC _ (SystemLiteral s)) = s

fetchDtd :: String -> IO DocTypeDecl
fetchDtd dtd = do
    contents <- getUrl dtd
    case dtdParse dtd contents of
        Nothing   -> error "No DTD"
        Just dtd' -> return dtd'
```
This would probably work for many cases, but it fails for XHTML because the DTD is actually split across multiple files. As a result, when the parser encounters an ENTITY declaration pointing at an external file, it fails with:

```
*** Exception: xhtml-lat1.ent: openFile: does not exist (No such file or directory)
```

This appears to be a dead end. Short of parsing the DTD myself, finding the external references, fetching them, and doing the substitution, there seems to be no way around the problem.
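Had I pursued that workaround, the first step — finding the external references in the DTD source — is at least straightforward. Here is a minimal sketch (my own code, nothing from HaXml) that scans DTD text for `SYSTEM` and `PUBLIC` keywords and pulls out the system literal naming the file to fetch; a real implementation would also need to skip comments and quoted text rather than match keywords blindly.

```haskell
import Data.List (isPrefixOf, tails)

-- Extract the system literals of external entity declarations from raw
-- DTD text. Forms handled: SYSTEM "sys" and PUBLIC "pub" "sys".
-- Deliberately naive: it doesn't skip comments or nested quoting.
externalEntityUris :: String -> [String]
externalEntityUris dtdSrc = concat [ pick t | t <- tails dtdSrc ]
  where
    -- Every double-quoted string from this point onward (lazily).
    quoted :: String -> [String]
    quoted s = case dropWhile (/= '"') s of
        ('"':rest) -> takeWhile (/= '"') rest
                      : quoted (drop 1 (dropWhile (/= '"') rest))
        _          -> []

    -- SYSTEM: first quoted string is the system literal.
    -- PUBLIC: the public id comes first, the system literal second.
    pick t
        | "SYSTEM" `isPrefixOf` t = take 1 (quoted t)
        | "PUBLIC" `isPrefixOf` t = take 1 (drop 1 (quoted t))
        | otherwise               = []
```

Run over xhtml1-strict.dtd, this should surface names like xhtml-lat1.ent, xhtml-symbol.ent, and xhtml-special.ent; each would then have to be fetched and textually spliced in before handing the combined result to `dtdParse`.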
Verdict: the API of HaXML looks more or less like you’d expect, although it would be really nice to derive `Show`, at least for the major data types, and the addition of named fields would make things a little easier for users. Unfortunately, what I would consider a common use case seems not to have been considered, which leaves the library limited in utility. I also can’t speak to the quality of the filter portion of the library: since I couldn’t validate my seed document, there was no point in extracting links from it to crawl further.

To come: adventures with HXT...