2008-02-12 11:11 in /tech/haskell
Last night at the Portland Functional Programming Study Group, someone claimed that 90% of learning Haskell is figuring out monads. I replied, “The problem is that the first 90% of learning Haskell is figuring out monads; the second 90% is figuring out arrows; the third 90% is figuring out monad transformers...”
In seriousness, we had a good discussion about the issues “practical programmers” have with using Haskell, which largely come down to the need to learn seemingly irrelevant mathematical abstractions to get almost anything done. Sure, the difficulty of doing IO is vastly overblown. However, if you look at my series on writing a web spider (1, 2, 3), you’ll see an example of this. HaXML failed, not because using monads is hard, but because properly separating pure and monadic tasks is hard. In particular, parsing a DTD is not a pure function! HXT required learning about arrows, then about derivatives of regular expressions. I ran into a problem with leaking file descriptors, which I assumed had something to do with laziness, so I learned about strictness annotations; in fact, Paul Brown recently showed that it’s actually an error-handling bug in Network.HTTP. I finally got a working program, which I then proceeded to never use, because it takes too damn long to run, and figuring out how to add concurrency felt too overwhelming at that point: it would undoubtedly require some new abstraction to learn, and I was just out of steam.
2008-01-26 16:50 in /tech/haskell
Recently, Reg Braithwaite wrote about Object#andand, a method for chaining computations that can fail. It’s not clear from his post if he realizes it (maybe he considers it a completely uninteresting observation), but he’s become the latest person to reinvent monads. In Haskell, his proposed &&. operator is spelled >>= (and is significantly more general, as he’s really only created the Maybe monad).
(Although I originally claimed the second half had no similar connection, after thinking about it more, I now retract that. Haskell’s >> operator chains two computations, ignoring the result of the first; in essence performing it just for the “side effects”. >> and returning have a similar flavor, although the Haskell versions permit a far wider class of side effects.)
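To make the correspondence concrete, here is a small sketch of my own (safeDiv is a made-up example, not anything from Braithwaite’s post) showing >>= chaining computations that can fail, the way &&. chains nil-able method calls:

```haskell
-- A division that can fail, chained with >>=; a Nothing anywhere in the
-- chain short-circuits the rest, just as nil does with &&.
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

chained :: Maybe Int
chained = safeDiv 100 5 >>= \x -> safeDiv x 2     -- Just 10

shortCircuited :: Maybe Int
shortCircuited = safeDiv 100 0 >>= \x -> safeDiv x 2   -- Nothing; the second step never runs

effectsOnly :: Maybe Int
effectsOnly = Just 1 >> Just 2    -- >> discards the first result, keeping only its "effect"
```

The same two operators work unchanged in IO, State, and every other monad, which is the sense in which this is more general than a method on Object.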
2007-11-29 14:30 in /tech/haskell
This post is going to be a little closer to a flame than I’m usually comfortable with, but I’m finding myself genuinely frustrated by a recent rash of bad language advocacy from people who should know better. This is by no means a problem limited to the Haskell community, but I’ve been noticing an increase from the Haskellers lately.
Today, dons posted a pair of articles benchmarking naive Fibonacci implementations. Now, someone else started it, but this is still pretty silly. In post #1, we learn the shocking fact that static, compiled, finite-precision Haskell is faster than dynamic, interpreted, arbitrary-precision Python and Ruby at a heavily numeric and recursive task.
Post #2 is supposed to impress us with how easy it is to parallelize Haskell to use multiple cores. Unfortunately, the naive attempt to parallelize is actually slower than the original serial version. But if you:
- implement the algorithm twice,
- add a magic number apparently pulled out of someone’s nether regions, and
- add not just some compile flags, but some runtime flags too,
then you can get a whopping 5% speed improvement with 2 cores, and almost a factor of two speedup with 4 cores, meanwhile burning about twice as many total CPU cycles!
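That recipe corresponds roughly to the following sketch (my reconstruction, not the benchmarked code), using par and pseq from base’s GHC.Conc; the threshold plays the role of the magic number:

```haskell
import GHC.Conc (par, pseq)

-- The serial algorithm, written a second time for the parallel version
-- to fall back on below the cutoff.
sfib :: Int -> Int
sfib n | n < 2     = n
       | otherwise = sfib (n - 1) + sfib (n - 2)

-- Spark one recursive branch while evaluating the other; below the
-- threshold (the "magic number"), sparking costs more than it saves.
pfib :: Int -> Int -> Int
pfib threshold n
    | n < threshold = sfib n
    | otherwise     = x `par` (y `pseq` (x + y))
  where
    x = pfib threshold (n - 1)
    y = pfib threshold (n - 2)
```

And even then you see no benefit without compiling with -threaded and running with the right +RTS -N flags, which is exactly the complaint.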
Color me not impressed.
Meanwhile, elsewhere people have pointed out that using an intelligent algorithm yields a 1000-fold speed improvement. And, the better algorithm is actually shorter than the original naive implementation.
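I don’t know exactly which algorithm those posts used, but a representative example is the classic lazy-list definition, which runs in linear time and really is shorter than the doubly-recursive version:

```haskell
-- The infinite list of Fibonacci numbers; each element is computed once
-- and shared, so indexing into it takes linear work rather than exponential.
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

fib :: Int -> Integer
fib n = fibs !! n
```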
What’s my point, since I’m not writing this just to pick on people? It’s that we shouldn’t give in to the temptation to participate in dumb benchmarks like this. Doing dumb stuff makes you look dumb or, worse, dishonest. And Haskell is about making you look smart, right? So let’s not play the game, okay?
2007-11-01 12:20 in /tech/haskell
In my last post, I had gotten stuck on a performance problem in HXT that rendered it unusable. Since then, I’ve exchanged a number of emails with Uwe Schmidt, the maintainer. He found where the exponential blowup was happening in the regex engine and fixed that problem. With that fix, my spider ran for a bit longer, but eventually failed after hitting the per-process limit on open file descriptors. I tried adding strictA in a couple of places in the code, but it did not resolve the resource leak. Uwe believes this is a bug in Network.HTTP, and suggested the a_use_curl option, which spawns an external curl program to do the fetching. While it sucks to spawn hundreds of processes for this task, it did fix the resource leak.
With those problems out of the way, I was able to focus on some issues in my own program, like trying to validate JPEG images as XML, or trying to fetch mailto: links. I’m now reasonably happy with the program, which you can see in the HXT/Practical section of the Haskell wiki.
The major area where this could still be improved is parallelization. Verifying the roughly 700 pages and links on my site takes 45 minutes, during which the program is actually doing work for only about 8; the rest is spent waiting on the network. It would definitely be a good exercise to learn more about Haskell’s concurrency capabilities, although the hidden system state in HXT makes me nervous about whether it’ll work at all. I’d probably want to do a couple of simpler exercises in concurrent programming before attempting to parallelize this one.
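The kind of simpler exercise I have in mind might look like this sketch (nothing HXT-specific; a real version would need exception handling and a cap on the number of threads): run an IO action over a list of inputs, one forked thread per input, collecting results through MVars.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Fork one thread per input; each thread deposits its result in an MVar,
-- and we collect them in order at the end.
mapConcurrently' :: (a -> IO b) -> [a] -> IO [b]
mapConcurrently' f xs = do
    vars <- mapM (\x -> do
                      v <- newEmptyMVar
                      _ <- forkIO (f x >>= putMVar v)
                      return v)
                 xs
    mapM takeMVar vars
```

For a network-bound workload like this spider, even GHC’s lightweight threads on a single core should recover most of the time currently spent blocked on the network.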
I have a few remaining complaints and suggestions for HXT. One, which I believe Uwe is already thinking about adding, is an option to adapt the parsing based on the content type of the response. Currently, you have to specify HTML or XML parsing in the call to readDocument, which is not terribly useful in an application like this one. It would be much nicer if HXT used XML parsing when the content type is XML, HTML parsing when it’s text/html, and complained about anything else (like image/jpeg).
Another frustration was that tracking parsing and validation errors to their source was very difficult in some cases. A missing end tag frequently doesn’t produce a parse error until much later in the document. A validator would catch this much earlier, but HXT does parsing and validation in two separate passes. One can insert the missing end tag at the point of the parse error and then look at the resulting validation error, but the tree that the validator operates on doesn’t carry any line or column information, so you can’t easily track a validation error down to a specific location in the source file. Presumably the nodes of the tree could be augmented with this data fairly simply, but the other shortcomings of the two-pass approach are undoubtedly much harder to fix.
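The dispatch I’m suggesting amounts to something like this sketch (entirely hypothetical; this is not HXT’s API):

```haskell
import Data.List (isInfixOf)

data ParseMode = ParseHTML | ParseXML
    deriving (Show, Eq)

-- Pick a parser from the Content-Type header, and refuse anything that
-- is neither HTML nor some flavor of XML.
chooseParser :: String -> Either String ParseMode
chooseParser ct
    | ct == "text/html"    = Right ParseHTML
    | "xml" `isInfixOf` ct = Right ParseXML    -- text/xml, application/xhtml+xml, ...
    | otherwise            = Left ("refusing to parse " ++ ct)
```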
2007-10-24 09:10 in /tech/haskell
2007-09-30 11:30 in /tech/haskell
Once HaXML proved unsuitable for validating XHTML, I turned my attention to HXT, the Haskell XML Toolkit. While the API for HaXML looked pretty similar to what I might have designed myself, HXT has more of a learning curve. In particular, it is based on the arrow computational structure. Like monads, arrows require learning new syntax and a new conceptual model. Unlike monads, where tutorials are a dime a dozen, there’s little out there to help you learn to use arrows effectively. This is complicated by the fact that HXT extends the base Arrow definition extensively, with little additional documentation.
(My one sentence explanation of arrows is that they model computation as pipeline elements which can be performed in sequence or in parallel.)
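Plain functions form the simplest Arrow instance, so the pipeline idea can be sketched without any XML machinery at all:

```haskell
import Control.Arrow ((>>>), (&&&), (***))

-- Feed the output of one stage into the next (sequence).
inSequence :: Int -> Int
inSequence = (* 2) >>> (+ 1)

-- Feed one input through two stages side by side (parallel fan-out).
inParallel :: Int -> (Int, Int)
inParallel = (* 2) &&& (+ 1)

-- Run two stages on the two halves of a pair.
onPairs :: (Int, Int) -> (Int, Int)
onPairs = (* 2) *** (+ 1)
```

HXT’s arrows additionally carry XML trees, IO, and state through these pipelines, but the combinators compose the same way.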
Despite the paucity of documentation, I got much further along with HXT. In fact, I have a complete working program, except for a “but” that would satisfy Sir Mix-a-Lot. I’m going to show most of this program, with annotations, then explain how things go wrong.
    type MyArrow b c = IOStateArrow (Set.Set String) b c

    runUrl url = runX (constA url >>>
                       setTraceLevel 1 >>>
                       withOtherUserState Set.empty >>>
                       split >>>
                       checkUrl)
HXT throws attempts at purity and separation of concerns to the wind, and pushes everything it does into an IOStateArrow (underneath which are the IO and State monads). The state is separated into a system state and a user state, which is () by default. Because I’m going to want to track the URLs I’ve crawled, I specify a Set of Strings as my user state. This code shoves the seed URL into an arrow using constA, enables a low level of tracing, and sets up my initial state. With that setup done, we can start doing some real work. (The split will become clear in a second.)
    checkUrl :: MyArrow (String,String) String    -- (url, base)
    checkUrl = clearErrStatus >>>
               ifA (first seenUrl)
                   (fst ^>> traceString 1 ("Skipping (seen already) " ++) >>> none)
                   (first markSeen >>> ifP isLocal checkLocalUrl checkRemoteUrl)
This function checks whether the URL has been seen before, by checking it against the Set in the user state; if it has, we emit a debugging message and then stop. (none is a function on ListArrows that essentially produces an empty list, signifying no more work to be done.) If this is a new URL, we mark it as seen, then branch based on whether it is a local or a remote URL. This is where the split above comes in: we keep track of the previous URL that this one was linked from, in order to figure out when we are leaving the original website.
The most mysterious part of checkUrl is the first line. Originally I did not have it, and I observed that the spider would run for a while but terminate before the whole site was crawled. After adding some additional debugging statements, I discovered something I’m inclined to consider a bug in HXT. After a document with validation errors is encountered, something gets set in the global error state which causes all further attempts to read in a document to fail silently. So, after the spider found its first errors, it would terminate shortly thereafter, as it wasn’t managing to pick up any new URLs to crawl. Adding clearErrStatus before each new fetch prevents this failure.
    checkLocalUrl :: MyArrow (String, String) String
    checkLocalUrl = constructLocalUrl >>>
                    split >>>
                    first (traceString 0 ("Checking " ++) >>>
                           readFromDocument >>>
                           selectLinks >>>
                           traceString 1 ("Found link: " ++)) >>>
                    checkUrl

    selectLinks :: ArrowXml a => a XmlTree String
    selectLinks = deep (isElem >>> hasName "a" >>>
                        getAttrValue "href" >>> mkText) >>>
                  getText
checkLocalUrl expands any relative URLs, then reads in the resulting URL and selects out any hrefs from the document. The result is a list of new URLs to crawl, paired with the URL of this document, which we pass recursively back into checkUrl. What’s implicit in this code is that readFromDocument validates the document by default, and in addition to fetching the document itself also fetches the DTD, including any subparts, thus avoiding the difficulties I had with HaXML. Somewhat oddly, the library simply prints the validation errors, rather than returning them from the function, but that’s something I can live with in this application. (I think it would be possible to specify an alternate error handler if you wanted to store the errors for later processing.)
checkRemoteUrl is not terribly interesting; for the purposes of this exposition, you can just consider it to be a synonym for

    checkRemoteUrl = none
This code seems to be correct, BUT..... I set it running on my website and it chugs along for a while. Then it hits a particular page (of perfectly valid XHTML, by the way), starts validation, and just never stops. I let it run for about 40 minutes with the CPU pegged before killing the process. Some further investigation with a pared-down version of the document showed that it’s not in an infinite loop, but in some nasty, presumably exponential, blowup in the regular expression engine. The source of this blowup is somehow non-local: removing either half of a list of paragraphs eliminates the problem, removing an enclosing div eliminates the problem, and so on. The author of HXT chose to implement his own regex engine based on the concept of “derivatives of regular expressions”, which I gather are academically interesting but, it would seem, not ideal in practice.
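For the curious, the derivative idea itself is elegant; a minimal matcher fits in a few lines (this is my own sketch, nothing like HXT’s actual engine). The catch is that without aggressive simplification the intermediate expressions can grow explosively, which is consistent with the blowup I’m seeing.

```haskell
-- Brzozowski derivatives: deriv c r is a regex matching exactly those
-- strings w such that r matches (c:w). Matching a string is then just a
-- fold of deriv over its characters, followed by a nullability check.
data RE = Empty            -- matches nothing
        | Eps              -- matches only the empty string
        | Chr Char
        | Alt RE RE
        | Seq RE RE
        | Star RE

nullable :: RE -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Chr _)   = False
nullable (Alt a b) = nullable a || nullable b
nullable (Seq a b) = nullable a && nullable b
nullable (Star _)  = True

deriv :: Char -> RE -> RE
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv c (Chr d)   = if c == d then Eps else Empty
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Seq a b)
    | nullable a  = Alt (Seq (deriv c a) b) (deriv c b)
    | otherwise   = Seq (deriv c a) b
deriv c (Star a)  = Seq (deriv c a) (Star a)

matches :: RE -> String -> Bool
matches r = nullable . foldl (flip deriv) r
```

Note that each Seq and Star case duplicates subexpressions; without a simplification pass between steps, repeated derivation can blow up the term size.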
At the moment, this is where I’m stuck. I’m pretty comfortable with the program as it stands, but the library is letting me down. Fortunately, this problem seems more tractable than the HaXML problem. The decision is then whether to a) wait for the maintainer to fix it, b) try to fix it myself, c) try to replace the custom regex engine in HXT with one of the standard implementations, or d) rip out the regular expressions from the validation entirely and replace them with standard string functions which would be guaranteed to perform linearly. The biggest obstacle to working on this myself is not actually the code, but the fact that it takes nearly an hour to recompile the library on my computer.
Verdict: HXT has a somewhat steep learning curve, and the API is a little rough around the edges in places, particularly the state handling parts. There is a desperate need for a better, more comprehensive tutorial. This library was written as someone’s masters thesis, and the code has the look of something that has never seen serious code review (e.g., lots of typos, even in function names). I can see no good reason to reimplement regular expressions for this task; actually, I can see no good reason to use regular expressions at all for this task. This portion of the code should be completely overhauled. On the other hand, it is easy to add debugging statements and to thread state through the computation, and the arrow API has a certain elegance to it.
To come: A resolution? Parallelization?
2007-09-29 23:40 in /tech/haskell
I’ve been working on writing a validating web spider in Haskell, mostly because it’s something that would be useful for me. I want to be able to validate the markup on my entire web site and the external services for (X)HTML validation usually only do 100 pages before they stop crawling. I also want to learn Haskell better. Of course, I recognize that this type of task is not Haskell’s forte, and that’s exactly why I chose it. I want to push the edges of what the language is good at to understand its weaknesses as a general purpose language, as well as its strengths. Along the way, I’ve been learning a lot about monads, arrows, error handling, managing state, as well as the usefulness of the community and the general quality of the libraries.
There are three XML libraries for Haskell: HXML, HaXML, and HXT. HXML does not provide any validation functions, so it was immediately out of the running. HaXML seems relatively simple and was the next library I located, so I first attempted to use it for this task. I’ll talk about HXT in the next article in this series. This presentation is not strictly chronological, as I’ve bounced back and forth between HaXML and HXT when I’ve hit problems with one or the other.
HaXML is fairly minimalist. It doesn’t deal with fetching content off the net, and it doesn’t force you into IO or any other monad. On the one hand, this seems like good library design, but it also leads to some shortcomings. A big one is that it’s hard to debug relative to HXT, because you can’t just toss in trace messages at will; and, for added frustration, the data types don’t derive Show, so even once you get them to the outer layers of your program and have access to the IO monad, you still can’t easily display them.
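For contrast, when a type does derive Show, debugging output comes for free; a hypothetical sketch (these are not HaXML’s actual types):

```haskell
-- Toy stand-ins for an XML library's types; deriving Show is one line.
data Attr    = Attr String String    deriving Show
data Element = Element String [Attr] deriving Show

-- show produces a printable rendering, so a value can be dumped from
-- anywhere that has access to IO (or via Debug.Trace from pure code).
example :: String
example = show (Element "a" [Attr "href" "/index.html"])
```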
Despite this, the naive approach to fetching, parsing, and validating a document seems simple: use Network.URI to fetch a URL, then call xmlParse on the content, and then validate on the resulting DOM. (I’m skipping over handling all the error cases which can arise.)
    checkUrl :: String -> IO [String]
    checkUrl url = do
        contents <- getUrl url
        case parse url contents of
            Left msg         -> return [msg]
            Right (dtd, doc) -> return $ validate dtd doc

    parse :: String -> String -> Either String (DocTypeDecl, Element)
    parse file content =
        case dtd of
            Left msg   -> Left msg
            Right dtd' -> Right (dtd', root)
      where
        doc  = xmlParse file content
        root = getRoot doc
        dtd  = getDtd doc
Unfortunately, this fails miserably and reports that every single element and attribute in the document is invalid.
I puzzled over this for a while before realizing that while my XML document contained a DocTypeDecl, that object was useless for validating, because it contained only the identifier information for the DTD and none of its contents. Notice that we never actually fetched the DTD from the network in the above sequence. Once I realized this, I added a little more code to get the system identifier out of the DTD declaration, fetch that document, run it through dtdParse, and then validate using the result:
    checkUrl :: String -> IO [String]
    checkUrl url = do
        contents <- getUrl url
        case parse url contents of
            Left msg         -> return [msg]
            Right (dtd, doc) -> getDtdAndValidate dtd doc

    getDtdAndValidate :: DocTypeDecl -> Element -> IO [String]
    getDtdAndValidate (DTD name mID _) doc =
        case mID of
            Nothing -> return ["No external ID for " ++ name]
            Just id -> fetchExternal id
      where
        fetchExternal :: ExternalID -> IO [String]
        fetchExternal id = do
            dtd <- fetchDtd (systemLiteral id)
            return $ validate dtd doc

        systemLiteral (SYSTEM (SystemLiteral s))   = s
        systemLiteral (PUBLIC p (SystemLiteral s)) = s

        fetchDtd :: String -> IO DocTypeDecl
        fetchDtd dtd = do
            contents <- getUrl dtd
            case dtdParse dtd contents of
                Nothing   -> error "No DTD"
                Just dtd' -> return dtd'
This would probably work for many cases, but it fails for XHTML, because the DTD is actually split across multiple files. As a result, when the parser encounters an ENTITY declaration referencing an external file, it fails with:

    *** Exception: xhtml-lat1.ent: openFile: does not exist (No such file or directory)

This appears to be a dead end. Short of parsing the DTD myself, finding the external references, fetching them, and doing the substitution, there seems to be no way around this problem.
Verdict: The API of HaXML looks more or less like what you’d expect, although it would be really nice to derive Show, at least for the major data types, and the addition of named fields would make things a little easier for users. Unfortunately, what I would think to be a common use case seems not to have been considered, leaving the library of limited utility. I also can’t speak to the quality of the filter portion of the library: since I couldn’t validate my seed document, there was no point in extracting links from it to crawl further.
To come: adventures with HXT...
2007-03-04 16:10 in /tech/haskell
I’m now at about 1 month into learning Haskell. I’ve been moving a little slower than I might have, partly because I’ve been trying to make sure it doesn’t crowd out other activities and partly because in theory Chris is going to catch up with me and we’ll work on this together. It’s still the case that nothing is particularly frustrating me about the language, although I’m still a little shaky on some issues of precedence, and I’m sure my style and idiom leave something to be desired. (Okay, sometimes the compiler error messages are frustratingly hard to decipher. On the other hand, they tend to point out what would have been frustrating-to-debug errors, like those precedence mistakes I continue to make.)
I’ve been venturing into little programs involving IO, since everyone seems to think that’s the scary part of Haskell. Honestly, I didn’t find it that bad. Maybe the tutorials are just getting better. I was going to write something about how I’ve come to understand IO in Haskell, but Eric at Nub Games did such an excellent job with Haskell IO for Imperative Programmers that I find I have nothing to add.
So far, the largest program I’ve written (which is not very large) is a recursive directory lister. I figured this would be a good test of whether I understood how to work with IO, since it was going to involve sucking things into the IO monad recursively. As I was working on it, I saw a couple of other blog posts about writing the same sort of thing; I guess other people had the same idea. Mine is a little different from some others, because it takes input from the command line and can work with either an initial file or a directory. This is it, if anyone cares: walker.hs. (I implemented my own, non-portable, path concatenation, since the standard libraries inexplicably don’t provide this. I got a reference to a FilePath library that handles it, but didn’t want to bother installing it for a simple toy exercise.)
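The actual program is walker.hs, linked above; the core of such a lister looks roughly like this sketch (using System.Directory, with the same naive “/” concatenation):

```haskell
import System.Directory (doesDirectoryExist, getDirectoryContents)

-- Recursively collect all paths under a starting point. A file yields
-- itself; a directory yields itself plus everything beneath it.
walk :: FilePath -> IO [FilePath]
walk path = do
    isDir <- doesDirectoryExist path
    if not isDir
        then return [path]
        else do
            entries <- getDirectoryContents path
            let children = [ path ++ "/" ++ e    -- naive, non-portable concatenation
                           | e <- entries, e `notElem` [".", ".."] ]
            rest <- mapM walk children
            return (path : concat rest)
```

A driver would read the starting points from the command line with getArgs and mapM_ over them, printing each path.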
I’ve been thinking about how or when I’ll declare that I’ve learned Haskell well enough to satisfy my 101 in 1001 goal. I jotted down some minimal requirements:
- Complete all the CS 11 labs. I’ve done the first 4 of the 5 posted so far. I assume there will be 9 or 10 total.
- Finish reading A Gentle Introduction to Haskell. I’ve done this, although I should re-read the Arrays chapter. My next reading goal is All About Monads; after that, probably a proper book.
- Work through all of the 99 Haskell Problems. I’ve done all but one of the first 40, and I’m trying to complete about 10 per week, although that might slow down as they get harder.
- Write a simple web spider to do validation and link checking on my website. I pseudocoded most of this a couple of nights ago, but I need to go hunt down libraries, particularly for the HTML/XHTML parsing and validation.
Once I’ve done all that, I’ll re-assess and decide if there’s more I need to do to feel like I can say that I’ve learned the language to a reasonable degree.
Oh, and I’ve been occasionally hanging out on #haskell. I’m really impressed with how friendly and helpful the people there are. IRC that’s actually useful. Who knew it existed?
2007-02-12 18:20 in /tech/haskell
I’ve spent a modest amount of time in the last week starting to learn Haskell. Basically, I’ve read about half of A Gentle Introduction to Haskell and bits of various other tutorials, I’ve coded solutions to the first 20 of 99 Haskell Problems, and I’ve completed Lab 1 of Caltech CS 11: Haskell Track. So far, nothing has been particularly strenuous. Actually, the toughest thing was figuring out how to address the problem of GHCi refusing to print my user-defined datatypes (and, as I was writing this and flipping through other materials, I realized that I did this the hard way, which is good, because I was going to complain bitterly about it otherwise).
Some random thoughts and observations:
- Ints: This makes me sad.
- I really like the property of the ML family of languages that almost all the time, once your code compiles, it runs correctly (assuming you actually understand the problem you’re trying to solve).
- I’m still a little unclear on the best way to unit test Haskell code. I haven’t found a good tutorial on this yet. I’m also a little annoyed that there’s no haddock for HUnit. It seems like a real sign of immaturity in a language to ship something in the standard distribution with no documentation.
- Soon, I should start reading one of the actual books about the language, but I haven’t decided which yet. Also, I need to come up with some real-life problems to try solving with Haskell, to get a better feel for its practical strengths and weaknesses.