2008-02-12 11:11 in /tech/haskell
Last night at the Portland Functional Programming Study Group, someone claimed that 90% of learning Haskell is figuring out monads. I replied, “The problem is that the first 90% of learning Haskell is figuring out monads; the second 90% is figuring out arrows; the third 90% is figuring out monad transformers...”
In seriousness, we had a good discussion about the issues “practical programmers” have with using Haskell, which largely come down to the need to learn seemingly irrelevant mathematical abstractions to get almost anything done. Sure, the difficulty of doing IO is vastly overblown. However, if you look at my series on writing a web spider (1, 2, 3), you’ll see an example of this. HaXML failed, not because using monads is hard, but because properly separating pure and monadic tasks is hard. In particular, parsing a DTD is not a pure function! HXT required learning about arrows, then about derivatives of regular expressions. I ran into a problem with leaking file descriptors, which I assumed had something to do with laziness, so I learned about strictness annotations; in fact, Paul Brown recently showed that it’s actually an error-handling bug in Network.HTTP. I finally got a working program, which I then proceeded to never use, because it takes too damn long to run, and figuring out how to add concurrency felt too overwhelming at that point: it would undoubtedly require some new abstraction to learn, and I was just out of steam.
2008-01-26 16:50 in /tech/haskell
Recently, Reg Braithwaite wrote about Object#andand, a method for chaining computations that can fail. It’s not clear from his post if he realizes it (maybe he considers it a completely uninteresting observation), but he’s become the latest person to reinvent monads. In Haskell, his proposed &&. operator is spelled >>= (and is significantly more general, as he’s really only created the Maybe monad).
(Although I originally claimed the second half had no similar connection, after thinking about it more, I now retract that. Haskell’s >> operator chains two computations, ignoring the result of the first; in essence performing it just for the “side effects”. >> and returning have a similar flavor, although the Haskell versions permit a far wider class of side effects.)
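To make the correspondence concrete, here is a small sketch of my own (safeDiv is a made-up example, not anything from Braithwaite’s post) showing >>= chaining computations that can fail, the way &&. chains nil-able method calls:

```haskell
-- A division that can fail, chained with >>=; a Nothing anywhere in the
-- chain short-circuits the rest, just as nil does with &&.
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (x `div` y)

chained :: Maybe Int
chained = safeDiv 100 5 >>= \x -> safeDiv x 2     -- Just 10

shortCircuited :: Maybe Int
shortCircuited = safeDiv 100 0 >>= \x -> safeDiv x 2   -- Nothing; the second step never runs

effectsOnly :: Maybe Int
effectsOnly = Just 1 >> Just 2    -- >> discards the first result, keeping only its "effect"
```

The same two operators work unchanged in IO, State, and every other monad, which is the sense in which this is more general than a method on Object.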
2007-11-29 14:30 in /tech/haskell
This post is going to be a little closer to a flame than I’m usually comfortable with, but I’m finding myself genuinely frustrated by a recent rash of bad language advocacy from people who should know better. This is by no means a problem limited to the Haskell community, but I’ve been noticing an increase from the Haskellers lately.
Today, dons posted a pair of articles benchmarking naive Fibonacci implementations. Now, someone else started it, but this is still pretty silly. In post #1, we learn the shocking fact that static, compiled, finite-precision Haskell is faster than dynamic, interpreted, arbitrary-precision Python and Ruby at a heavily numeric and recursive task.
Post #2 is supposed to impress us with how easy it is to parallelize Haskell to use multiple cores. Unfortunately, the naive attempt to parallelize is actually slower than the original serial version. But if you:
- implement the algorithm twice,
- add a magic number apparently pulled out of someone’s nether regions, and
- add not just some compile flags, but some runtime flags too,
then you can get a whopping 5% speed improvement with 2 cores, and almost a factor of two speedup with 4 cores, meanwhile burning about twice as many total CPU cycles!
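That recipe corresponds roughly to the following sketch (my reconstruction, not the benchmarked code), using par and pseq from base’s GHC.Conc; the threshold plays the role of the magic number:

```haskell
import GHC.Conc (par, pseq)

-- The serial algorithm, written a second time for the parallel version
-- to fall back on below the cutoff.
sfib :: Int -> Int
sfib n | n < 2     = n
       | otherwise = sfib (n - 1) + sfib (n - 2)

-- Spark one recursive branch while evaluating the other; below the
-- threshold (the "magic number"), sparking costs more than it saves.
pfib :: Int -> Int -> Int
pfib threshold n
    | n < threshold = sfib n
    | otherwise     = x `par` (y `pseq` (x + y))
  where
    x = pfib threshold (n - 1)
    y = pfib threshold (n - 2)
```

And even then you see no benefit without compiling with -threaded and running with the right +RTS -N flags, which is exactly the complaint.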
Color me not impressed.
Meanwhile, elsewhere people have pointed out that using an intelligent algorithm yields a 1000-fold speed improvement. And, the better algorithm is actually shorter than the original naive implementation.
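I don’t know exactly which algorithm those posts used, but a representative example is the classic lazy-list definition, which runs in linear time and really is shorter than the doubly-recursive version:

```haskell
-- The infinite list of Fibonacci numbers; each element is computed once
-- and shared, so indexing into it takes linear work rather than exponential.
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

fib :: Int -> Integer
fib n = fibs !! n
```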
What’s my point, since I’m not writing this just to pick on people? It’s that we shouldn’t give in to the temptation to participate in dumb benchmarks like this. Doing dumb stuff makes you look dumb or, worse, dishonest. And Haskell is about making you look smart, right? So let’s not play the game, okay?
2007-11-01 12:20 in /tech/haskell
In my last post, I had gotten stuck on a performance problem in HXT that rendered it unusable. Since then, I’ve exchanged a number of emails with Uwe Schmidt, the maintainer. He found where the exponential blowup was happening in the regex engine and fixed that problem. With that fix, my spider ran for a bit longer, but eventually failed after hitting the per-process limit on open file descriptors. I tried adding strictA in a couple of places in the code, but it did not resolve the resource leak. Uwe believes this is a bug in Network.HTTP, and suggested the a_use_curl option, which spawns an external curl program to do the fetching. While it sucks to spawn hundreds of processes for this task, it did fix the resource leak.
With those problems out of the way, I was able to focus on some issues in my own program, like trying to validate JPEG images as XML, or trying to fetch mailto: links. I’m now reasonably happy with the program, which you can see in the HXT/Practical section of the Haskell wiki.
The major area where this could still be improved is parallelization. Verifying the roughly 700 pages and links on my site takes 45 minutes, during which the program is actually doing work for only about 8; the rest is spent waiting on the network. It would definitely be a good exercise to learn more about Haskell’s concurrency capabilities, although the hidden system state in HXT makes me nervous about whether it’ll work at all. I’d probably want to do a couple of simpler exercises in concurrent programming before attempting to parallelize this one.
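The kind of simpler exercise I have in mind might look like this sketch (nothing HXT-specific; a real version would need exception handling and a cap on the number of threads): run an IO action over a list of inputs, one forked thread per input, collecting results through MVars.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Fork one thread per input; each thread deposits its result in an MVar,
-- and we collect them in order at the end.
mapConcurrently' :: (a -> IO b) -> [a] -> IO [b]
mapConcurrently' f xs = do
    vars <- mapM (\x -> do
                      v <- newEmptyMVar
                      _ <- forkIO (f x >>= putMVar v)
                      return v)
                 xs
    mapM takeMVar vars
```

For a network-bound workload like this spider, even GHC’s lightweight threads on a single core should recover most of the time currently spent blocked on the network.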
I have a few remaining complaints and suggestions for HXT. One, which I believe Uwe is already thinking about adding, is an option to adapt the parsing based on the content type of the response. Currently, you have to specify HTML or XML parsing in the call to readDocument, which is not terribly useful in an application like this one. It would be much nicer if HXT used XML parsing when the content type is XML, HTML parsing when it’s text/html, and complained about anything else (like image/jpeg).
Another frustration was that tracking parsing and validation errors to their source was very difficult in some cases. A missing end tag frequently doesn’t produce a parse error until much later in the document. A validator would catch this much earlier, but HXT does parsing and validation in two separate passes. One can insert the missing end tag at the point of the parse error and then look at the resulting validation error, but the tree that the validator operates on doesn’t carry any line or column information, so you can’t easily track a validation error down to a specific location in the source file. Presumably the nodes of the tree could be augmented with this data fairly simply, but the other shortcomings of the two-pass approach are undoubtedly much harder to fix.
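The dispatch I’m suggesting amounts to something like this sketch (entirely hypothetical; this is not HXT’s API):

```haskell
import Data.List (isInfixOf)

data ParseMode = ParseHTML | ParseXML
    deriving (Show, Eq)

-- Pick a parser from the Content-Type header, and refuse anything that
-- is neither HTML nor some flavor of XML.
chooseParser :: String -> Either String ParseMode
chooseParser ct
    | ct == "text/html"    = Right ParseHTML
    | "xml" `isInfixOf` ct = Right ParseXML    -- text/xml, application/xhtml+xml, ...
    | otherwise            = Left ("refusing to parse " ++ ct)
```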
2007-10-24 09:10 in /tech/haskell
2007-09-30 11:30 in /tech/haskell
Once HaXML proved unsuitable for validating XHTML, I turned my attention to HXT, the Haskell XML Toolkit. While the API for HaXML looked pretty similar to what I might have designed myself, HXT has more of a learning curve. In particular, it is based on the arrow computational structure. Like monads, arrows require learning new syntax and a new conceptual model. Unlike monads, where tutorials are a dime a dozen, there’s little out there to help you learn to use arrows effectively. This is complicated by the fact that HXT extends the base Arrow definition extensively, with little additional documentation.
(My one sentence explanation of arrows is that they model computation as pipeline elements which can be performed in sequence or in parallel.)
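Plain functions form the simplest Arrow instance, so the pipeline idea can be sketched without any XML machinery at all:

```haskell
import Control.Arrow ((>>>), (&&&), (***))

-- Feed the output of one stage into the next (sequence).
inSequence :: Int -> Int
inSequence = (* 2) >>> (+ 1)

-- Feed one input through two stages side by side (parallel fan-out).
inParallel :: Int -> (Int, Int)
inParallel = (* 2) &&& (+ 1)

-- Run two stages on the two halves of a pair.
onPairs :: (Int, Int) -> (Int, Int)
onPairs = (* 2) *** (+ 1)
```

HXT’s arrows additionally carry XML trees, IO, and state through these pipelines, but the combinators compose the same way.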
Despite the paucity of documentation, I got much further along with HXT. In fact, I have a complete working program, except for a “but” that would satisfy Sir Mix-a-Lot. I’m going to show most of this program, with annotations, then explain how things go wrong.
    type MyArrow b c = IOStateArrow (Set.Set String) b c

    runUrl url = runX (constA url >>>
                       setTraceLevel 1 >>>
                       withOtherUserState Set.empty >>>
                       split >>>
                       checkUrl)
HXT throws attempts at purity and separation of concerns to the wind, and pushes everything it does into an IOStateArrow (underneath which are the IO and State monads). The state is separated into a system state and a user state, which is () by default. Because I’m going to want to track the URLs I’ve crawled, I specify a Set of Strings as my user state. This code shoves the seed URL into an arrow using constA, enables a low level of tracing, and sets up my initial state. With that setup done, we can start doing some real work. (The split will become clear in a second.)
    checkUrl :: MyArrow (String,String) String    -- (url, base)
    checkUrl = clearErrStatus >>>
               ifA (first seenUrl)
                   (fst ^>> traceString 1 ("Skipping (seen already) " ++) >>> none)
                   (first markSeen >>> ifP isLocal checkLocalUrl checkRemoteUrl)
This function checks whether the URL has been seen before, by checking it against the Set in the user state; if it has, we emit a debugging message and then stop. (none is a function on ListArrows that essentially produces an empty list, signifying no more work to be done.) If this is a new URL, we mark it as seen, then branch based on whether it is a local or a remote URL. This is where the split above comes in: we keep track of the previous URL that this one was linked from, in order to figure out when we are leaving the original website.
The most mysterious part of checkUrl is the first line. Originally I did not have it, and I observed that the spider would run for a while but terminate before the whole site was crawled. After adding some additional debugging statements, I discovered something I’m inclined to consider a bug in HXT. After a document with validation errors is encountered, something gets set in the global error state which causes all further attempts to read in a document to fail silently. So, after the spider found its first errors, it would terminate shortly thereafter, as it wasn’t managing to pick up any new URLs to crawl. Adding clearErrStatus before each new fetch prevents this failure.
    checkLocalUrl :: MyArrow (String, String) String
    checkLocalUrl = constructLocalUrl >>>
                    split >>>
                    first (traceString 0 ("Checking " ++) >>>
                           readFromDocument >>>
                           selectLinks >>>
                           traceString 1 ("Found link: " ++)) >>>
                    checkUrl

    selectLinks :: ArrowXml a => a XmlTree String
    selectLinks = deep (isElem >>> hasName "a" >>>
                        getAttrValue "href" >>> mkText) >>>
                  getText
checkLocalUrl expands any relative URLs, then reads in the resulting URL and selects out any hrefs from the document. The result is a list of new URLs to crawl, paired with the URL of this document, which we pass recursively back into checkUrl. What’s implicit in this code is that readFromDocument validates the document by default, and in addition to fetching the document itself also fetches the DTD, including any subparts, thus avoiding the difficulties I had with HaXML. Somewhat oddly, the library simply prints the validation errors, rather than returning them from the function, but that’s something I can live with in this application. (I think it would be possible to specify an alternate error handler if you wanted to store the errors for later processing.)
checkRemoteUrl is not terribly interesting; for the purposes of this exposition, you can just consider it to be a synonym for

    checkRemoteUrl = none
This code seems to be correct, BUT..... I set it running on my website and it chugs along for a while. Then it hits a particular page (of perfectly valid XHTML, by the way), starts validation, and just never stops. I let it run for about 40 minutes with the CPU pegged before killing the process. Some further investigation with a pared-down version of the document showed that it’s not in an infinite loop, but in some nasty, presumably exponential, blowup in the regular expression engine. The source of this blowup is somehow non-local: removing either half of a list of paragraphs eliminates the problem, removing an enclosing div eliminates the problem, and so on. The author of HXT chose to implement his own regex engine based on the concept of “derivatives of regular expressions”, which I gather are academically interesting but, it would seem, not ideal in practice.
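For the curious, the derivative idea itself is elegant; a minimal matcher fits in a few lines (this is my own sketch, nothing like HXT’s actual engine). The catch is that without aggressive simplification the intermediate expressions can grow explosively, which is consistent with the blowup I’m seeing.

```haskell
-- Brzozowski derivatives: deriv c r is a regex matching exactly those
-- strings w such that r matches (c:w). Matching a string is then just a
-- fold of deriv over its characters, followed by a nullability check.
data RE = Empty            -- matches nothing
        | Eps              -- matches only the empty string
        | Chr Char
        | Alt RE RE
        | Seq RE RE
        | Star RE

nullable :: RE -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Chr _)   = False
nullable (Alt a b) = nullable a || nullable b
nullable (Seq a b) = nullable a && nullable b
nullable (Star _)  = True

deriv :: Char -> RE -> RE
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv c (Chr d)   = if c == d then Eps else Empty
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Seq a b)
    | nullable a  = Alt (Seq (deriv c a) b) (deriv c b)
    | otherwise   = Seq (deriv c a) b
deriv c (Star a)  = Seq (deriv c a) (Star a)

matches :: RE -> String -> Bool
matches r = nullable . foldl (flip deriv) r
```

Note that each Seq and Star case duplicates subexpressions; without a simplification pass between steps, repeated derivation can blow up the term size.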
At the moment, this is where I’m stuck. I’m pretty comfortable with the program as it stands, but the library is letting me down. Fortunately, this problem seems more tractable than the HaXML problem. The decision is then whether to a) wait for the maintainer to fix it, b) try to fix it myself, c) try to replace the custom regex engine in HXT with one of the standard implementations, or d) rip out the regular expressions from the validation entirely and replace them with standard string functions which would be guaranteed to perform linearly. The biggest obstacle to working on this myself is not actually the code, but the fact that it takes nearly an hour to recompile the library on my computer.
Verdict: HXT has a somewhat steep learning curve, and the API is a little rough around the edges in places, particularly the state handling parts. There is a desperate need for a better, more comprehensive tutorial. This library was written as someone’s masters thesis, and the code has the look of something that has never seen serious code review (e.g., lots of typos, even in function names). I can see no good reason to reimplement regular expressions for this task; actually, I can see no good reason to use regular expressions at all for this task. This portion of the code should be completely overhauled. On the other hand, it is easy to add debugging statements and to thread state through the computation, and the arrow API has a certain elegance to it.
To come: A resolution? Parallelization?
2007-09-29 23:40 in /tech/haskell
I’ve been working on writing a validating web spider in Haskell, mostly because it’s something that would be useful for me. I want to be able to validate the markup on my entire web site and the external services for (X)HTML validation usually only do 100 pages before they stop crawling. I also want to learn Haskell better. Of course, I recognize that this type of task is not Haskell’s forte, and that’s exactly why I chose it. I want to push the edges of what the language is good at to understand its weaknesses as a general purpose language, as well as its strengths. Along the way, I’ve been learning a lot about monads, arrows, error handling, managing state, as well as the usefulness of the community and the general quality of the libraries.
There are three XML libraries for Haskell: HXML, HaXML, and HXT. HXML does not provide any validation functions, so it was immediately out of the running. HaXML seems relatively simple and was the next library I located, so I first attempted to use it for this task. I’ll talk about HXT in the next article in this series. This presentation is not strictly chronological, as I’ve bounced back and forth between HaXML and HXT when I’ve hit problems with one or the other.
HaXML is fairly minimalist. It doesn’t deal with fetching content off the net, and it doesn’t force you into IO or any other monad. On the one hand, this seems like good library design, but it also leads to some shortcomings. A big one is that it’s hard to debug relative to HXT, because you can’t just toss in trace messages at will; and, for added frustration, the data types don’t derive Show, so even once you get them to the outer layers of your program and have access to the IO monad, you still can’t easily display them.
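For contrast, when a type does derive Show, debugging output comes for free; a hypothetical sketch (these are not HaXML’s actual types):

```haskell
-- Toy stand-ins for an XML library's types; deriving Show is one line.
data Attr    = Attr String String    deriving Show
data Element = Element String [Attr] deriving Show

-- show produces a printable rendering, so a value can be dumped from
-- anywhere that has access to IO (or via Debug.Trace from pure code).
example :: String
example = show (Element "a" [Attr "href" "/index.html"])
```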
Despite this, the naive approach to fetching, parsing, and validating a document seems simple: use Network.URI to fetch a URL, then call xmlParse on the content, and then validate on the resulting DOM. (I’m skipping over handling all the error cases which can arise.)
    checkUrl :: String -> IO [String]
    checkUrl url = do
        contents <- getUrl url
        case parse url contents of
            Left msg         -> return [msg]
            Right (dtd, doc) -> return $ validate dtd doc

    parse :: String -> String -> Either String (DocTypeDecl, Element)
    parse file content =
        case dtd of
            Left msg   -> Left msg
            Right dtd' -> Right (dtd', root)
      where
        doc  = xmlParse file content
        root = getRoot doc
        dtd  = getDtd doc
Unfortunately, this fails miserably and reports that every single element and attribute in the document is invalid.
I puzzled over this for a while before realizing that while my XML document contained a DocTypeDecl, that object was useless for validating, because it contained only the identifier information for the DTD and none of its contents. Notice that we never actually fetched the DTD from the network in the above sequence. Once I realized this, I added a little more code to get the system identifier out of the DTD declaration, fetch that document, run it through dtdParse, and then validate using the result:
    checkUrl :: String -> IO [String]
    checkUrl url = do
        contents <- getUrl url
        case parse url contents of
            Left msg         -> return [msg]
            Right (dtd, doc) -> getDtdAndValidate dtd doc

    getDtdAndValidate :: DocTypeDecl -> Element -> IO [String]
    getDtdAndValidate (DTD name mID _) doc =
        case mID of
            Nothing -> return ["No external ID for " ++ name]
            Just id -> fetchExternal id
      where
        fetchExternal :: ExternalID -> IO [String]
        fetchExternal id = do
            dtd <- fetchDtd (systemLiteral id)
            return $ validate dtd doc

        systemLiteral (SYSTEM (SystemLiteral s))   = s
        systemLiteral (PUBLIC p (SystemLiteral s)) = s

        fetchDtd :: String -> IO DocTypeDecl
        fetchDtd dtd = do
            contents <- getUrl dtd
            case dtdParse dtd contents of
                Nothing   -> error "No DTD"
                Just dtd' -> return dtd'
This would probably work for many cases, but it fails for XHTML, because the DTD is actually split across multiple files. As a result, when the parser encounters an ENTITY declaration referencing an external file, it fails with:

    *** Exception: xhtml-lat1.ent: openFile: does not exist (No such file or directory)

This appears to be a dead end. Short of parsing the DTD myself, finding the external references, fetching them, and doing the substitution, there seems to be no way around this problem.
Verdict: The API of HaXML looks more or less like what you’d expect, although it would be really nice to derive Show, at least for the major data types, and the addition of named fields would make things a little easier for users. Unfortunately, what I would think to be a common use case seems not to have been considered, leaving the library of limited utility. I also can’t speak to the quality of the filter portion of the library: since I couldn’t validate my seed document, there was no point in extracting links from it to crawl further.
To come: adventures with HXT...
2007-03-04 16:10 in /tech/haskell
I’m now at about 1 month into learning Haskell. I’ve been moving a little slower than I might have, partly because I’ve been trying to make sure it doesn’t crowd out other activities and partly because in theory Chris is going to catch up with me and we’ll work on this together. It’s still the case that nothing is particularly frustrating me about the language, although I’m still a little shaky on some issues of precedence, and I’m sure my style and idiom leave something to be desired. (Okay, sometimes the compiler error messages are frustratingly hard to decipher. On the other hand, they tend to point out what would have been frustrating-to-debug errors, like those precedence mistakes I continue to make.)
I’ve been venturing into little programs involving IO, since everyone seems to think that’s the scary part of Haskell. Honestly, I didn’t find it that bad. Maybe the tutorials are just getting better. I was going to write something about how I’ve come to understand IO in Haskell, but Eric at Nub Games did such an excellent job with Haskell IO for Imperative Programmers that I find I have nothing to add.
So far, the largest program I’ve written (which is not very large) is a recursive directory lister. I figured this would be a good test of whether I understood how to work with IO, since it was going to involve sucking things into the IO monad recursively. As I was working on it, I saw a couple of other blog posts about writing the same sort of thing; I guess other people had the same idea. Mine is a little different from some others, because it takes input from the command line and can work with either an initial file or a directory. This is it, if anyone cares: walker.hs. (I implemented my own, non-portable, path concatenation, since the standard libraries inexplicably don’t provide this. I got a reference to a FilePath library that handles it, but didn’t want to bother installing it for a simple toy exercise.)
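The actual program is walker.hs, linked above; the core of such a lister looks roughly like this sketch (using System.Directory, with the same naive “/” concatenation):

```haskell
import System.Directory (doesDirectoryExist, getDirectoryContents)

-- Recursively collect all paths under a starting point. A file yields
-- itself; a directory yields itself plus everything beneath it.
walk :: FilePath -> IO [FilePath]
walk path = do
    isDir <- doesDirectoryExist path
    if not isDir
        then return [path]
        else do
            entries <- getDirectoryContents path
            let children = [ path ++ "/" ++ e    -- naive, non-portable concatenation
                           | e <- entries, e `notElem` [".", ".."] ]
            rest <- mapM walk children
            return (path : concat rest)
```

A driver would read the starting points from the command line with getArgs and mapM_ over them, printing each path.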
I’ve been thinking about how or when I’ll declare that I’ve learned Haskell well enough to satisfy my 101 in 1001 goal. I jotted down some minimal requirements:
- Complete all the CS 11 labs. I’ve done the first 4 of the 5 posted so far. I assume there will be 9 or 10 total.
- Finish reading A Gentle Introduction to Haskell. I’ve done this, although I should re-read the Arrays chapter. My next reading goal is All About Monads; after that, probably a proper book.
- Work through all of the 99 Haskell Problems. I’ve done all but one of the first 40, and I’m trying to complete about 10 per week, although that might slow down as they get harder.
- Write a simple web spider to do validation and link checking on my website. I pseudocoded most of this a couple of nights ago, but I need to go hunt down libraries, particularly for the HTML/XHTML parsing and validation.
Once I’ve done all that, I’ll re-assess and decide if there’s more I need to do to feel like I can say that I’ve learned the language to a reasonable degree.
Oh, and I’ve been occasionally hanging out on #haskell. I’m really impressed with how friendly and helpful the people there are. IRC that’s actually useful. Who knew it existed?
2007-02-12 18:20 in /tech/haskell
I’ve spent a modest amount of time in the last week starting to learn Haskell. Basically, I’ve read about half of A Gentle Introduction to Haskell and bits of various other tutorials, I’ve coded solutions to the first 20 of 99 Haskell Problems, and I’ve completed Lab 1 of Caltech CS 11: Haskell Track. So far, nothing has been particularly strenuous. Actually, the toughest thing was figuring out how to address the problem of GHCi refusing to print my user-defined datatypes (and, as I was writing this and flipping through other materials, I realized that I did this the hard way, which is good, because I was going to complain bitterly about it otherwise).
Some random thoughts and observations:
- Ints: This makes me sad.
- I really like the property of the ML family of languages that almost all the time, once your code compiles, it runs correctly (assuming you actually understand the problem you’re trying to solve).
- I’m still a little unclear on the best way to unit test Haskell code. I haven’t found a good tutorial on this yet. I’m also a little annoyed that there’s no haddock for HUnit. It seems like a real sign of immaturity in a language to ship something in the standard distribution with no documentation.
- Soon, I should start reading one of the actual books about the language, but I haven’t decided which yet. Also, I need to come up with some real-life problems to try solving with Haskell, to get a better feel for its practical strengths and weaknesses.