Aho-Corasick string matching in Haskell

The Aho-Corasick string matching algorithm constructs an automaton for matching a dictionary of patterns. When applied to an input string, the automaton’s time complexity is linear in the length of the input, plus the number of matches (so at worst quadratic in the input). It’s been around since 1975, but it isn’t implemented in the Haskell stringsearch library and I couldn’t even find a general trie data structure from google. So I implemented the Aho-Corasick algorithm myself: take a look at the full Aho-Corasick module.

There was an interesting paper on deriving the algorithm as a result of applying fully-lazy evaluation and memoization on a more naive algorithm. Unfortunately, applying fully-lazy evaluation and memoization to a function in Haskell is non-trivial (despite it being theoretically possible for the compiler to do so!).

It’s always interesting trying to find the functional equivalent to an imperative algorithm. I ended up using some cute Haskell tricks.

Update: I’ve written an improved version of Aho-Corasick implemented with Data.Array and Data.Map

Continue reading “Aho-Corasick string matching in Haskell”

monadic parser combinators with parsec

I decided to pull out Haskell for my Linguistics 1101 Phrase Structure Rules Assignment. It seemed like a good opportunity to play with these monadic parser combinator things, which sound impressive if nothing else. The result was pleasing, although I’m not sure if my tutor will appreciate it.

It was fun revisiting Haskell, and writing parsers directly using Parsec is certainly a novel alternative to using a Bison-style compiler-compiler. Spirit was similar, but C++ can become so syntactically clunky some of the joy is lost.

I’m not sure whether it was something specific to the Parsec paradigm, my abuse of Parsec, or my ignorance of Haskell and monadic programming in general, but I kept finding myself on the wrong side of Monads and do-expressions. It seems you have to use liftM a lot.

In looking for a generalisation of the liftMn functions I came up with:

foldMLr :: (Monad m) => (t -> a -> a) -> a -> [m t] -> m a
-- foldMLr f u xs binds the monads in xs headfirst, and folds their results
-- from the right using f and u as the rightmost.
foldMLr _ u [] = return u
foldMLr f u (x:xs) = do { a <- x ; b <- foldMLr f u xs ; return (f a b) }
-- equivalently:
-- foldMLr f u (x:xs) = liftM2 f x (foldMLr f u xs)

which is not the same as foldM but is a generalisation of sequence, which can be defined as foldMLr (:) []. I didn’t end up using it in the final parser.

Another issue was that constructing a parse tree (using Data.Tree types) was actually somewhat tedious. I guess Parsec assumes that you want to fold up the result within the parsers.

Also watch out for the change in showErrorMessages, in ghc it takes some extra initial string arguments that weren’t there in the standalone release.