Improved Aho-Corasick in Haskell

I was disappointed at the poor performance of my last attempt at string matching in Haskell, so I decided to rewrite it using the high-performance Data.Map and Data.Array data structures. This makes the semantics a lot closer to an imperative C implementation. The result isn’t so elegant, but it is predictable due to guarantees on the asymptotic performance of the array and map functions. It’s also fast.

This highlights one of the big problems in writing fast Haskell: the elegant, recursive, list-based style you learnt is unpredictable in terms of time complexity. You know not to use ++ where cons would do, and more generally aim for tail recursion; but when you need to decide what should be boxed/unboxed/strict/lazy it’s very frustrating. Without being very familiar with what reduction strategy the compiler used, it was unclear what was actually going on with my previous code until I pulled out the profiler.

My approach constructs a trie as before, this time using a Data.Map instead of partial application for the goto function. There’s quite an ugly fold here to make sure all the nodes get labeled with ids (which are essentially just pointers for when the trie gets flattened into a Data.Array).

mkTrie :: Int -> String -> [String] -> Trie -> Trie
mkTrie i prefix xs f =
  case (null prefix, null (head xs)) of
    (True, _) -> Root (nextid bs) (Map.fromList bs)
    (_, True) -> Output (if null bs then i else nextid bs) branches f prefix
    (_, False) -> Trie (nextid bs) branches f
    nextf = if null prefix then const else failTo
    nextid = (+) 1 . nodeid . snd . head
    -- q maps groups to an assoc list assigning postordered ids
    q [] (c, ys) = (c, (mkTrie i (c:prefix) ys (nextf f c))) : []
    q ts (c, ys) = (c, (mkTrie (nextid ts) (c:prefix) ys (nextf f c))) : ts
    branches = Map.fromList bs
    bs = (foldl q [] .
          map (liftM2 (,) (head . head) (map tail)) .
          groupBy ((. head) . (==) . head) .
          filter (not.null)) xs

Check out the full AhoCorasick module.

Leave a Reply

Your email address will not be published.


You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>