Aho-Corasick string matching in Haskell

The Aho-Corasick string matching algorithm constructs an automaton for matching a dictionary of patterns. When applied to an input string, the automaton’s time complexity is linear in the length of the input, plus the number of matches (so at worst quadratic in the input). It’s been around since 1975, but it isn’t implemented in the Haskell stringsearch library and I couldn’t even find a general trie data structure from google. So I implemented the Aho-Corasick algorithm myself: take a look at the full Aho-Corasick module.

There was an interesting paper on deriving the algorithm as a result of applying fully-lazy evaluation and memoization on a more naive algorithm. Unfortunately, applying fully-lazy evaluation and memoization to a function in Haskell is non-trivial (despite it being theoretically possible for the compiler to do so!).

It’s always interesting trying to find the functional equivalent to an imperative algorithm. I ended up using some cute Haskell tricks.

Update: I’ve written an improved version of Aho-Corasick implemented with Data.Array and Data.Map

Instead of a BFS to compute the failure function, I propagate a recursive function forward as the trie is constructed. The separate mkRoot provides the base case with which to tie-the-knot.

mkRoot xs = let root = Root (edge [] (sort xs) root) in root
mkTrie prefix f xs = Node goto prefix ((not.null) self) f
    goto = edge prefix kids =<< (failTo f)
    (self, kids) = if null (head xs) then ([head xs], tail xs) else ([], xs)

Instead of using a list to implement the branches of a rose tree, I used partial-application over edge. This certainly looks elegant, but in fact it is the weak point, as withPrefix is a linear search; the imperative approach is an O(1) lookup (with small alphabets) or O(log m) over m branches. Furthermore, the lazy evaluation of edge means that the trie is being constantly reconstructed as it is traversed by the automaton.

data Trie = Node (Char -> Maybe Trie) String Bool Trie 
          | Root (Char -> Maybe Trie)

edge :: String -> [String] -> Trie -> Char -> Maybe Trie
edge prefix xs f c =
  if null (withPrefix c)
  then Nothing
  else Just (mkTrie (c:prefix) f (map tail (withPrefix c)))
    withPrefix c = takeWhile ((c==) . head) . dropWhile ((c>) . head) $ xs

Obviously it’s not generic over types or anything, but it should work fine with lists of types other than Char.

The following pathological case didn’t run too badly (25 seconds for m=50, n=100000 on a 2.16 GHz Core 2 Duo Macbook Pro, compiled with ghc -O2). Profiling it revealed 20 million entries into edge; which easily dominates the timing. Oddly enough this just seems to be a large constant—other samples suggest it’s linear in the product m n.

main = do
  args <- getArgs
    (m:n:_) = map (fst . head . readDec) args
    patterns = (take m . tails . concat . take 25 . repeat) "ab"
    haystack = (concat . take n . repeat) "ab"
  putStr $ show (length (findMatches patterns haystack))

Leave a Reply

Your email address will not be published. Required fields are marked *