The Aho-Corasick string matching algorithm constructs an automaton for matching a dictionary of patterns. When applied to an input string, the automaton’s time complexity is linear in the length of the input, plus the number of matches (so at worst quadratic in the input). It’s been around since 1975, but it isn’t implemented in the Haskell stringsearch library and I couldn’t even find a general trie data structure from google. So I implemented the Aho-Corasick algorithm myself: take a look at the full Aho-Corasick module.
There was an interesting paper on deriving the algorithm as a result of applying fully-lazy evaluation and memoization on a more naive algorithm. Unfortunately, applying fully-lazy evaluation and memoization to a function in Haskell is non-trivial (despite it being theoretically possible for the compiler to do so!).
It’s always interesting trying to find the functional equivalent to an imperative algorithm. I ended up using some cute Haskell tricks.
Update: I’ve written an improved version of Aho-Corasick implemented with Data.Array and Data.Map
Instead of a BFS to compute the failure function, I propagate a recursive function forward as the trie is constructed. The separate mkRoot
provides the base case with which to tie-the-knot.
mkRoot xs = let root = Root (edge [] (sort xs) root) in root mkTrie prefix f xs = Node goto prefix ((not.null) self) f where goto = edge prefix kids =<< (failTo f) (self, kids) = if null (head xs) then ([head xs], tail xs) else ([], xs)
Instead of using a list to implement the branches of a rose tree, I used partial-application over edge
. This certainly looks elegant, but in fact it is the weak point, as withPrefix
is a linear search; the imperative approach is an O(1) lookup (with small alphabets) or O(log m) over m branches. Furthermore, the lazy evaluation of edge
means that the trie is being constantly reconstructed as it is traversed by the automaton.
data Trie = Node (Char -> Maybe Trie) String Bool Trie | Root (Char -> Maybe Trie) edge :: String -> [String] -> Trie -> Char -> Maybe Trie edge prefix xs f c = if null (withPrefix c) then Nothing else Just (mkTrie (c:prefix) f (map tail (withPrefix c))) where withPrefix c = takeWhile ((c==) . head) . dropWhile ((c>) . head) $ xs
Obviously it’s not generic over types or anything, but it should work fine with lists of types other than Char
.
The following pathological case didn’t run too badly (25 seconds for m=50, n=100000 on a 2.16 GHz Core 2 Duo Macbook Pro, compiled with ghc -O2
). Profiling it revealed 20 million entries into edge
; which easily dominates the timing. Oddly enough this just seems to be a large constant—other samples suggest it’s linear in the product m n.
main = do args <- getArgs let (m:n:_) = map (fst . head . readDec) args patterns = (take m . tails . concat . take 25 . repeat) "ab" haystack = (concat . take n . repeat) "ab" putStr $ show (length (findMatches patterns haystack))