I was disappointed at the poor performance of my last attempt at string matching in Haskell, so I decided to rewrite it using the high-performance Data.Map
and Data.Array
data structures. This makes the semantics a lot closer to an imperative C implementation. The result isn’t so elegant, but it is predictable due to guarantees on the asymptotic performance of the array and map functions. It’s also fast.
This highlights one of the big problems in writing fast Haskell: the elegant, recursive, list-based style you learnt is unpredictable in terms of time complexity. You know not to use ++
where cons would do, and more generally aim for tail recursion; but when you need to decide what should be boxed/unboxed/strict/lazy it’s very frustrating. Without being very familiar with what reduction strategy the compiler used, it was unclear what was actually going on with my previous code until I pulled out the profiler.
My approach constructs a trie as before, this time using a Data.Map
instead of partial application for the goto function. There’s quite an ugly fold here to make sure all the nodes get labeled with ids (which are essentially just pointers for when the trie gets flattened into a Data.Array
).
mkTrie :: Int -> String -> [String] -> Trie -> Trie mkTrie i prefix xs f = case (null prefix, null (head xs)) of (True, _) -> Root (nextid bs) (Map.fromList bs) (_, True) -> Output (if null bs then i else nextid bs) branches f prefix (_, False) -> Trie (nextid bs) branches f where nextf = if null prefix then const else failTo nextid = (+) 1 . nodeid . snd . head -- q maps groups to an assoc list assigning postordered ids q [] (c, ys) = (c, (mkTrie i (c:prefix) ys (nextf f c))) : [] q ts (c, ys) = (c, (mkTrie (nextid ts) (c:prefix) ys (nextf f c))) : ts branches = Map.fromList bs bs = (foldl q [] . map (liftM2 (,) (head . head) (map tail)) . groupBy ((. head) . (==) . head) . filter (not.null)) xs
Check out the full AhoCorasick module.