Improved Aho-Corasick in Haskell

I was disappointed at the poor performance of my last attempt at string matching in Haskell, so I decided to rewrite it using the high-performance Data.Map and Data.Array data structures. This makes the semantics a lot closer to an imperative C implementation. The result isn’t so elegant, but it is predictable due to guarantees on the asymptotic performance of the array and map functions. It’s also fast.

This highlights one of the big problems in writing fast Haskell: the elegant, recursive, list-based style you learnt is unpredictable in terms of time complexity. You know not to use ++ where cons would do, and more generally aim for tail recursion; but when you need to decide what should be boxed/unboxed/strict/lazy it’s very frustrating. Without being very familiar with what reduction strategy the compiler used, it was unclear what was actually going on with my previous code until I pulled out the profiler.

Continue reading “Improved Aho-Corasick in Haskell”

Aho-Corasick string matching in Haskell

The Aho-Corasick string matching algorithm constructs an automaton for matching a dictionary of patterns. When applied to an input string, the automaton’s time complexity is linear in the length of the input, plus the number of matches (so at worst quadratic in the input). It’s been around since 1975, but it isn’t implemented in the Haskell stringsearch library and I couldn’t even find a general trie data structure from google. So I implemented the Aho-Corasick algorithm myself: take a look at the full Aho-Corasick module.

There was an interesting paper on deriving the algorithm as a result of applying fully-lazy evaluation and memoization on a more naive algorithm. Unfortunately, applying fully-lazy evaluation and memoization to a function in Haskell is non-trivial (despite it being theoretically possible for the compiler to do so!).

It’s always interesting trying to find the functional equivalent to an imperative algorithm. I ended up using some cute Haskell tricks.

Update: I’ve written an improved version of Aho-Corasick implemented with Data.Array and Data.Map

Continue reading “Aho-Corasick string matching in Haskell”

Muxic Beta 2, XMMS2 0.6 Macport

I finally got around to updating Muxic to the new xmms2-0.6DrMattDestruction API. There were a few changes to the xmms2 clientlib that needed attention.

While I was doing this, I noticed that the xmms2 CF clientlib isn’t just superficially unusable, but will rev up your CPU something awful if the xmms2d drops its connection after registration of the clientlib CF runloop event source. It needs a shutdown function, and that shutdown function needs some connection-specific CF variables from the init function. Hence we can’t use the API in its current state (unless there’s some way to inject mainloop-specific state into the xmmsc_connection_t that I don’t know about).

I’ve reported it and presented a patch as XMMS2 bug #2232, but essentially it means that the libxmmsclient-cf that currently comes with xmms2 is very bugged.

Update: patch accepted into xmms2-devel.git.

Since macports haven’t updated their port of xmms2, here is my xmms2 @0.6DrMattDestruction portfile. I would have liked to support variants, but whenever I add my own –with-optionals or –with-plugins options to waf configure, the build breaks with link errors. This happens even when I copy the exact list that waf configure automagically selects without arguments. I haven’t figured out why this is, and trying to trace the internals of waf is a bitch.

Anyway, presenting Muxic Beta 2, source code available in the Muxic git repository as usual.