Aho-Corasick string matching in Haskell

The Aho-Corasick string matching algorithm constructs an automaton for matching a dictionary of patterns. When applied to an input string, the automaton runs in time linear in the length of the input plus the number of matches (so at worst quadratic in the input). It’s been around since 1975, but it isn’t implemented in the Haskell stringsearch library, and I couldn’t even find a general trie data structure through Google. So I implemented the Aho-Corasick algorithm myself: take a look at the full Aho-Corasick module.

There is an interesting paper on deriving the algorithm by applying fully-lazy evaluation and memoization to a more naive algorithm. Unfortunately, applying fully-lazy evaluation and memoization to a function in Haskell is non-trivial (despite it being theoretically possible for the compiler to do so!).

It’s always interesting trying to find the functional equivalent to an imperative algorithm. I ended up using some cute Haskell tricks.

Update: I’ve written an improved version of Aho-Corasick, implemented with Data.Array and Data.Map.


R and LaTeX PDF graphics

When writing a document in LaTeX that makes use of figures from R, I want to produce a PDF with

  • vector graphics,
  • consistent fonts,
  • no messing around overlaying text in LaTeX,

and maybe typeset math in the R graphics. This post surveys the state of the art for getting the best of all worlds when importing graphics generated by R into documents typeset to PDF with LaTeX. I look at PostScript and PDF figures generated by R’s X11, Cairo, and finally the new (and awesome) TikZ devices.


Debian Almquist shell on Mac OS X

I’ve recently changed the default shell /bin/sh on my Leopard install from bash to dash. Why would I do this? Well, by bash’s own admission (see BUGS in its manpage), bash is “too big and too slow”. dash is used for /bin/sh by default on Ubuntu nowadays, and it’s a goal for Debian (supposedly for Lenny, but I can’t see it noted in the release notes and it’s only “confirmed” in the Lenny goals). dash is significantly smaller and faster.

Ubuntu and Debian ran into plenty of bashisms when they tried to change, so how about Mac OS X? So far I’ve found problems with /usr/libexec/path_helper (I just changed the shebang line to #!/bin/bash). There was also a problem with X11’s startx, and my patch was quickly committed.
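
For readers who haven’t met a bashism before, here is a minimal sketch of a typical one and a POSIX replacement that dash accepts. It is only an illustration: it is not the actual problem in path_helper or startx, and the function name contains is made up for the example.

```shell
#!/bin/sh
# Illustrative only; not taken from path_helper or startx.
#
# A common bashism is the [[ ... ]] test with glob matching:
#     if [[ $haystack == *"$needle"* ]]; then ...
# dash does not implement [[ ]], so a script using it breaks
# once /bin/sh is no longer bash.

# contains HAYSTACK NEEDLE (hypothetical helper)
# Succeeds if NEEDLE occurs somewhere in HAYSTACK, using only
# POSIX case pattern matching. Quoting "$2" makes it match literally.
contains() {
    case $1 in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example use:
#     if contains "$PATH" "/usr/local/bin"; then echo "already in PATH"; fi
```

The other option, which is what I did for path_helper, is to leave the script alone and point its shebang line at /bin/bash.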

Interestingly enough, when I went to move the sh binary, I found that there are actually two copies of bash in /bin. Both report

GNU bash, version 3.2.17(1)-release (i386-apple-darwin9.0)

but differ at the binary level (they aren’t even the same size). I wonder if Apple tried to optimize their /bin/sh given that it gets more usage.
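
A comparison like this can be done with standard tools. The sketch below is one way to do it in plain POSIX sh; the helper names same_file and file_size are made up for the example, and I’m assuming the two copies are /bin/sh and /bin/bash.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two binaries; POSIX sh only,
# so they work under dash as well as bash.

# same_file A B
# Succeeds when A and B are byte-for-byte identical.
# cmp -s prints nothing and reports the result via its exit status.
same_file() {
    cmp -s "$1" "$2"
}

# file_size FILE
# Prints the size of FILE in bytes. Some wc implementations pad the
# number with spaces, so strip them.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Example use (assumes the two copies are /bin/sh and /bin/bash):
#     same_file /bin/sh /bin/bash ||
#         echo "differ: $(file_size /bin/sh) vs $(file_size /bin/bash) bytes"
```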

I built dash-0.5.5.1 from the tarball without a problem.