Archive for the ‘Software’ Category

Duplicating ggplot axis labels

Wednesday, July 21st, 2010

I’ve been trying for a while to find an elegant solution for duplicating axis ticks and labels in a ggplot chart. Hadley replied on the ggplot2 mailing list, but a working solution within ggplot2 seems a way off.

The situation is this: imagine you have a faceted plot that is tall enough that the x-axis ticks and labels become obscured (e.g. when using a clipped viewport such as a browser window). This is particularly destructive when you’re using an x-scale with manual breaks or a transformation.

library(ggplot2)
g <- ggplot(diamonds, aes(carat, ..density..)) +
   geom_histogram(aes(fill = clarity), binwidth = 0.2) +
   facet_grid(cut ~ .)
print(g)

Faceted Plot where the x-axis labels have been clipped out

There simply isn’t a way to repeat the x-axis labels in ggplot2 at the moment without discarding faceting and rendering each facet as a separate ggplot call. I’ve seen some examples of selective plotting used to good effect in combining multiple plots with common elements, but I can’t find anything applicable to keep consistent scales and binning without duplicating a lot of the (internal) facet and bin logic.

(more…)

Aho-Corasick string matching in Haskell

Sunday, July 18th, 2010

The Aho-Corasick string matching algorithm constructs an automaton for matching a dictionary of patterns. When applied to an input string, the automaton’s time complexity is linear in the length of the input, plus the number of matches (so at worst quadratic in the input). It’s been around since 1975, but it isn’t implemented in the Haskell stringsearch library and I couldn’t even find a general trie data structure from google. So I implemented the Aho-Corasick algorithm myself: take a look at the full Aho-Corasick module.

There was an interesting paper on deriving the algorithm as a result of applying fully-lazy evaluation and memoization on a more naive algorithm. Unfortunately, applying fully-lazy evaluation and memoization to a function in Haskell is non-trivial (despite it being theoretically possible for the compiler to do so!).

It’s always interesting trying to find the functional equivalent to an imperative algorithm. I ended up using some cute Haskell tricks.

Update: I’ve written an improved version of Aho-Corasick implemented with Data.Array and Data.Map

(more…)

R and LaTeX PDF graphics

Monday, May 17th, 2010

When writing a document in LaTeX that makes use of figures from R, I want to produce a PDF with

  • vector graphics,
  • consistent fonts,
  • not to mess around overlaying text in LaTeX,

and maybe typeset math in the R graphics. This post surveys the state of the art in how to achieve the best of all worlds when importing graphics generated by R into documents typeset to PDF with LaTeX. I look at postscript and PDF figures generated by R’s X11, Cairo, and finally the new (and awesome) TikZ devices.

(more…)

Debian almquist shell on Mac OS X

Tuesday, May 4th, 2010

I’ve recently changed the default shell /bin/sh on my Leopard install to dash from bash. Why would I do this? Well, by bash’s own admission (see BUGS in its manpage), bash is “too big and too slow”. dash is used for /bin/sh by default on Ubuntu nowadays, and it’s a goal for Debian (supposedly for Lenny, but I can’t see it noted in the release notes and it’s only “confirmed” in the Lenny goals). dash is significantly smaller and faster.

Ubuntu and Debian ran into plenty of bashisms when they tried to change, how about Mac OS X? So far I’ve found problems with /usr/libexec/path_helper (I just changed the shebang line to #!/bin/bash). There was also a problem with X11′s startx, and my patch was quickly committed.

Interestingly enough, when I went to move the sh binary, there are actually two versions of bash in /bin. Both report

GNU bash, version 3.2.17(1)-release (i386-apple-darwin9.0)

but differ at the binary level (they aren’t even the same size). I wonder if Apple tried to optimize their /bin/sh given that it gets more usage.

I built dash-0.5.5.1 from the tarball without a problem.