Search engine optimization with git web interfaces

I recently became frustrated with gitweb’s funky query strings and decided to give cgit a try. Although there are some patches that make gitweb more user (and search engine) friendly, cgit is a much better web interface for git, both in terms of the code and the actual user experience. However, it still left some room for search engine optimization (SEO).

I went through the HTML suggestions from Google Webmaster Tools and Google’s own SEO Starter Guide. I’ve pushed the search-engine-optimized cgit to my seo branch on GitHub. You can see it in action at my git repositories. I’m testing all of this using an Apache ScriptAlias directive; I’m hoping it will still work with whatever other URL-processing schemes cgit supports. A short summary of the new SEO features so far:

  • Use HTML h1 and h2 heading tags instead of custom-styled divs
  • Much better title tags; commits have the commit subject, and the repo name has been added in a lot of places to avoid duplicate titles
  • The bread-crumb has been integrated into the heading
  • A configurable option to set nofollow relationships on links to non-HEAD commits, to avoid duplicate content being indexed
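
To make the title and nofollow ideas above concrete, here is a minimal Python sketch. It is not cgit’s code (cgit is written in C); the function names, the `nofollow_old_commits` switch and the title layout are all hypothetical, chosen only to illustrate the behaviour described in the list.

```python
from html import escape


def page_title(repo, commit_subject=None):
    """Build a title that leads with the commit subject and always names the repo.

    Hypothetical layout: "<subject> - <repo>", or just "<repo>" for repo pages.
    """
    if commit_subject:
        return f"{commit_subject} - {repo}"
    return repo


def commit_link(repo, sha, subject, head_sha, nofollow_old_commits=True):
    """Render a link to a commit, marking non-HEAD commits rel="nofollow".

    `nofollow_old_commits` stands in for the configurable option; the real
    option name in the seo branch is not given in the post.
    """
    rel = ""
    if nofollow_old_commits and sha != head_sha:
        rel = ' rel="nofollow"'
    href = f"/{escape(repo)}/commit/?id={escape(sha)}"
    return f'<a href="{href}"{rel}>{escape(subject)}</a>'
```

With the switch on, only the link to the HEAD commit is left followable, so a crawler is steered away from the many near-identical pages for older commits.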

Of course, you could take the popular option of just using GitHub instead of self-hosting your own git web interface… but even they don’t do quite as good a job IMO: they use the SHA1 in the web page titles, eww!

dudders and reliable DNS zone updates

I’ve released a new version of dudders, 1.04, and finally submitted it as a package to OpenWRT. Prompted by an email correspondence with Peter Holik, this release focuses on making the update more robust to network failure. I am of the opinion that DNS UPDATE is a strong candidate for using TCP by default (along with zone transfers).

In RFC 1123 it is stipulated that:

a DNS resolver or server that is sending a non-zone-transfer query MUST send a UDP query first.

However, if you are doing a DNS UPDATE you really want the reliability that TCP offers, even if you don’t expect truncation to be an issue. The update is sent to the relevant authority server, so the arguments about load on root servers in the RFC aren’t applicable.

I’ve made the UDP implementation retry by default, but I think if you need more than 2 retries, you should consider using TCP with its (much more advanced) retransmission algorithms.
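
The retry policy can be sketched in Python as follows. This is not how dudders (a C program) is implemented: the automatic fall-back to TCP, the function names and the timeout values are assumptions made for illustration. The only protocol detail relied on is that DNS over TCP prefixes each message with a two-byte length.

```python
import socket
import struct


def udp_exchange(message, server, port=53, timeout=2.0):
    """Send one DNS message over UDP and wait for a single reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(message, (server, port))
        reply, _addr = sock.recvfrom(4096)
        return reply


def _recv_exact(sock, count):
    """Read exactly `count` bytes from a stream socket."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed mid-message")
        buf += chunk
    return buf


def tcp_exchange(message, server, port=53, timeout=5.0):
    """Send one DNS message over TCP, using the two-byte length prefix."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(message)) + message)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def send_update(message, server, udp_retries=2,
                udp=udp_exchange, tcp=tcp_exchange):
    """Try UDP once plus `udp_retries` retries, then fall back to TCP.

    The TCP fall-back is an assumption of this sketch, not dudders' behaviour.
    """
    for _attempt in range(1 + udp_retries):
        try:
            return udp(message, server)
        except socket.timeout:
            continue  # datagram or reply lost; try again
    return tcp(message, server)
```

The transports are passed in as parameters so the policy can be exercised without a network, for example with stand-ins that always time out.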

Peter also found a bug in glibc’s res_send (actually in its send_dg function) whereby the resolver interprets the absence of the DNS “recursion available” flag in the response header as an error. However, that flag isn’t even meaningful for DNS UPDATE responses: in the UPDATE message header its bit position falls within the Z field, which, according to RFC 2136:

Should be zero (0) in all requests and responses. A non-zero Z field should be ignored by implementations of this specification.

As a result, glibc was setting errno to ECONNREFUSED or ETIMEDOUT even when the update was successful. I’ve hacked dudders to double-check after res_send, but it’s making me question the wisdom of using res_send at all, given that I’m constantly working around it.
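
One way to double-check a reply is to ignore errno and look at the response header itself. The post does not say how dudders performs its check, so the Python sketch below is an assumption, and `update_succeeded` is a hypothetical name. It relies only on the header layout from RFC 2136: a 16-bit ID followed by a 16-bit word holding QR (1 bit), Opcode (4 bits), Z (7 bits) and RCODE (4 bits), with opcode 5 meaning UPDATE.

```python
import struct

OPCODE_UPDATE = 5   # opcode assigned to UPDATE by RFC 2136
RCODE_NOERROR = 0


def update_succeeded(response, expected_id):
    """Judge a DNS UPDATE reply from its header alone.

    The seven Z bits (which cover the position of the ordinary
    "recursion available" flag) are deliberately ignored.
    """
    if len(response) < 12:           # a DNS header is 12 bytes long
        return False
    msg_id, flags = struct.unpack("!HH", response[:4])
    is_response = bool(flags & 0x8000)
    opcode = (flags >> 11) & 0xF
    rcode = flags & 0xF
    return (msg_id == expected_id
            and is_response
            and opcode == OPCODE_UPDATE
            and rcode == RCODE_NOERROR)
```

A reply with QR set, opcode 5, RCODE 0 and all Z bits clear is accepted here, which is exactly the case that the resolver was reporting as a failure.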

Update: submitted glibc bug report #11950

To get dudders-1.04 on OpenWRT, simply update the official package feed and select dudders from the Net > DNS > dudders menu in the buildroot config. For systems other than OpenWRT, you can grab the source from SourceForge, or even GitHub.

Improved Aho-Corasick in Haskell

I was disappointed at the poor performance of my last attempt at string matching in Haskell, so I decided to rewrite it using the high-performance Data.Map and Data.Array data structures. This makes the semantics a lot closer to an imperative C implementation. The result isn’t so elegant, but it is predictable due to guarantees on the asymptotic performance of the array and map functions. It’s also fast.

This highlights one of the big problems in writing fast Haskell: the elegant, recursive, list-based style you learnt is unpredictable in terms of time complexity. You know not to use ++ where cons would do, and more generally to aim for tail recursion; but when you need to decide what should be boxed, unboxed, strict or lazy, it’s very frustrating. Because I wasn’t very familiar with the reduction strategy the compiler used, it was unclear what was actually going on in my previous code until I pulled out the profiler.
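
For readers who want to see the array-and-map shape of the algorithm, here is a compact Python sketch of Aho-Corasick. It is not the Haskell code from the post: Python lists indexed by state number stand in for Data.Array, dicts stand in for Data.Map, and the function names are my own. Patterns are assumed to be non-empty.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables indexed by state number."""
    goto = [{}]   # goto[state][char] -> next state (the trie edges)
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = nxt
            state = nxt
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # derive their failure link from their parent's.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every match, in order of match end."""
    goto, fail, out = build_automaton(patterns)
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

For example, `find_all("ushers", ["he", "she", "his", "hers"])` returns `[(1, "she"), (2, "he"), (2, "hers")]`. Each table lookup has a known cost, which is the predictability the rewrite was after.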


XHTML fixes for the WordPress reCAPTCHA plugin

The wp-recaptcha plugin for WordPress breaks when you’re serving pages as application/xhtml+xml. I inadvertently broke comments when I installed it (silly me for not testing!). I’ve written a patch that fixes it.

The default JavaScript API uses document.write, which isn’t part of the core DOM and hence isn’t available on documents parsed as true XML. It’s not a new issue either: wp-recaptcha has a history of breaking XHTML. The thing is, the WordPress plugin (which uses the PHP library by the reCAPTCHA people) has an option to “Be XHTML 1.0 Strict compliant”, but this only fixes the use of iframes and noscript!

Update: pushed fork to GitHub