A Site Indexer for Small Websites

So recently with one of my projects I needed a super simple website search facility. I didn’t want to go with Google because the integration is a bit ugly (and then there are the adverts). I didn’t want to go with Lucene or htDig because they were too heavy for my needs. I also didn’t want to use KinoSearch or Plucene because whilst they’re both great projects they were still over-the-top for what I needed to do.

So I decided to do what all bad programmers do – write my own. How hard can it be?

Well, the first thing you need is an indexer. The indexer downloads pages from the target website, scans them for links (which it adds to its queue) and processes the textual content of each page for keywords, which it flattens out and stores in an inverted index – one which makes it fast to go from a word to the URLs it appears on. Along the way it can also store some context about each word, such as where it sits on the page, whether it’s bold or in a heading, and which words appear near it or link to it.
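
To put it another way, what falls out of the far end is little more than a nested map from word to page to score and context. A sketch of the idea (illustration only – the real thing below uses a database table rather than an in-memory hash):

my $index = {
  'giraffe' => {
    '/animals/african.html' => { score => 4, context => q[... the giraffe is the tallest ...] },
    '/zoo/visit.html'       => { score => 1, context => q[... feed a giraffe at noon ...]     },
  },
};

#########
# lookup is then a straight hash access, ranked by score
#
my $hits = $index->{'giraffe'} || {};
for my $page (sort { $hits->{$b}->{score} <=> $hits->{$a}->{score} } keys %{$hits}) {
  print "$page ($hits->{$page}->{score})\n";
}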

So the first port of call is http://search.cpan.org/ where I found Scrappy from Al Newkirk – an automated crawler with callback handlers for different HTML tags. Awesome! This would sort out all of the web fetching and HTML processing I had to do.

So I started with this:

crawl $starting_url, {
		      'a'  => \&process_url,
		      '/*' => \&process_page,
		     };

The function process_url() takes care of only processing links we like – the right sort of protocols (no mailto, file, ftp, gopher, wais, archie etc.) and the right sorts of files (don’t want to do much with images, css, javascript etc).

sub process_url {
  my $url = shift->href;
  $url    =~ s{\#.*?$}{}smx;

  if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
    #########
    # unsupported protocol
    #
    return;
  }

  if($url !~ m{^$starting_url}) {
    #########
    # external website
    #
    return;
  }

  if($url =~ m{pdf}smix) {
    #########
    # special handling for PDFs
    #
    push @{$pdfs}, $url;
    return;
  }

  if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
    #########
    # unsupported filetype
    # todo: queue for independent processing
    #
    return;
  }

  queue $url;
}

You’ll notice that if we find a PDF we save it out for later, but generally “normal” HTML links to HTML pages will be added to Scrappy’s queue.

process_page() then looks a little like this:

sub process_page {
  my $page    = page;

  print "processing $page\n";
  my $html    = shift->html;
  my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
  my $text    = $html;
  $text       =~ s{<script.*?</script>}{}smxig;
  $text       =~ s{<[^>]+>}{ }smxg;

  return process_text($page, \$text, \$title);
}

This does some pretty ugly stuff – pull out the title, remove all script blocks because they’re particularly unpleasant, then attempt to strip out the remaining tags using a pattern match. I suppose this could be compared to brain surgery with a shovel – highly delicate!
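
If you’d rather not do surgery with a shovel, a proper parser will happily strip the tags for you. Here’s a sketch using HTML::TreeBuilder – not what the indexer below actually uses, just a drop-in alternative to the regex hackery above:

use HTML::TreeBuilder;

sub html_to_text {
  my ($html) = @_;
  my $tree   = HTML::TreeBuilder->new_from_content($html);

  #########
  # drop script and style elements entirely, keep the copy
  #
  $_->delete for $tree->find('script'), $tree->find('style');

  my ($title) = $tree->find('title');
  my $text    = $tree->as_text;
  my $tstr    = $title ? $title->as_text : q[];

  $tree->delete; # HTML::Element trees want explicit cleanup
  return ($tstr, $text);
}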

The bulk of the text processing is broken out into a separate process_text() function so it can be reused for the PDF processing I want to do later on. It looks something like this:

sub process_text {
  my ($page, $text_ref, $title_ref) = @_;

  $page =~ s{$starting_url}{}smx;

  if($page !~ m{^/}smx) {
    $page = "/$page";
  }

  cleanup_entities($text_ref);
  cleanup_entities($title_ref);

  my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
              grep { length $_ > 3 }     # longer than three characters
	      map  { s{\s+}{}smxg; $_ }  # trim spaces
	      map  { lc }                # store in lowercase
	      split /\b/smx,             # split on word boundary
	      ${$text_ref};

  for my $word (@words) {
    my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];

    if(!defined $score) {
      my ($match) = ${$text_ref} =~ /($word)/smix;
      my $before  = substr ${$text_ref}, 0, $-[0];
      my $after   = substr ${$text_ref}, $+[0];
      $after      =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
      $before     = reverse $before; # reverse the string to limit backtracking.
      $before     =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
      $before     = reverse $before;

      my $context = "$before $match $after"; # use $match here instead of $word to retain case
      $context    =~ s/\s+/ /smxg; # switch many spaces for one
      $context    =~ s/^\s+//smxg; # strip leading space
      $context    =~ s/\s+$//smxg; # strip trailing space
      $dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);

    } else {
      $dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
    }
  }

  $dbh->commit;
  return 1;
}

Firstly this routine makes the incoming URL relative as it’s only indexing one site. Next it attempts to munge high characters into html entities using the same sort of shovel-brute-force approach as before. Next comes something a little smarter – split all words, lowercase them, trim them up and filter them to anything alphanumeric and longer than 3 letters.

Then for each of these words, generate a context string of the ten words before and after, and store it in a SQLite (oh, I didn’t mention that, did I?) database together with a count of that word’s occurrences in this URL/document.
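
That schema means the eventual search boils down to a single SELECT, ranked naively by score. A sketch of the lookup ($search_term here is hypothetical – whatever the user typed; the proper search script is for another post):

my $hits = $dbh->selectall_arrayref(
  q[SELECT page, title, context, score
      FROM idx
     WHERE word = ?
  ORDER BY score DESC
     LIMIT 20], {}, lc $search_term);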

Cool! So I now have a SQLite database containing URL, word, wordcount (score) and context. That’s pretty much all I need for a small website search… except, oh. Those dratted PDFs… Okay, here’s process_pdf().

sub process_pdf {
  my ($page) = @_;
  my $ua     = LWP::UserAgent->new;

  print "processing $page\n";

  $ua->agent('indexer');
  my $response = $ua->get($page);

  if (!$response->is_success) {
    carp $response->status_line;
    return;
  }

  my $tmp      = File::Temp->new;
  my $filename = sprintf q[%s.pdf], $tmp->filename;
  eval {
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";
    1;
  } or do {
    carp $EVAL_ERROR;
  };

  my $pdf              = CAM::PDF->new($filename);
  my $npages           = $pdf->numPages();
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  for my $pagenum (1..$npages) {
    my $str = $pdf->getPageText($pagenum);
    process_text($page, \$str, \$title);
  }

  unlink $filename;

  return 1;
}

As you can see, we’re fetching the URL using a new LWP::UserAgent and saving it to a new temporary file – fairly run-of-the-mill stuff. Then the magic happens: we ask CAM::PDF to process the file and tell us how many pages it has, then iterate over those pages, ripping out the textual content and throwing it back over to our old process_text routine from earlier. Bingo! As a good friend of mine says, “Job’s a good ‘un”.

We’ll see how the site search script works in another post. Here’s the whole code (no, I know it’s not particularly good, especially that atrocious entity handling). Of course it’s fairly easy to extend for other document types too – I’d like to add the various (Open)Office formats, but it depends what’s supported out of the box from CPAN.

#!/usr/local/bin/perl -T
#########
# Author: rmp@psyphi.net
# Created: 2011-02-04
# Perl Artistic License
#
use strict;
use warnings;
use Scrappy qw(:syntax);
use HTML::Entities;
use DBI;
use Carp;
use Getopt::Long;
use English qw(-no_match_vars);
use LWP::UserAgent;
use File::Temp qw(tempfile);
use CAM::PDF;
use CGI qw(unescape);

my $opts = {};
GetOptions($opts, qw(url=s dbname=s help)) or croak $ERRNO;

if(!$opts->{url} || !$opts->{dbname} || $opts->{help}) {
  print <<EOT;
indexer -url=http://website.to.crawl/ -dbname=database.sqlite [-help]
EOT
  exit;
}

my $pdfs         = [];
my $starting_url = $opts->{url};
my ($dbname)     = $opts->{dbname} =~ m{([a-z\d_/.\-]+)}smix;
my $dbh          = DBI->connect(qq[DBI:SQLite:dbname=$dbname],q[],q[],{
								       RaiseError => 1,
								       AutoCommit => 0,
								      });
eval {
  $dbh->do(q[drop table idx]); # todo: build tmp index and rename
};
$dbh->do(q[create table idx(word char(32), page char(255), title char(64), context char(64), score int)]);

crawl $starting_url, {
		      'a'  => \&process_url,
		      '/*' => \&process_page,
		     };

process_pdfs($pdfs);

sub process_pdfs {
  my ($pdfs) = @_;

  for my $pdf (@{$pdfs}) {
    eval {
      process_pdf($pdf);
      1;
    } or do {
      carp $EVAL_ERROR;
    };
  }
  return 1;
}

sub cleanup_entities {
  my ($str) = @_;

  if(!defined ${$str}) {
    ${$str} = q[];
  }

  ${$str} = encode_entities(${$str});
  ${$str} =~ s/&nbsp;/ /smxg; # swap non-breaking spaces for plain ones
  ${$str} = decode_entities(${$str});
  ${$str} = decode_entities(${$str});

  ${$str} =~ s{[^\x20-\xff]}{ }smxig;
  ${$str} =~ s{\s+}{ }smxg;

  return 1;
}

sub process_url {
  my $url = shift->href;
  $url    =~ s{\#.*?$}{}smx;

  if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
    #########
    # unsupported protocol
    #
    return;
  }

  if($url !~ m{^$starting_url}) {
    #########
    # external website
    #
    return;
  }

  if($url =~ m{pdf}smix) {
    #########
    # special handling for PDFs
    #
    push @{$pdfs}, $url;
    return;
  }

  if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
    #########
    # unsupported filetype
    # todo: queue for independent processing
    #
    return;
  }

  queue $url;
}

sub process_page {
  my $page    = page;

  print "processing $page\n";
  my $html    = shift->html;
  my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
  my $text    = $html;
  $text       =~ s{<script.*?</script>}{}smxig;
  $text       =~ s{<[^>]+>}{ }smxg;

  return process_text($page, \$text, \$title);
}

sub process_text {
  my ($page, $text_ref, $title_ref) = @_;

  $page =~ s{$starting_url}{}smx;

  if($page !~ m{^/}smx) {
    $page = "/$page";
  }

  cleanup_entities($text_ref);
  cleanup_entities($title_ref);

  my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
              grep { length $_ > 3 }     # longer than three characters
	      map  { s{\s+}{}smxg; $_ }  # trim spaces
	      map  { lc }                # store in lowercase
	      split /\b/smx,             # split on word boundary
	      ${$text_ref};

  for my $word (@words) {
    my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];

    if(!defined $score) {
      my ($match) = ${$text_ref} =~ /($word)/smix;
      my $before  = substr ${$text_ref}, 0, $-[0];
      my $after   = substr ${$text_ref}, $+[0];
      $after      =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
      $before     = reverse $before; # reverse the string to limit backtracking.
      $before     =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
      $before     = reverse $before;

      my $context = "$before $match $after"; # use $match here instead of $word to retain case
      $context    =~ s/\s+/ /smxg;
      $context    =~ s/^\s+//smxg;
      $context    =~ s/\s+$//smxg;
      $dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);

    } else {
      $dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
    }
  }

  $dbh->commit;
  return 1;
}

sub process_pdf {
  my ($page) = @_;
  my $ua     = LWP::UserAgent->new;

  print "processing $page\n";

  $ua->agent('indexer');
  my $response = $ua->get($page);

  if (!$response->is_success) {
    carp $response->status_line;
    return;
  }

  my $tmp      = File::Temp->new;
  my $filename = sprintf q[%s.pdf], $tmp->filename;
  eval {
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";
    1;
  } or do {
    carp $EVAL_ERROR;
  };

  my $pdf              = CAM::PDF->new($filename);
  my $npages           = $pdf->numPages();
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  for my $pagenum (1..$npages) {
    my $str = $pdf->getPageText($pagenum);
    process_text($page, \$str, \$title);
  }

  unlink $filename;

  return 1;
}


Please note: WordPress keeps eating various bits of the code – script tags, greater-than signs, ampersand entities and the like – so if you see weird coding errors it’s probably down to that.

Technostalgia

BBC Micro

Ahhhhh, Technostalgia. This evening I pulled out a box from the attic. It contained an instance of the first computer I ever used. A trusty BBC B+ Micro and a whole pile of mods to go with it. What a fabulous piece of kit. Robust workhorse, Econet local-area-networking built-in (but no modem, how forward-thinking!), and a plethora of expansion ports. My admiration of this hardware is difficult to quantify but I wasted years of my life learning how to hack about with it, both hardware and software.

The BBC Micro taught me in and out of the classroom. My primary school had one in each classroom and, though those might have been the ‘A’ or ‘B’ models, I distinctly remember one BBC Master somewhere in the school. They weren’t networked, but I remember spraining a thumb in the fourth year of primary school and being off sports for a few weeks. That’s when things really started happening. I taught myself procedural programming using LOGO. I was 10 – a late starter compared to some. I remember one open day the school borrowed (or dusted off) a turtle.

BBC Buggy (Turtle)

Brilliant fun, drawing ridiculous spirograph-style patterns on vast sheets of paper.

When I moved up to secondary school my eyes were opened properly. The computer lab was pretty good too. Networked computers. Fancy that! A network printer and a network fileserver the size of a… not sure what to compare it with – it was a pretty unique form-factor – about a metre long, 3/4 metre wide and about 20cm deep from memory (but I was small back then). Weighed a tonne. A couple of 10- or 20MB Winchesters in it from what I recall. I still have the master key for it somewhere! My school was in Cambridge and had a couple of part-time IT teacher/administrators who seemed to be on loan from SJ Research. Our school was very lucky in that regard – we were used as a test-bed for a bunch of network things from SJ Research, as far as I know a relative of Acorn. Fantastic kit only occasionally let down by the single, core network cable slung overhead between two buildings.

My first experience of Email was using the BBC. We had an internal mail system *POST which was retired after a while, roughly when ARBS left the school I think. I wrote my own MTA back then too, but in BASIC – I must have been about 15 at the time. For internet mail the school had signed up to use something called Interspan which I later realised must have been some sort of bridge to Fidonet or similar.

Teletext Adapter

We even had a networked teletext server which, when working, downloaded teletext pages to the LAN and was able to serve them to anyone who requested them. The OWUKWW – One-way-UK-wide-web! The Music department had a Music 5000 Synth which ran a language called Ample. Goodness knows how many times we played Axel-F on that. Software/computer-programmable keyboard synth – amazing.

Around the same time I started coding in 6502 and wrote some blisteringly fast conversions of simple games I’d earlier written in BASIC. I used to spend days drawing out custom characters on 8×8 squared exercise books. I probably still have them somewhere, in another box in the attic.

6502 coprocessor

Up until this point I’d been without a computer at home. My parents invested in our first home computer. The Atari ST. GEM was quite a leap from the BBC but I’d seen similar things using (I think) the additional co-processors – either the Z80- or the 6502 co-pro allowed you to run a sort of GEM desktop on the Beeb.

My memory is a bit hazy because then the school started throwing out the BBCs and bringing in the first Acorn Archimedes machines. Things of beauty! White, elegant, fast, hot, with a (still!) underappreciated operating system, high colour graphics, decent built-in audio and all sorts of other goodies. We had a Meteosat receiver hooked up to one in the geography department, pulling down WEFAX transmissions. I *still* haven’t got around to doing that at home, and I *still* want to!

Acorn A3000 Publicity Photo
Atari STE Turbo Pack

The ST failed pretty quickly and was replaced under warranty with an STE. Oh the horror – it was already incompatible with several games, but it had a Blitter chip ready to compete with those bloody Amiga zealots. Oh Babylon 5 was rendered on an Amiga. Sure, sure. But how many thousands of hit records had been written using Cubase or Steinberg on the Atari? MIDI – there was a thing. Most people now know MIDI as those annoying, never-quite-sounding-right music files which autoplay, unwarranted, on web pages where you can’t find the ‘mute’ button. Even that view is pretty dated.

Back then MIDI was a revolution. You could even network more than one Atari using it, as well as all your instruments of course. The STE was gradually treated to its fair share of upgrades – 4MB ram and a 100MB (SCSI, I think) hard disk, a “StereoBlaster” cartridge even gave it DSP capabilities for sampling. Awesome. I’m surprised it didn’t burn out from all the games my brothers and I played. I do remember wrecking *many* joysticks.

Like so many others I learned more assembler – 68000 this time – the same way I had on the BBC: typing out pages and pages of code from books and magazines, spending weeks trying to find the bugs I’d introduced, checking and re-checking code until deciding the book had typos. GFA Basic was our workhorse, though. My father had also started programming in GFA, and kept it up until about 10 years ago when the Atari was retired.

Then University. First term, first few weeks of first term. I blew my entire student grant, £1400 back then, on my first PC. Pentium 75, 8MB RAM, a 1GB disk and, very important back then, a CD-ROM drive. A Multimedia PC!
It came with Windows 3.11 for Workgroups but, with about six weeks of work, it was dual-booting with my first Linux install. Slackware.

That one process, installing Slackware Linux with only one book “Que: Introduction to UNIX” probably taught me more about the practicalities of modern operating systems than my entire 3-year BSc in Computer Science (though to be fair, almost no theory of course). I remember shuttling hundreds of floppy disks between my room in halls and the department and/or university computer centre. I also remember the roughly 5% corruption rate and having to figure out the differences between my lack of understanding and buggered files. To be perfectly honest things haven’t changed a huge amount since then. It’s still a daily battle between understanding and buggered files. At least packaging has improved (apt; rpm remains a backwards step but that’s another story) but basically everything’s grown faster. At least these days the urge to stencil-spray-paint my PC case is weaker.

So – how many computers have helped me learn my trade? Well since about 1992 there have been five of significant import. The BBC Micro; the Acorn Archimedes A3000; the Atari ST(E); the Pentium 75 and my first Apple Mac G4 powerbook. And I salute all of them. If only computers today were designed and built with such love and craft. *sniff*.

Required Viewing:

  • Micro Men
  • The Pirates of Silicon Valley

Adventures in UTF-8

I think I’m very nearly at the verge of beginning to understand UTF-8.

Internal UTF-8 string, encoded
Wrong:

sentinel:~ rmp$ perl -MHTML::Entities -e 'print encode_entities("°")'
&Acirc;&deg;

Right:

sentinel:~ rmp$ perl -Mutf8 -MHTML::Entities -e 'print encode_entities("°")'
&deg;

External UTF-8 input, encoded
Wrong:

sentinel:~ rmp$ echo "°" | perl -MHTML::Entities -e 'print encode_entities(<>)'
&Acirc;&deg;

Right:

sentinel:~ rmp$ echo "°" | perl -MHTML::Entities -e 'binmode STDIN, ":utf8"; print encode_entities(<>)'
&deg;

External UTF-8 string, as UTF-8 (unencoded)
Wrong:

sentinel:~ rmp$ echo "°" | perl -e 'binmode STDIN, ":utf8"; print <>'
?

Right:

sentinel:~ rmp$ echo "°" | perl -e 'binmode STDIN, ":utf8"; 
binmode STDOUT, ":utf8"; print <>'
°

External Input – Encoding after-the-fact
Wrong:

sentinel:~ rmp$ echo "°" | perl -Mutf8 -e '$in=<>; utf8::upgrade($in);
binmode STDOUT, ":utf8"; print $in'
°

Wrong:

sentinel:~ rmp$ echo "°" | perl -Mutf8 -e '$in=<>; utf8::encode($in);
binmode STDOUT, ":utf8"; print $in'
°

Wrong:

sentinel:~ rmp$ echo "°" | perl -Mutf8 -e '$in=<>; utf8::downgrade($in); 
binmode STDOUT, ":utf8"; print $in'
°

Right:

sentinel:~ rmp$ echo "°" | perl -Mutf8 -e '$in=<>; utf8::decode($in);
binmode STDOUT, ":utf8"; print $in'
°
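
All of which boils down to: decode on the way in, encode on the way out. For a full script rather than a one-liner I’d set the layers once with the open pragma – a minimal sketch, not one of the examples above:

#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8)); # UTF-8 layers on STDIN, STDOUT and STDERR
use HTML::Entities;

while (my $line = <STDIN>) {
  chomp $line;
  #########
  # input is already character data, so this yields &deg; rather than &Acirc;&deg;
  #
  print encode_entities($line), "\n";
}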

Generating MSCACHE & NTLM hashes using Perl

I’ve been doing a lot of tinkering recently whilst working on the revised rainbowcracklimited.com website. Naturally it uses Perl on the back end so I’ve had to find out how to make Windows-style hashes of various types using largely non-native means.

On the whole I’ve been able to make good use of the wealth of CPAN modules – Digest::MD4, Digest::MD5, Digest::SHA and Authen::Passphrase but for one reason and another I’ve wanted to find out how to make NTLM and MSCACHE hashes “by-hand”. It turns out this is pretty easy:

NTLM is just an MD4 digest of the password in Unicode – specifically UTF-16LE, 2-byte characters plus surrogates:

perl -M"Unicode::String latin1" -M"Digest::MD4 md4_hex" -e 'print md4_hex(latin1("cheese")->utf16le),"\n"'

MSCACHE is a little bit more fiddly as it also encodes the Unicode username:

perl -M"Unicode::String latin1" -M"Digest::MD4 md4_hex" -e 'print md4_hex(latin1("cheese")->utf16le . latin1(lc "Administrator")->utf16le),"\n"'

A Little SQL “EXPLAIN” example

http://www.flickr.com/photos/banyan_tree/2802823634/sizes/m/

Background: I have a big table called “channel” with a few hundred thousand rows in – nothing vast, but big enough to cause some queries to run slower than I want.

Today I was fixing something else and happened to run

show full processlist;

I noticed this taking too long:

SELECT payload FROM channel  LIMIT 398800,100;

This is a query used by some web-app paging code. Stupid really – payload isn’t indexed and there’s no use of any other keys in that query. Ok – how to improve it? First of all, see what EXPLAIN says:

mysql> explain SELECT payload FROM channel  LIMIT 398800,100;
+----+-------------+---------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows   | Extra |
+----+-------------+---------+------+---------------+------+---------+------+--------+-------+
|  1 | SIMPLE      | channel | ALL  | NULL          | NULL | NULL    | NULL | 721303 |       |
+----+-------------+---------+------+---------------+------+---------+------+--------+-------+
1 row in set (0.00 sec)

Ok. So the simplest way would be to limit a selection of id_channel (the primary key) then select payloads in that set. First I tried this:

SELECT payload
FROM channel
WHERE id_channel IN (
    SELECT id_channel FROM channel LIMIT 398800,100
);

Seems straightforward, right? No, not really.

ERROR 1235 (42000): This version of MySQL doesn't yet support
  'LIMIT & IN/ALL/ANY/SOME subquery'

Next!

Second attempt, using a temporary table, selecting and saving the id_channels I’m interested in then using those in the actual query:

CREATE TEMPORARY TABLE channel_tmp(
  id_channel BIGINT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=innodb;

INSERT INTO channel_tmp(id_channel)
  SELECT id_channel
  FROM channel LIMIT 398800,100;

SELECT payload
  FROM channel
  WHERE id_channel IN (
    SELECT id_channel FROM channel_tmp
  );

mysql> explain select id_channel from channel limit 398800,100;
+----+-------------+---------+-------+---------------+---------+---------+------+--------+-------------+
| id | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows   | Extra       |
+----+-------------+---------+-------+---------------+---------+---------+------+--------+-------------+
|  1 | SIMPLE      | channel | index | NULL          | PRIMARY | 8       | NULL | 722583 | Using index |
+----+-------------+---------+-------+---------------+---------+---------+------+--------+-------------+
1 row in set (0.00 sec)
mysql> explain select payload from channel where id_channel in (select id_channel from channel_tmp);
+----+--------------------+-------------+-----------------+---------------+---------+---------+------+--------+-------------+
| id | select_type        | table       | type            | possible_keys | key     | key_len | ref  | rows   | Extra       |
+----+--------------------+-------------+-----------------+---------------+---------+---------+------+--------+-------------+
|  1 | PRIMARY            | channel     | ALL             | NULL          | NULL    | NULL    | NULL | 722327 | Using where |
|  2 | DEPENDENT SUBQUERY | channel_tmp | unique_subquery | PRIMARY       | PRIMARY | 8       | func |      1 | Using index |
+----+--------------------+-------------+-----------------+---------------+---------+---------+------+--------+-------------+
2 rows in set (0.00 sec)

Let’s try a self-join, doing all of the above without explicitly making a temporary table. Self-joins can be pretty powerful – neat in the right places.

mysql> explain SELECT payload
  FROM channel c1,
       (SELECT id_channel FROM channel limit 398800,100) c2
  WHERE c1.id_channel=c2.id_channel;
+----+-------------+------------+--------+---------------+---------+---------+---------------+--------+-------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref           | rows   | Extra       |
+----+-------------+------------+--------+---------------+---------+---------+---------------+--------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL    | NULL    | NULL          |    100 |             |
|  1 | PRIMARY     | c1         | eq_ref | PRIMARY       | PRIMARY | 8       | c2.id_channel |      1 |             |
|  2 | DERIVED     | channel    | index  | NULL          | PRIMARY | 8       | NULL          | 721559 | Using index |
+----+-------------+------------+--------+---------------+---------+---------+---------------+--------+-------------+
3 rows in set (0.21 sec)

This pulls out the right rows and even works around MySQL’s unsupported “LIMIT in subquery” feature, but that id_channel selection in c2 still isn’t quite doing the right thing – I don’t like all those rows being scanned, even if they’re coming straight out of the primary key index.

A little bit of rudimentary benchmarking appears to suggest that the self-join is the fastest, followed by the original query at approximately one order of magnitude slower and trailing a long way behind at around another four-times slower than that, the temporary table. I’m not sure how or why the temporary table performance happens to be the slowest – perhaps down to storage access, or more likely my lack of understanding. Some time I might even try the in-memory table too for comparison.
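
For the record, back in the web-app’s paging code the self-join slots in easily enough via DBI. A sketch of how I’d wire it up ($offset and $per_page are hypothetical paging parameters, interpolated as forced integers because LIMIT can be fussy about quoted placeholders):

sub fetch_page {
  my ($dbh, $offset, $per_page) = @_;

  return $dbh->selectcol_arrayref(
    sprintf q[SELECT payload
                FROM channel c1,
                     (SELECT id_channel FROM channel LIMIT %d,%d) c2
               WHERE c1.id_channel = c2.id_channel],
    int $offset, int $per_page);
}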

An Interview Question

I’d like to share a basic interview question I’ve used in the past. I’ve used this in a number of different guises over the years, both at Sanger and at ONT but the (very small!) core remains the same. It still seems to be able to trip up a lot of people who sell themselves as senior developers on their CVs and demand £35k+ salaries.

You have a list of characters.

  1. Remove duplicates

The time taken for the interviewee to scratch their head determines whether they’re a Perl programmer, or at least think like one – this is an idiomatic question in Perl. It’s a fairly standard solution for anyone who uses hashes, maps or associative arrays in any language. It’s certainly a lot harder without them.

The answer I would expect to see would run something like this:

#########
# pass in an array ref of characters, e.g.
# remove_dupes([qw(a e r o i g n o s e w f e r g e r i g e o n k)]);
#
sub remove_dupes {
  my $chars_in  = shift;
  my $seen      = {};
  my $chars_out = [];

  for my $char (@{$chars_in}) {
    if(!$seen->{$char}++) {
      push @{$chars_out}, $char;
    }
  }

  return $chars_out;
}

Or for the more adventurous, using a string rather than an array:

#########
# pass in a string of characters, e.g.
# remove_dupes(q[uyavubnopwemgnisudhjopwenfbuihrpgbwogpnskbjugisjb]);
#
sub remove_dupes {
  my $str  = shift;
  my $seen = {};
  $str     =~ s/(.)/( !$seen->{$1}++ ) ? $1 : q[]/smegx;
  return $str;
}

The natural progression from Q1 then follows. It should be immediately obvious to the interviewee if they answered Q1 inappropriately.

  2. List duplicates

#########
# pass in an array ref of characters, e.g.
# list_dupes([qw(a e r o i g n o s e w f e r g e r i g e o n k)]);
#
sub list_dupes {
  my $chars_in  = shift;
  my $seen      = {};
  my $chars_out = [];

  for my $char (@{$chars_in}) {
    $seen->{$char}++;
  }

  return [ grep { $seen->{$_} > 1 } keys %{$seen} ];
}

and with a string

#########
# pass in a string of characters, e.g.
# list_dupes(q[uyavubnopwemgnisudhjopwenfbuihrpgbwogpnskbjugisjb]);
#
sub list_dupes {
  my $str  = shift;
  my $seen = {};
  $str     =~ s/(.)/( $seen->{$1}++ == 1 ) ? $1 : q[]/smegx; # keep each duplicated character exactly once
  return $str;
}

The standard follow-up is then “Given more time, what would you do to improve this?”. Well? What would you do? I know what I would do before I even started – WRITE SOME TESTS!

It’s pretty safe to assume that any communicative, personable candidate who starts off writing a test on the board will probably be head and shoulders above any other.
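
To put my money where my mouth is, here’s a minimal sketch of the sort of thing I’d want to see, using core Test::More against the array-ref versions above (list_dupes’ output gets sorted because hash key order isn’t guaranteed):

use strict;
use warnings;
use Test::More tests => 3;

is_deeply(remove_dupes([qw(a b a c b)]), [qw(a b c)], 'duplicates removed, first-seen order kept');
is_deeply(remove_dupes([]),              [],          'empty list handled');
is_deeply([sort @{ list_dupes([qw(a b a c b)]) }], [qw(a b)], 'each duplicate listed once');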

If I’m interviewing you tomorrow and you’re reading this now, it’s also safe to mention it. Interest in the subject and a working knowledge of the intertubes generally comes in handy for a web developer. I’m hiring you as an independent thinker!

On scaling data structures

http://www.flickr.com/photos/paraflyer/1063341049/sizes/m/

One concept, or data-structure paradigm if you will, which I’ve seen many, many times is the tree. Whatever sort of tree it is – binary or otherwise, it’s a hierarchy of some sort.

Tree structures are very common in everyday life. Unfortunately, in most programming languages they’re pretty clunky to code up. This aspect, coupled with their inherent inflexibility, is why I think they don’t belong as a core part of a system which is required to scale horizontally.

The reason I believe this is that history, experience and current best practices dictate that to scale well you define your scale-units to be entities which are as distinct from one another as possible. This separation allows the system to treat them largely as separate units in an embarrassingly parallel fashion, especially for functions like the popular MapReduce.

Tree structures simply don’t fit in to this paradigm. With any sort of hierarchy you end up binding each entity tightly to its parent, which in turn is bound tightly to each of its children. Think about what needs to happen if your tree is spread across many servers and you want to change a property of the tree, like the name of a node, or worse, delete a node and shuffle some of its children around. The complexity of this sort of manoeuvre is why trees don’t belong as primary keys of scalable systems. It’s also one of the reasons why document-store-databases like CouchDB are linear, key:value stores.
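
A sketch makes the pain obvious. Flatten a hierarchy into a key:value store using path-style keys (entirely made-up data) and a simple rename stops being a local, single-row update:

my %store = (
  '/animals'              => { name => 'animals' },
  '/animals/cats'         => { name => 'cats'    },
  '/animals/cats/siamese' => { name => 'siamese' },
);

#########
# renaming the 'cats' node means rewriting the key of every descendant,
# potentially on every shard which holds one
#
for my $key (grep { m{^/animals/cats(?:/|$)}smx } keys %store) {
  (my $new_key = $key) =~ s{^/animals/cats}{/animals/felines}smx;
  $store{$new_key} = delete $store{$key};
}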

Web Frameworking

It seems to be the wrong time to be reading such things, but over on InfoQ there’s a nice article introducing web development of RESTful services using Erlang and the Yaws high performance web server.

I say “the wrong time” as this week has kicked off the “Advancing with Rails” course by David A. Black of Ruby Power and Light fame. The course is fairly advanced in terms of required Rails knowledge, so it’s a bit of a baptism by fire for me and a few others who have never written any Ruby before.

Rails is proving moderately easy to pick up, but as I’ve remarked to a couple of people, it doesn’t seem any easier coding with Rails than with Perl. Perhaps it’s because I’ve never done it before, but I reckon it’s a lot harder spending my time figuring out how DHH meant something to be done than it is just doing it myself.

Even though it’s nowhere near as mature, I do reckon my ClearPress framework has a lot going for it – it’s pretty feature-complete in terms of ORM, views and templating (TT2). It has similar convention-over-configuration features, meaning it’s not designed for plugging in alternative layers, but that is absolutely possible to do (and, I suspect, without as much effort as is required in Rails). I still need to iron out some wrinkles in the autogenerated code from the application builder and provide some default authorisation and authentication mechanisms, some of which may come in the next release. In the meantime it’s easy enough to add these features, which is exactly what we’ve done for the new sequencing run tracking app, NPG, to tie it to the WTSI website single sign-on (MySQL and LDAP under the hood).

The Importance of Profiling

I’ve worked as a software developer, and with teams of software developers, for around 10 years now. Many of those I’ve worked with have earned my trust and respect in relation to development and testing techniques. Frustratingly, however, I still hear, with irritating regularity, throw-away comments born of uncertainty and ignorance.

A couple of times now I’ve specifically been told that “GD makes my code go slow”. For those of you not in the know, GD (specifically Lincoln Stein’s GD.pm for Perl) is a wrapper around Tom Boutell’s most marvellous libgd graphics library. The combination of the two has always performed excellently for me and has never been the bottleneck in any of my applications. The applications in question are usually database-backed web applications with graphics components for plotting genomic features or charts of one sort or another.

As any database-application developer will tell you, the database, or network connection to the database is almost always the bottleneck in an application or service. Great efforts are made to ensure database services scale well and perform as efficiently as possible, but even after these improvements are made they usually simply delay the inevitable.

Hence my frustration when I hear that “GD is making my (database) application go slow”. How? Where? Why? Where’s the proof? It’s no use blaming something, a library in this case, that’s out of your control. It’s hard to believe a claim like that without some sort of measurement.

So: before pointing the finger, profile the code and make an effort to understand what the profiler is telling you. In database applications profile your queries – use EXPLAIN, add indices, record SQL transcripts and time the results. Then profile the code which is manipulating those results.

Once the results are in, concentrate first on the parts with the most impact (e.g. 0.1 seconds off each iteration of a 1000x loop rather than 1 second off /int main/) – the low-hanging fruit. Good programmers should be relatively lazy, and speeding up code with the least amount of effort should be common sense.
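
As a concrete starting point, core Benchmark (and Devel::NYTProf for whole-program, statement-level profiling) makes this painless. A sketch – the two subroutines are stand-ins, not code from any real application:

use strict;
use warnings;
use Benchmark qw(cmpthese);

sub with_sprintf { my $s = q[]; $s = sprintf q[%s%d], $s, $_ for 1..500; return $s; }
sub with_concat  { my $s = q[]; $s .= $_ for 1..500; return $s; }

#########
# run each candidate for ~2 CPU seconds and compare rates
#
cmpthese(-2, {
  sprintf => \&with_sprintf,
  concat  => \&with_concat,
});

# and for the whole program:
#   perl -d:NYTProf myscript.pl && nytprofhtml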

Great pieces of code

A lot of what I do day-to-day is related to optimisation. Be it Perl code, SQL queries, Javascript or HTML there are usually at least a couple of cracking examples I find every week. On Friday I came across this:

SELECT cycle FROM goldcrest WHERE id_run = ?

This query is used to find the latest cycle number (between 1 and 37 for each id_run) in a near-real-time tracking system, and it runs several times whenever a run report is viewed.

EXPLAIN SELECT cycle FROM goldcrest WHERE id_run = 231;
  
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+
| id | select_type | table     | type | possible_keys | key     | key_len | ref   | rows   | Extra       |
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+
|  1 | SIMPLE      | goldcrest | ref  | g_idrun       | g_idrun |       8 | const | 262792 | Using where |
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+

In itself this would be fine, but the goldcrest table in this instance contains several thousand rows for each id_run. So for id_run 231, say, this query happens to return approximately 588,000 rows just to determine that the latest cycle for that run is number 34.

To clean this up we first try something like this:

SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = ?

which still scans the 588000 rows (keyed on id_run incidentally) but doesn’t actually return them to the user, only one row containing both values we’re interested in. Fair enough, the CPU and disk access penalties are similar but the data transfer penalty is significantly improved.

Next I try adding an index against the id_run and cycle columns:

ALTER TABLE goldcrest ADD INDEX(id_run,cycle);
Query OK, 37589514 rows affected (23 min 6.17 sec)
Records: 37589514  Duplicates: 0  Warnings: 0

Now this of course takes a long time and, because the tuples are fairly redundant, creates a relatively inefficient index, also penalising future INSERTs. However, casually ignoring those facts, our query performance is now radically different:

EXPLAIN SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = 231;
  
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL |    NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
  
SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = 231;
+------------+------------+
| MIN(cycle) | MAX(cycle) |
+------------+------------+
|          1 |         37 |
+------------+------------+
  
1 row in set (0.01 sec)

That looks a lot better to me now!

Generally I try to steer clear of the mysterious internal workings of database engines, but with much greater frequency come across examples like this:

sub clone_type {
  my ($self, $clone_type, $clone) = @_;
  my %clone_type;

  if($clone and $clone_type) {
    $clone_type{$clone} = $clone_type;
    return $clone_type{$clone};
  }

  return;
}

Thankfully this one’s pretty quick to figure out – they’re usually *much* more convoluted, but still.. Huh??

Pass in a clone_type scalar, create a local hash with the same name (Argh!), store the clone_type scalar in the hash keyed at position $clone, then return the same value we just stored.

I don’t get it… maybe a global hash or something else would make sense, but this works out the same:

sub clone_type {
  my ($self, $clone_type, $clone) = @_;

  if($clone and $clone_type) {
    return $clone_type;
  }
  return;
}

and I’m still not sure why you’d want to do that if you have the values on the way in already.

Programmers really need to think around the problem, not just through it. Thinking through may result in functionality but thinking around results in both function and performance which means a whole lot more in my book, and incidentally, why it seems so hard to hire good programmers.