A Site Indexer for Small Websites

So recently, for one of my projects, I needed a super-simple website search facility. I didn’t want to go with Google because the integration is a bit ugly (and then there are the adverts). I didn’t want to go with Lucene or htDig because they were too heavy for my needs. I also didn’t want to use KinoSearch or Plucene because, whilst they’re both great projects, they were still over the top for what I needed to do.

So I decided to do what all bad programmers do – write my own. How hard can it be?

Well, the first thing you need is an indexer. The indexer downloads pages from the target website, extracts links from them to add to its crawl queue, and processes the textual content of each page for keywords, which it flattens out and stores in an inverted index: a structure that makes it fast to go from a word to the URLs containing it. Along the way it can also store some context about each word, such as where it appears on the page, whether it’s bold or in a heading, and which words appear near it or link to it.
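
(Just to illustrate the idea before diving in: an inverted index can be as simple as a hash keyed on word. The toy below is only a sketch of the concept, held in memory; it isn’t the code this post builds, which keeps its index in SQLite.)

#!/usr/bin/perl
# Toy in-memory inverted index: word => { url => count }.
# Purely illustrative; the real indexer below stores this in SQLite.
use strict;
use warnings;

my %index;

sub add_document {
  my ($url, $text) = @_;
  for my $word (map { lc } grep { /\w{3,}/ } split /\b/, $text) {
    $index{$word}{$url}++;
  }
}

sub lookup {
  my ($word) = @_;
  my $hits = $index{lc $word} || {};
  # Rank URLs by how often the word appears in each document.
  return sort { $hits->{$b} <=> $hits->{$a} } keys %{$hits};
}

add_document('/about', 'All about the project');
add_document('/news',  'Project news and more project updates');
print join("\n", lookup('project')), "\n";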

So the first port of call was http://search.cpan.org/, where I found Scrappy from Al Newkirk: an automated crawler with callback handlers for different HTML tags. Awesome! This would take care of all the web fetching and HTML processing I’d otherwise have to do myself.

So I started with this:

crawl $starting_url, {
		      'a'  => \&process_url,
		      '/*' => \&process_page,
		     };

The function process_url() takes care of only processing links we like: the right sort of protocols (no mailto, file, ftp, gopher, wais, archie, etc.) and the right sorts of files (we don’t want to do much with images, CSS or JavaScript).

sub process_url {
  my $url = shift->href;
  $url    =~ s{\#.*?$}{}smx;

  if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
    #########
    # unsupported protocol
    #
    return;
  }

  if($url !~ m{^$starting_url}) {
    #########
    # external website
    #
    return;
  }

  if($url =~ m{[.]pdf$}smix) {
    #########
    # special handling for PDFs
    #
    push @{$pdfs}, $url;
    return;
  }

  if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
    #########
    # unsupported filetype
    # todo: queue for independent processing
    #
    return;
  }

  queue $url;
}

You’ll notice that if we find a PDF we save its URL out for later, but in general links to ordinary HTML pages are added to Scrappy’s queue.

process_page() then looks a little like this:

sub process_page {
  my $page    = page;

  print "processing $page\n";
  my $html    = shift->html;
  my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
  my $text    = $html;
  $text       =~ s{<script.*?</script>}{}smxig;
  $text       =~ s{<[^>]+>}{ }smxg;

  return process_text($page, \$text, \$title);
}

This does some pretty ugly stuff: pull out the title, remove all script blocks because their contents are particularly unpleasant, then attempt to strip out the remaining tags using a pattern match. I suppose this could be compared to brain surgery with a shovel. Highly delicate!
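
If you wanted something a touch more delicate than the shovel, a real HTML parser is the more robust route. Here’s a minimal sketch of that alternative, assuming HTML::TreeBuilder is installed; it isn’t what the code in this post uses, just the gentler option:

use HTML::TreeBuilder;

# Alternative to the regex stripping: parse the page properly, drop the
# script/style elements, and take the remaining text.
sub extract_text {
  my ($html) = @_;
  my $tree   = HTML::TreeBuilder->new_from_content($html);

  $_->delete for $tree->look_down(_tag => qr/^(?:script|style)$/smx);

  my $title_el = $tree->look_down(_tag => 'title');
  my $title    = $title_el ? $title_el->as_text : q[];
  my $text     = $tree->as_text;

  $tree->delete; # free the parse tree
  return ($title, $text);
}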

The bulk of the text processing is broken out into a separate process_text() function so it can be reused for the PDF processing I want to do later on. It looks something like this:

sub process_text {
  my ($page, $text_ref, $title_ref) = @_;

  $page =~ s{$starting_url}{}smx;

  if($page !~ m{^/}smx) {
    $page = "/$page";
  }

  cleanup_entities($text_ref);
  cleanup_entities($title_ref);

  my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
              grep { length $_ > 3 }     # longer than three characters
	      map  { s{\s+}{}smxg; $_ }  # trim spaces
	      map  { lc }                # store in lowercase
	      split /\b/smx,             # split on word boundary
	      ${$text_ref};

  for my $word (@words) {
    my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];

    if(!defined $score) {
      my ($match) = ${$text_ref} =~ /($word)/smix;
      my $before  = substr ${$text_ref}, 0, $-[0];
      my $after   = substr ${$text_ref}, $+[0];
      $after      =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
      $before     = reverse $before; # reverse the string to limit backtracking.
      $before     =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
      $before     = reverse $before;

      my $context = "$before $match $after"; # use $match here instead of $word to retain case
      $context    =~ s/\s+/ /smxg; # switch many spaces for one
      $context    =~ s/^\s+//smxg; # strip leading space
      $context    =~ s/\s+$//smxg; # strip trailing space
      $dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);

    } else {
      $dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
    }
  }

  $dbh->commit;
  return 1;
}

Firstly this routine makes the incoming URL relative, as we’re only indexing one site. Next it attempts to tidy up HTML entities and high characters using the same sort of brute-force shovel approach as before. Then comes something a little smarter: split the text into words, lowercase them, trim them up and filter them down to anything alphanumeric and longer than three characters.

Then, for each of these words, generate a context string of up to ten words either side and store it in a SQLite database (oh, I didn’t mention that, did I?) together with a running count (the score) of that word’s occurrences in this URL/document.
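
For the curious, the search side can then be little more than a single query against that table. Here’s a sketch of the sort of thing the search script might do; it isn’t the actual script from the follow-up post, and it assumes $dbh is a handle onto the same SQLite database:

# Look up one term in the index and rank pages by score.
my $query_term = lc(shift || q[]); # stand-in for the user's search input

my $sth = $dbh->prepare(q[
  SELECT page, title, context, score
  FROM   idx
  WHERE  word = ?
  ORDER  BY score DESC
  LIMIT  20
]);
$sth->execute($query_term);

while (my ($page, $title, $context, $score) = $sth->fetchrow_array) {
  print "$score $page ($title)\n  ...$context...\n";
}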

Cool! So I now have a SQLite database containing URL, word, wordcount (score) and context. That’s pretty much all I need for a small website search… except, oh. Those dratted PDFs… Okay, here’s process_pdf().

sub process_pdf {
  my ($page) = @_;
  my $ua     = LWP::UserAgent->new;

  print "processing $page\n";

  $ua->agent('indexer');
  my $response = $ua->get($page);

  if (!$response->is_success) {
    carp $response->status_line;
    return;
  }

  my $tmp      = File::Temp->new;
  my $filename = sprintf q[%s.pdf], $tmp->filename;
  eval {
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";
    1;
  } or do {
    carp $EVAL_ERROR;
  };

  my $pdf              = CAM::PDF->new($filename);
  my $npages           = $pdf->numPages();
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  for my $pagenum (1..$npages) {
    my $str = $pdf->getPageText($pagenum);
    process_text($page, \$str, \$title);
  }

  unlink $filename;

  return 1;
}

As you can see, we fetch the URL using a new LWP::UserAgent and save the response to a temporary file, which is fairly run-of-the-mill stuff. Then the magic happens: we ask CAM::PDF to process the file and tell us how many pages it has, then iterate over those pages, ripping out the textual content and throwing it back over to our old process_text() routine from earlier. Bingo! As a good friend of mine says, “Job’s a good ’un”.

We’ll see how the site search script works in another post. Here’s the whole code (no, I know it’s not particularly good, especially that atrocious entity handling). Of course it’s fairly easy to extend for other document types too. I’d like to support the various (Open)Office formats, but it depends what’s supported out of the box from CPAN; there’s a rough sketch of one possible approach after the listing.

#!/usr/local/bin/perl -T
#########
# Author: rmp@psyphi.net
# Created: 2011-02-04
# Perl Artistic License
#
use strict;
use warnings;
use Scrappy qw(:syntax);
use HTML::Entities;
use DBI;
use Carp;
use Getopt::Long;
use English qw(-no_match_vars);
use LWP::UserAgent;
use File::Temp qw(tempfile);
use CAM::PDF;
use CGI qw(unescape);

my $opts = {};
GetOptions($opts, qw(url=s dbname=s help)) or croak $ERRNO;

if(!$opts->{url} || !$opts->{dbname} || $opts->{help}) {
  print <<EOT;
indexer -url=http://website.to.crawl/ -dbname=database.sqlite [-help]
EOT
  exit;
}

my $pdfs         = [];
my $starting_url = $opts->{url};
my ($dbname)     = $opts->{dbname} =~ m{([a-z\d_/.\-]+)}smix;
my $dbh          = DBI->connect(qq[DBI:SQLite:dbname=$dbname],q[],q[],{
								       RaiseError => 1,
								       AutoCommit => 0,
								      });
eval {
  $dbh->do(q[drop table idx]); # todo: build tmp index and rename
};
$dbh->do(q[create table idx(word char(32), page char(255), title char(64), context char(64), score int)]);

crawl $starting_url, {
		      'a'  => \&process_url,
		      '/*' => \&process_page,
		     };

process_pdfs($pdfs);

sub process_pdfs {
  my ($pdfs) = @_;

  for my $pdf (@{$pdfs}) {
    eval {
      process_pdf($pdf);
      1;
    } or do {
      carp $EVAL_ERROR;
    };
  }
  return 1;
}

sub cleanup_entities {
  my ($str) = @_; # scalar reference

  if(!defined ${$str}) {
    ${$str} = q[];
  }

  ${$str} = encode_entities(${$str});
  ${$str} =~ s/&nbsp;/ /smxg;
  decode_entities(${$str});
  decode_entities(${$str}); # twice, to catch double-encoded entities

  ${$str} =~ s{[^\x20-\xff]}{ }smxig;
  ${$str} =~ s{\s+}{ }smxg;

  return 1;
}

sub process_url {
  my $url = shift->href;
  $url    =~ s{\#.*?$}{}smx;

  if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
    #########
    # unsupported protocol
    #
    return;
  }

  if($url !~ m{^$starting_url}) {
    #########
    # external website
    #
    return;
  }

  if($url =~ m{[.]pdf$}smix) {
    #########
    # special handling for PDFs
    #
    push @{$pdfs}, $url;
    return;
  }

  if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
    #########
    # unsupported filetype
    # todo: queue for independent processing
    #
    return;
  }

  queue $url;
}

sub process_page {
  my $page    = page;

  print "processing $page\n";
  my $html    = shift->html;
  my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
  my $text    = $html;
  $text       =~ s{<script.*?</script>}{}smxig;
  $text       =~ s{<[^>]+>}{ }smxg;

  return process_text($page, \$text, \$title);
}

sub process_text {
  my ($page, $text_ref, $title_ref) = @_;

  $page =~ s{$starting_url}{}smx;

  if($page !~ m{^/}smx) {
    $page = "/$page";
  }

  cleanup_entities($text_ref);
  cleanup_entities($title_ref);

  my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
              grep { length $_ > 3 }     # longer than three characters
	      map  { s{\s+}{}smxg; $_ }  # trim spaces
	      map  { lc }                # store in lowercase
	      split /\b/smx,             # split on word boundary
	      ${$text_ref};

  for my $word (@words) {
    my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];

    if(!defined $score) {
      my ($match) = ${$text_ref} =~ /($word)/smix;
      my $before  = substr ${$text_ref}, 0, $-[0];
      my $after   = substr ${$text_ref}, $+[0];
      $after      =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
      $before     = reverse $before; # reverse the string to limit backtracking.
      $before     =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
      $before     = reverse $before;

      my $context = "$before $match $after"; # use $match here instead of $word to retain case
      $context    =~ s/\s+/ /smxg;
      $context    =~ s/^\s+//smxg;
      $context    =~ s/\s+$//smxg;
      $dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);

    } else {
      $dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
    }
  }

  $dbh->commit;
  return 1;
}

sub process_pdf {
  my ($page) = @_;
  my $ua     = LWP::UserAgent->new;

  print "processing $page\n";

  $ua->agent('indexer');
  my $response = $ua->get($page);

  if (!$response->is_success) {
    carp $response->status_line;
    return;
  }

  my $tmp      = File::Temp->new;
  my $filename = sprintf q[%s.pdf], $tmp->filename;
  eval {
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";
    1;
  } or do {
    carp $EVAL_ERROR;
  };

  my $pdf              = CAM::PDF->new($filename);
  my $npages           = $pdf->numPages();
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  for my $pagenum (1..$npages) {
    my $str = $pdf->getPageText($pagenum);
    process_text($page, \$str, \$title);
  }

  unlink $filename;

  return 1;
}


Please note: WordPress keeps eating various bits of the code, like script tags, greater-than and ampersand entities so if you see weird coding errors it’s probably down to that.
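
On extending to other document types: the (Open)Office formats are just zip containers with the document text in content.xml, so one possible approach would be something like the sketch below. It’s only a sketch (process_odt and its wiring into the crawler are left as an exercise) and it assumes Archive::Zip is available:

use Archive::Zip qw(:ERROR_CODES);

# Sketch: index an OpenDocument file which has already been fetched to
# $filename, in much the same way process_pdf() handles PDFs.
sub process_odt {
  my ($page, $filename) = @_;

  my $zip = Archive::Zip->new();
  if($zip->read($filename) != AZ_OK) {
    carp "Error reading $filename";
    return;
  }

  my $xml              = $zip->contents('content.xml');
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  # Same shovel approach as for HTML: strip the markup, keep the words.
  $xml =~ s{<[^>]+>}{ }smxg;

  return process_text($page, \$xml, \$title);
}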

Generating PDFs from .docs, .ppts and friends

So a few weeks ago I set up a system to fix a standing problem I’ve had for a looong time. We have an internal web-based repository for tracking versions of documentation. It takes any sort of file, but the majority of them seem to end up as .docs and .ppts.

That’s fine for the minority of our users on Windows, but far and away most are on UNIX (primarily Linux and Tru64) and they can’t do much with those formats. OK, so some of them have OpenOffice installed, but even for those who do it’s still a pretty bulky application to start up just to view these things quickly.

The answer, it seemed, was to convert them on the fly, or on update, to a format which is more easily viewed. PDF was the obvious choice, but how to generate it? Initially we had a copy of Acrobat Distiller running on a Windows PC, watching a shared folder on a network drive, but that really wasn’t the right solution. The wvPS suite was also considered, but at the time the webservers were running on Alpha/Tru64 and the suite didn’t compile easily.

Last week one of the queries I received reminded me that I still didn’t have a solution to this problem. Googling around a bit, I found this thread, which describes a macro to perform the right sort of thing. Combined with the openoffice command-line options -invisible -headless -nologo, you’d think this would work without also requiring X. Unfortunately openoffice still wants a display, presumably to access fonts, but I’m not sure.

Enter Xvfb, the X virtual framebuffer. This is a virtual X display which effectively runs headless. You can screengrab from it using the xwud tool (another part of the XFree86 distribution). Running Xvfb on a given display and telling openoffice to open on that display means the whole system can run headlessly and is scriptable at last. The machines running the webservers are significantly beefier than those on people’s desks, so things don’t appear to be too slow to kick off.
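
To give a flavour of how the pieces hang together, here’s a minimal sketch. The display number, screen geometry and macro name (Standard.Conversion.ConvertToPDF) are all assumptions; the macro itself is whatever you installed following the thread above:

#!/usr/bin/perl
# Sketch: run OpenOffice headlessly against a virtual X display.
use strict;
use warnings;

my $display = ':99';                 # assumed-free display number
my $infile  = shift or die "usage: $0 <file.doc>\n";

# Start a virtual framebuffer for openoffice to talk to.
my $xvfb_pid = fork;
die "fork failed: $!\n" if !defined $xvfb_pid;
if($xvfb_pid == 0) {
  exec 'Xvfb', $display, '-screen', '0', '1024x768x24';
  die "exec Xvfb failed: $!\n";
}
sleep 2;                             # crude: give Xvfb a moment to start

$ENV{DISPLAY} = $display;
system('openoffice', '-invisible', '-headless', '-nologo',
       qq[macro:///Standard.Conversion.ConvertToPDF("$infile")]) == 0
  or warn "conversion failed: $?\n";

kill 'TERM', $xvfb_pid;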

Another useful thing to do is to duplicate the OpenOffice macro to support other file types, selecting the right macro via a command-line switch based on the extension of the file to convert; a rough sketch follows. It seems we have a workable, and relatively flexible, solution. Conversion quality is generally pretty good too. Here’s looking forward to OOo version 2!
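
The dispatch itself can be a simple lookup from extension to macro. Again a sketch, with entirely hypothetical macro names, and assuming DISPLAY already points at a running Xvfb as above:

#!/usr/bin/perl
# Sketch: choose a conversion macro based on the input file's extension.
use strict;
use warnings;

my %macro_for = (                    # hypothetical macro names
  doc => 'ConvertWriterDocToPDF',
  rtf => 'ConvertWriterDocToPDF',
  ppt => 'ConvertImpressDocToPDF',
  sxw => 'ConvertWriterDocToPDF',
);

my $infile = shift or die "usage: $0 <file>\n";
my ($ext)  = $infile =~ m{[.](\w+)$}smx;
my $macro  = $ext && $macro_for{lc $ext}
  or die "don't know how to convert that file type\n";

exec 'openoffice', '-invisible', '-headless', '-nologo',
     qq[macro:///Standard.Conversion.$macro("$infile")];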

… of course the only thing is that, now the web systems are Linux-based, wvPS will work properly anyway. Doh. The consolation is that this way we can do PPTs, RTFs and, of course, the native StarOffice and OpenOffice formats, plus other weird and wonderful things.