So recently with one of my projects I needed a super simple website search facility. I didn’t want to go with Google because the integration is a bit ugly (and then there are the adverts). I didn’t want to go with Lucene or htDig because they were too heavy for my needs. I also didn’t want to use KinoSearch or Plucene because whilst they’re both great projects they were still over-the-top for what I needed to do.
So I decided to do what all bad programmers do – write my own. How hard can it be?
Well, the first thing you need is an indexer. The indexer downloads pages from the target website, scans them for links (which it adds to its queue) and processes the textual content on each page for keywords, which it then flattens out and stores in an inverted index, i.e. one which makes it fast to go from a word to the website URLs containing it. On the way it can also store some context about each word, such as where it is on the page, whether it’s bold or in a heading, and which words are near it or link to it.
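To picture that, here’s a conceptual sketch of the shape an inverted index takes (the pages and words are made up; the real index ends up in an SQLite table further down):
# Conceptual sketch only: an inverted index maps each word to the pages
# containing it, plus a little metadata (made-up pages and words here).
my %inverted_index = (
    'badger' => {
        '/sett.html'     => { score => 3, context => '... the badger dug out a new sett ...' },
        '/wildlife.html' => { score => 1, context => '... spotted a badger at dusk ...' },
    },
    'search' => {
        '/index.html'    => { score => 2, context => '... a super simple website search ...' },
    },
);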
So the first port of call is http://search.cpan.org/ where I found Scrappy from Al Newkirk. An automated indexer with callback handlers for different HTML tags. Awesome! This would sort all of the web fetching & HTML processing I had to do.
So I started with this:
crawl $starting_url, {
'a' => \&process_url,
'/*' => \&process_page,
};
The process_url() function takes care of only processing links we like: the right sorts of protocols (no mailto, file, ftp, gopher, wais, archie etc.) and the right sorts of files (we don’t want to do much with images, CSS, JavaScript etc.).
sub process_url {
my $url = shift->href;
$url =~ s{\#.*?$}{}smx;
if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
#########
# unsupported protocol
#
return;
}
if($url !~ m{^$starting_url}) {
#########
# external website
#
return;
}
if($url =~ m{pdf}smix) {
#########
# special handling for PDFs
#
push @{$pdfs}, $url;
return;
}
if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
#########
# unsupported filetype
# todo: queue for independent processing
#
return;
}
queue $url;
}
You’ll notice that if we find a PDF we save it out for later, but generally “normal” HTML links to HTML pages will be added to Scrappy’s queue.
process_page() then looks a little like this:
sub process_page {
my $page = page;
print "processing $page\n";
my $html = shift->html;
my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
my $text = $html;
$text =~ s{<script.*?</script>}{}smxig;
$text =~ s{<[^>]+>}{ }smxg;
return process_text($page, \$text, \$title);
}
This does some pretty ugly stuff – pull out the title, remove all script tags because they’re particularly unpleasant, then attempt to strip out all tags using a pattern match. I suppose this could be compared to brain surgery with a shovel – highly delicate!
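If you’d rather not operate with a shovel, a CPAN module such as HTML::Strip will do the tag (and script) stripping more robustly. A minimal sketch of that alternative, not what the indexer above actually uses:
use HTML::Strip;

sub strip_tags {
    my ($html) = @_;
    my $hs   = HTML::Strip->new;    # also drops the contents of script/style blocks
    my $text = $hs->parse($html);   # returns plain text
    $hs->eof;                       # reset the parser so it can be reused
    return $text;
}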
The bulk of the text processing is broken out into a separate process_text() function so it can be reused for the PDF processing I want to do later on. It looks something like this:
sub process_text {
my ($page, $text_ref, $title_ref) = @_;
$page =~ s{$starting_url}{}smx;
if($page !~ m{^/}smx) {
$page = "/$page";
}
cleanup_entities($text_ref);
cleanup_entities($title_ref);
my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
grep { length $_ > 3 } # longer than three characters
map { s{\s+}{}smxg; $_ } # trim spaces
map { lc } # store in lowercase
split /\b/smx, # split on word boundary
${$text_ref};
for my $word (@words) {
my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];
if(!defined $score) {
my ($match) = ${$text_ref} =~ /($word)/smix;
my $before = substr ${$text_ref}, 0, $-[0];
my $after = substr ${$text_ref}, $+[0];
$after =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
$before = reverse $before; # reverse the string to limit backtracking.
$before =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
$before = reverse $before;
my $context = "$before $match $after"; # use $match here instead of $word to retain case
$context =~ s/\s+/ /smxg; # switch many spaces for one
$context =~ s/^\s+//smxg; # strip leading space
$context =~ s/\s+$//smxg; # strip trailing space
$dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);
} else {
$dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
}
}
$dbh->commit;
return 1;
}
Firstly this routine makes the incoming URL relative, as it’s only indexing one site. Next it attempts to munge high characters into HTML entities and back again, using the same sort of shovel-brute-force approach as before. Then comes something a little smarter: split the text into words, lowercase them, trim them up and filter them down to anything alphanumeric and longer than three characters.
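To make that filter chain concrete, here’s what it keeps from a made-up sentence (illustrative only; this snippet isn’t part of the indexer):
# Illustrative only: the same filter chain as process_text(), applied to a sample string.
my $sample = q[The Quick-Brown Fox jumped over 100 lazy dogs];
my @words  = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
             grep { length $_ > 3 }     # longer than three characters
             map  { s{\s+}{}smxg; $_ }  # trim spaces
             map  { lc }                # store in lowercase
             split /\b/smx, $sample;    # split on word boundary
# @words now holds: quick brown jumped over lazy dogs
# note that short tokens like 'the', 'fox' and '100' are dropped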
Then, the first time each of these words is seen on a page, generate a context string of the ten words before and after it and store it in a SQLite (oh, I didn’t mention that, did I?) database, together with a score counting that word’s occurrences in this URL/document; repeat sightings of the same word just bump the score.
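That layout is enough for the search side to do something along these lines later on (an illustrative sketch rather than the actual search script; $query_term here is a made-up variable):
# Illustrative only: roughly how a search script can use the index.
my $hits = $dbh->selectall_arrayref(
    q[SELECT page, title, context, score
        FROM idx
       WHERE word = ?
    ORDER BY score DESC
       LIMIT 20],
    {}, lc $query_term, # $query_term is hypothetical
);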
Cool! So I now have a SQLite database containing URL, word, wordcount (score) and context. That’s pretty much all I need for a small website search… except, oh. Those dratted PDFs… Okay, here’s process_pdf().
sub process_pdf {
my ($page) = @_;
my $ua = LWP::UserAgent->new;
print "processing $page\n";
$ua->agent('indexer');
my $response = $ua->get($page);
if (!$response->is_success) {
carp $response->status_line;
return;
}
my $tmp = File::Temp->new;
my $filename = sprintf q[%s.pdf], $tmp->filename;
eval {
open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
binmode $fh;
print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
close $fh or croak "Error closing $filename: $ERRNO";
1;
} or do {
carp $EVAL_ERROR;
};
my $pdf = CAM::PDF->new($filename);
my $npages = $pdf->numPages();
my ($short_filename) = $page =~ m{([^/]+)$}smix;
my $title = unescape($short_filename);
for my $pagenum (1..$npages) {
my $str = $pdf->getPageText($pagenum);
process_text($page, \$str, \$title);
}
unlink $filename;
return 1;
}
As you can see, we’re fetching the URL using a new LWP::UserAgent and saving it to a temporary file: fairly run-of-the-mill stuff. Then the magic happens. We ask CAM::PDF to load the file and tell us how many pages it has, then iterate over those pages, ripping out the textual content and throwing it back over to our old process_text() routine from earlier. Bingo! As a good friend of mine says, “Job’s a good ’un”.
We’ll see how the site search script works in another post. Here’s the whole code (no, I know it’s not particularly good, especially that atrocious entity handling). Of course it’s fairly easy to extend for other document types too: I’d like to add the various (Open)Office formats, but it depends what’s supported out of the box from CPAN (there’s a rough sketch of one way it might go after the listing).
#!/usr/local/bin/perl -T
#########
# Author: rmp@psyphi.net
# Created: 2011-02-04
# Perl Artistic License
#
use strict;
use warnings;
use Scrappy qw(:syntax);
use HTML::Entities;
use DBI;
use Carp;
use Getopt::Long;
use English qw(-no_match_vars);
use LWP::UserAgent;
use File::Temp qw(tempfile);
use CAM::PDF;
use CGI qw(unescape);
my $opts = {};
GetOptions($opts, qw(url=s dbname=s help)) or croak $ERRNO;
if(!$opts->{url} || !$opts->{dbname} || $opts->{help}) {
print <<EOT;
indexer -url=http://website.to.crawl/ -dbname=database.sqlite [-help]
EOT
exit;
}
my $pdfs = [];
my $starting_url = $opts->{url};
my ($dbname) = $opts->{dbname} =~ m{([a-z\d_/.\-]+)}smix;
my $dbh = DBI->connect(qq[DBI:SQLite:dbname=$dbname],q[],q[],{
RaiseError => 1,
AutoCommit => 0,
});
eval {
$dbh->do(q[drop table idx]); # todo: build tmp index and rename
};
$dbh->do(q[create table idx(word char(32), page char(255), title char(64), context char(64), score int)]);
crawl $starting_url, {
'a' => \&process_url,
'/*' => \&process_page,
};
process_pdfs($pdfs);
sub process_pdfs {
my ($pdfs) = @_;
for my $pdf (@{$pdfs}) {
eval {
process_pdf($pdf);
1;
} or do {
carp $EVAL_ERROR;
};
}
return 1;
}
sub cleanup_entities {
my ($str) = @_;
if(!defined ${$str}) {
${$str} = q[];
}
encode_entities(${$str});
${$str} =~ s/Â / /smxg; # strip what looks like a mis-encoded non-breaking space
decode_entities(${$str});
decode_entities(${$str});
${$str} =~ s{[^\x20-\xff]}{ }smxig;
${$str} =~ s{\s+}{ }smxg;
return 1;
}
sub process_url {
my $url = shift->href;
$url =~ s{\#.*?$}{}smx;
if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
#########
# unsupported protocol
#
return;
}
if($url !~ m{^$starting_url}) {
#########
# external website
#
return;
}
if($url =~ m{pdf}smix) {
#########
# special handling for PDFs
#
push @{$pdfs}, $url;
return;
}
if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
#########
# unsupported filetype
# todo: queue for independent processing
#
return;
}
queue $url;
}
sub process_page {
my $page = page;
print "processing $page\n";
my $html = shift->html;
my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
my $text = $html;
$text =~ s{<script.*?</script>}{}smxig;
$text =~ s{<[^>]+>}{ }smxg;
return process_text($page, \$text, \$title);
}
sub process_text {
my ($page, $text_ref, $title_ref) = @_;
$page =~ s{$starting_url}{}smx;
if($page !~ m{^/}smx) {
$page = "/$page";
}
cleanup_entities($text_ref);
cleanup_entities($title_ref);
my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
grep { length $_ > 3 } # longer than three characters
map { s{\s+}{}smxg; $_ } # trim spaces
map { lc } # store in lowercase
split /\b/smx, # split on word boundary
${$text_ref};
for my $word (@words) {
my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];
if(!defined $score) {
my ($match) = ${$text_ref} =~ /($word)/smix;
my $before = substr ${$text_ref}, 0, $-[0];
my $after = substr ${$text_ref}, $+[0];
$after =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
$before = reverse $before; # reverse the string to limit backtracking.
$before =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
$before = reverse $before;
my $context = "$before $match $after"; # use $match here instead of $word to retain case
$context =~ s/\s+/ /smxg;
$context =~ s/^\s+//smxg;
$context =~ s/\s+$//smxg;
$dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);
} else {
$dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
}
}
$dbh->commit;
return 1;
}
sub process_pdf {
my ($page) = @_;
my $ua = LWP::UserAgent->new;
print "processing $page\n";
$ua->agent('indexer');
my $response = $ua->get($page);
if (!$response->is_success) {
carp $response->status_line;
return;
}
my $tmp = File::Temp->new;
my $filename = sprintf q[%s.pdf], $tmp->filename;
eval {
open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
binmode $fh;
print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
close $fh or croak "Error closing $filename: $ERRNO";
1;
} or do {
carp $EVAL_ERROR;
};
my $pdf = CAM::PDF->new($filename);
my $npages = $pdf->numPages();
my ($short_filename) = $page =~ m{([^/]+)$}smix;
my $title = unescape($short_filename);
for my $pagenum (1..$npages) {
my $str = $pdf->getPageText($pagenum);
process_text($page, \$str, \$title);
}
unlink $filename;
return 1;
}
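For the record, here’s the sort of thing I mean by extending it: a rough, untested sketch of a .docx handler in the same mould as process_pdf(). The process_docx() name and the use of Archive::Zip to pull word/document.xml out of the bundle are guesswork on my part rather than anything the code above does.
# Rough, untested sketch: a .docx file is just a zip archive with the main
# text in word/document.xml, so Archive::Zip plus the same shovel-surgery
# tag stripping gets us something process_text() can chew on.
use Archive::Zip qw(:ERROR_CODES);

sub process_docx {
    my ($page) = @_;
    my $ua = LWP::UserAgent->new;
    print "processing $page\n";
    $ua->agent('indexer');
    my $response = $ua->get($page);
    if(!$response->is_success) {
        carp $response->status_line;
        return;
    }
    my $tmp      = File::Temp->new;
    my $filename = sprintf q[%s.docx], $tmp->filename;
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";

    my $zip = Archive::Zip->new;
    if($zip->read($filename) != AZ_OK) {
        carp "failed to read $filename as a zip archive";
        unlink $filename;
        return;
    }
    my $xml = $zip->contents(q[word/document.xml]) || q[];
    $xml =~ s{<[^>]+>}{ }smxg; # strip the XML markup, shovel-style

    my ($short_filename) = $page =~ m{([^/]+)$}smix;
    my $title = unescape($short_filename);
    process_text($page, \$xml, \$title);
    unlink $filename;
    return 1;
}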
Please note: WordPress keeps eating various bits of the code, like script tags, greater-than and ampersand entities so if you see weird coding errors it’s probably down to that.