A Simple Continuous Integration (Jenkins) Dashboard

I had 15 minutes today to produce a wall-mounted-screen-compatible dashboard for showing the latest build statuses from our Jenkins continuous integration manager. It’s written in Perl and uses a few CPAN modules – XML::Simple, LWP::Simple and Readonly.

This is what it looks like (dashboard screenshot omitted):

and here’s the code:

#!/usr/local/bin/perl -T
use strict;
use warnings;
use XML::Simple;
use LWP::Simple qw(get);
use Carp;
use English qw(-no_match_vars);
use Readonly;

Readonly::Scalar our $CI      => q[http://my.ci.server/rssLatest];
Readonly::Scalar our $COLUMNS => 6;

my $str     = get($CI);    # fetch the Atom feed of latest build statuses
my $xml     = XMLin($str);
my @entries = map { $xml->{entry}->{$_} } sort keys %{$xml->{entry}}; # XML::Simple keys the entries on their <id>

print <<"EOT" or croak qq[Error printing: $ERRNO];
Content-type: text/html

<html>
 <head>
  <title>Continuous Integration HUD</title>
  <meta http-equiv="refresh" content="120; url=$ENV{SCRIPT_NAME}"/>
  <style type="text/css">
.stable { background-color: green }
.unstable { background-color: yellow }
.broken { background-color: red }
table { margin: 0 auto; }
a { font-size: larger; text-decoration: none; color: black; }
  </style>
  <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.6.4/jquery.min.js"></script>
  <script type="text/javascript">
\$(document).ready(redraw);
\$(window).resize(redraw);

function redraw() {
   \$('tr').height((\$(window).height()-60)/\$('tr').size());
   \$('td').width((\$(window).width()-60)/\$('tr').first().find('td').size());
}
  </script>
 </head>
 <body>
EOT

print qq[<table>\n] or croak qq[Error printing: $ERRNO];
while(scalar @entries) {
  print qq[ <tr>\n] or croak qq[Error printing: $ERRNO];
  for my $j (1..$COLUMNS) {
    my $entry = shift @entries;
    if(!$entry) {
      last;
    }

    my $title = $entry->{title};          # build status is encoded in the entry title
    my $class = q[stable];                # default to stable unless the title says otherwise
    $class    = ($title =~ /unstable/smx) ? 'unstable' : $class;
    $class    = ($title =~ /broken/smx)   ? 'broken'   : $class;
    $title    =~ s{\s+[(].*?$}{}smx;      # drop the trailing "(status ...)" suffix

    my $href = $entry->{link}->{href};
    print qq[  <td class="$class"><a href="$href">$title</a></td>] or croak qq[Error printing: $ERRNO];
  }
  print qq[ </tr>\n] or croak qq[Error printing: $ERRNO];
}
print qq[</table>\n] or croak qq[Error printing: $ERRNO];

print <<'EOT' or croak qq[Error printing: $ERRNO];
 </body>
</html>
EOT
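
For reference, the script expects the Jenkins rssLatest endpoint to return an Atom feed along these lines (a rough sketch from memory; the exact <id> format and status wording vary between Jenkins versions). XML::Simple folds the <entry> elements into a hash keyed on their <id>, which is why the code iterates over sort keys %{$xml->{entry}}:

<feed xmlns="http://www.w3.org/2005/Atom">
 <title>All last builds only</title>
 <entry>
  <title>my-project #123 (stable)</title>
  <link type="text/html" href="http://my.ci.server/job/my-project/123/"/>
  <id>tag:example,2011:my-project</id> <!-- illustrative id only -->
 </entry>
</feed>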

ARISS Contact planned for school in India audible in Europe

With Discovery’s final trip earlier this week, the ISS has been in the news again, so I was interested to see this earlier today, received via John G0DET. Hopefully I’ll remember it’s happening and attempt to tune in.

An ARISS School Contact is scheduled for Saturday 12 March at 13:34 UTC (14:34 CET) for Shri Krishna Vithyaalayam Matric School, Pollachi (TK), India.

The school is located in a rural part of Tamilnadu state in India. It has a global partnership program with Westborough School in the UK.

Mr. M.K. Anantha Ganesan, the founder and Principal, visited Westborough School in September 2009. He is an amateur radio operator (VU3GPF). He visited the Chelmsford Amateur Radio Society and was made an honorary member. The Anamalai Amateur Radio Club operates from the school.

The contact will be a telebridge, operated by IK1SLD.

Downlink signals will be audible over Europe on 145.800 MHz FM.

The conversation will be conducted in English. Students will ask as many of the following questions as time allows.

  1. How many space walks are allowed and what purpose do space walks serve?
  2. How long do the astronauts have to undergo the endurance tests?
  3. What type of food is taken by the astronauts?
  4. How far is the Chandra X-ray Observatory useful in studying exotic phenomena?
  5. What new construction is taking place in the ISS?
  6. What kind of scientific and technological experiments are undertaken by the astronauts?
  7. Please tell us about the emotional changes that you undergo while spending months together in space.
  8. Could you please tell us whether space junk will hamper space travel?
  9. What are your first impressions on seeing mother Earth for the first time from space?
  10. What steps have been initiated to reduce the impact of radiation exposure to the astronauts?
  11. What functions have been assigned to the Robonaut?
  12. Can you explain the term microgravity and how astronauts cope with that?
  13. What is your message to humanity as a whole and the students like us from India?

ARISS is an international educational outreach program partnering the participating space agencies, NASA, Russian Space Agency, ESA, CNES, JAXA, and CSA, with the AMSAT and IARU organizations from participating countries.

ARISS offers an opportunity for students to experience the excitement of Amateur Radio by talking directly with crewmembers onboard the International Space Station. Teachers, parents and communities see, first hand, how Amateur Radio and crewmembers on ISS can energize youngsters’ interest in science, technology and learning.

Gaston Bertels, ON4WF
ARISS Chairman

Thanks to SEARS Canvey Island for the Co-op with India.

John G8DET CARS.

A Site Indexer for Small Websites

So recently, with one of my projects, I needed a super-simple website search facility. I didn’t want to go with Google because the integration is a bit ugly (and then there are the adverts). I didn’t want to go with Lucene or htDig because they were too heavy for my needs. I also didn’t want to use KinoSearch or Plucene because, whilst they’re both great projects, they were still over-the-top for what I needed to do.

So I decided to do what all bad programmers do – write my own. How hard can it be?

Well, the first thing you need is an indexer. The indexer downloads pages of the target website, processes them for links, which it adds to its queue, and processes the textual content of each page for keywords, which it then flattens out and stores in an inverted index, that is, one which makes it fast to go from a word to a website URL. Along the way it can also store some context about each word, such as where it is on the page, whether it’s bold or in a heading, and what words are near it or linking to it.
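
In this case the inverted index ends up as a single SQLite table (the same schema the full listing further down creates), annotated here with what each column ends up holding:

$dbh->do(q[create table idx(
  word    char(32),  -- lowercased token
  page    char(255), -- site-relative URL of the document
  title   char(64),  -- page <title>, or the PDF filename
  context char(64),  -- the first hit plus roughly ten words either side
  score   int        -- number of occurrences of word on this page
)]);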

So the first port of call is http://search.cpan.org/, where I found Scrappy from Al Newkirk: an automated crawler with callback handlers for different HTML tags. Awesome! This would sort out all of the web fetching and HTML processing I had to do.

So I started with this:

crawl $starting_url, {
		      'a'  => \&process_url,
		      '/*' => \&process_page,
		     };

The function process_url() takes care of only processing links we like – the right sort of protocols (no mailto, file, ftp, gopher, wais, archie etc.) and the right sorts of files (we don’t want to do much with images, CSS or JavaScript).

sub process_url {
  my $url = shift->href;
  $url    =~ s{\#.*?$}{}smx;

  if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
    #########
    # unsupported protocol
    #
    return;
  }

  if($url !~ m{^$starting_url}) {
    #########
    # external website
    #
    return;
  }

  if($url =~ m{[.]pdf$}smix) {
    #########
    # special handling for PDFs
    #
    push @{$pdfs}, $url;
    return;
  }

  if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
    #########
    # unsupported filetype
    # todo: queue for independent processing
    #
    return;
  }

  queue $url;
}

You’ll notice that if we find a PDF we save it out for later, but generally “normal” HTML links to HTML pages will be added to Scrappy’s queue.

process_page() then looks a little like this:

sub process_page {
  my $page    = page;

  print "processing $page\n";
  my $html    = shift->html;
  my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
  my $text    = $html;
  $text       =~ s{<script(.*?)</script>}{}smxig; # strip script blocks wholesale
  $text       =~ s{<[^>]+>}{ }smxg;               # then strip any remaining tags

  return process_text($page, \$text, \$title);
}

This does some pretty ugly stuff – pull out the title, remove all script tags because they’re particularly unpleasant, then attempt to strip out all tags using a pattern match. I suppose this could be compared to brain surgery with a shovel – highly delicate!
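
(As an aside, if the shovel ever feels too blunt, a CPAN module such as HTML::Strip will do the tag stripping rather more politely. This is just a hedged sketch of that alternative, not something the indexer above uses:)

use HTML::Strip;

# alternative to the regex surgery: let HTML::Strip remove the markup
my $hs         = HTML::Strip->new();
my $clean_text = $hs->parse($html); # returns the text content with tags removed
$hs->eof;                           # reset the parser between documents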

The bulk of the text processing is broken out into a separate process_text() function so it can be reused for the PDF processing I want to do later on. It looks something like this:

sub process_text {
  my ($page, $text_ref, $title_ref) = @_;

  $page =~ s{$starting_url}{}smx;

  if($page !~ m{^/}smx) {
    $page = "/$page";
  }

  cleanup_entities($text_ref);
  cleanup_entities($title_ref);

  my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
              grep { length $_ > 3 }     # longer than three characters
	      map  { s{\s+}{}smxg; $_ }  # trim spaces
	      map  { lc }                # store in lowercase
	      split /\b/smx,             # split on word boundary
	      ${$text_ref};

  for my $word (@words) {
    my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];

    if(!defined $score) {
      my ($match) = ${$text_ref} =~ /($word)/smix;
      my $before  = substr ${$text_ref}, 0, $-[0];
      my $after   = substr ${$text_ref}, $+[0];
      $after      =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
      $before     = reverse $before; # reverse the string to limit backtracking.
      $before     =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
      $before     = reverse $before;

      my $context = "$before $match $after"; # use $match here instead of $word to retain case
      $context    =~ s/\s+/ /smxg; # switch many spaces for one
      $context    =~ s/^\s+//smxg; # strip leading space
      $context    =~ s/\s+$//smxg; # strip trailing space
      $dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);

    } else {
      $dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
    }
  }

  $dbh->commit;
  return 1;
}

Firstly, this routine makes the incoming URL relative, as it’s only indexing one site. Next it munges high characters and HTML entities into plain text using the same sort of shovel-brute-force approach as before. Then comes something a little smarter: split out all the words, lowercase them, trim them up and filter them down to anything alphanumeric and longer than three letters.
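
As a quick worked example (mine, not from the indexer), here’s what that filter chain leaves behind for a throwaway sample string; note that short words like "the" and "fox" fall foul of the length check:

my $sample = q[The Quick Brown Fox jumped over 2 lazy dogs, twice.];
my @words  = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
             grep { length $_ > 3 }     # longer than three characters
             map  { s{\s+}{}smxg; $_ }  # trim spaces
             map  { lc }                # store in lowercase
             split /\b/smx, $sample;    # split on word boundary

# @words now contains: quick brown jumped over lazy dogs twice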

Then, for each one of these words, generate a context string of 10 words before and after, and store it in a SQLite (oh, I didn’t mention that, did I?) database together with a count of that word’s occurrences in this URL/document.
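
Pulling results back out later is then just a SELECT ordered by score, something like the sketch below (a hypothetical query only; the real search script is for another post, and $search_term stands in for whatever the user typed):

# rank pages for a single search word by how often it appears
my $results = $dbh->selectall_arrayref(
  q[SELECT page, title, context, score
      FROM idx
     WHERE word = ?
  ORDER BY score DESC
     LIMIT 20],
  {}, lc $search_term,
);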

Cool! So I now have a SQLite database containing URL, word, wordcount (score) and context. That’s pretty much all I need for a small website search… except, oh. Those dratted PDFs… Okay, here’s process_pdf().

sub process_pdf {
  my ($page) = @_;
  my $ua     = LWP::UserAgent->new;

  print "processing $page\n";

  $ua->agent('indexer');
  my $response = $ua->get($page);

  if (!$response->is_success) {
    carp $response->status_line;
    return;
  }

  my $tmp      = File::Temp->new;
  my $filename = sprintf q[%s.pdf], $tmp->filename;
  eval {
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";
    1;
  } or do {
    carp $EVAL_ERROR;
  };

  my $pdf              = CAM::PDF->new($filename);
  my $npages           = $pdf->numPages();
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  for my $pagenum (1..$npages) {
    my $str = $pdf->getPageText($pagenum);
    process_text($page, \$str, \$title);
  }

  unlink $filename;

  return 1;
}

As you can see, we’re fetching the URL using a new LWP::UserAgent and saving it to a temporary file: fairly run-of-the-mill stuff. Then the magic happens: we ask CAM::PDF to process the file and tell us how many pages it has, then iterate over those pages, ripping out all the textual content and throwing it back over to our old process_text() routine from earlier. Bingo! As a good friend of mine says, “Job’s a good ‘un”.

We’ll see how the site search script works in another post. Here’s the whole code (no, I know it’s not particularly good, especially that atrocious entity handling). Of course it’s fairly easy to extend for other document types too – I’d like to add the various (Open)Office formats but it depends what’s supported out of the box from CPAN.

#!/usr/local/bin/perl -T
#########
# Author: rmp@psyphi.net
# Created: 2011-02-04
# Perl Artistic License
#
use strict;
use warnings;
use Scrappy qw(:syntax);
use HTML::Entities;
use DBI;
use Carp;
use Getopt::Long;
use English qw(-no_match_vars);
use LWP::UserAgent;
use File::Temp qw(tempfile);
use CAM::PDF;
use CGI qw(unescape);

my $opts = {};
GetOptions($opts, qw(url=s dbname=s help)) or croak $ERRNO;

if(!$opts->{url} || !$opts->{dbname} || $opts->{help}) {
  print <<"EOT";
indexer -url=http://website.to.crawl/ -dbname=database.sqlite [-help]
EOT
  exit;
}

my $pdfs         = [];
my $starting_url = $opts->{url};
my ($dbname)     = $opts->{dbname} =~ m{([a-z\d_/.\-]+)}smix;
my $dbh          = DBI->connect(qq[DBI:SQLite:dbname=$dbname],q[],q[],{
								       RaiseError => 1,
								       AutoCommit => 0,
								      });
eval {
  $dbh->do(q[drop table idx]); # todo: build tmp index and rename
};
$dbh->do(q[create table idx(word char(32), page char(255), title char(64), context char(64), score int)]);

crawl $starting_url, {
		      'a'  => \&process_url,
		      '/*' => \&process_page,
		     };

process_pdfs($pdfs);

sub process_pdfs {
  my ($pdfs) = @_;

  for my $pdf (@{$pdfs}) {
    eval {
      process_pdf($pdf);
      1;
    } or do {
      carp $EVAL_ERROR;
    };
  }
  return 1;
}

sub cleanup_entities {
  my ($str) = @_; # takes a scalar reference

  if(!defined ${$str}) {
    ${$str} = q[];
  }

  encode_entities(${$str});
  ${$str} =~ s/&nbsp;/ /smxg;   # non-breaking spaces become plain spaces
  decode_entities(${$str});
  decode_entities(${$str});     # twice, to catch double-encoded entities

  ${$str} =~ s{[^\x20-\xff]}{ }smxig; # blank anything outside printable latin-1
  ${$str} =~ s{\s+}{ }smxg;           # collapse whitespace runs

  return 1;
}

sub process_url {
  my $url = shift->href;
  $url    =~ s{\#.*?$}{}smx;

  if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
    #########
    # unsupported protocol
    #
    return;
  }

  if($url !~ m{^$starting_url}) {
    #########
    # external website
    #
    return;
  }

  if($url =~ m{[.]pdf$}smix) {
    #########
    # special handling for PDFs
    #
    push @{$pdfs}, $url;
    return;
  }

  if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
    #########
    # unsupported filetype
    # todo: queue for independent processing
    #
    return;
  }

  queue $url;
}

sub process_page {
  my $page    = page;

  print "processing $page\n";
  my $html    = shift->html;
  my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
  my $text    = $html;
  $text       =~ s{<script(.*?)</script>}{}smxig; # strip script blocks wholesale
  $text       =~ s{<[^>]+>}{ }smxg;               # then strip any remaining tags

  return process_text($page, \$text, \$title);
}

sub process_text {
  my ($page, $text_ref, $title_ref) = @_;

  $page =~ s{$starting_url}{}smx;

  if($page !~ m{^/}smx) {
    $page = "/$page";
  }

  cleanup_entities($text_ref);
  cleanup_entities($title_ref);

  my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
              grep { length $_ > 3 }     # longer than three characters
	      map  { s{\s+}{}smxg; $_ }  # trim spaces
	      map  { lc }                # store in lowercase
	      split /\b/smx,             # split on word boundary
	      ${$text_ref};

  for my $word (@words) {
    my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];

    if(!defined $score) {
      my ($match) = ${$text_ref} =~ /($word)/smix;
      my $before  = substr ${$text_ref}, 0, $-[0];
      my $after   = substr ${$text_ref}, $+[0];
      $after      =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
      $before     = reverse $before; # reverse the string to limit backtracking.
      $before     =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
      $before     = reverse $before;

      my $context = "$before $match $after"; # use $match here instead of $word to retain case
      $context    =~ s/\s+/ /smxg;
      $context    =~ s/^\s+//smxg;
      $context    =~ s/\s+$//smxg;
      $dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);

    } else {
      $dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
    }
  }

  $dbh->commit;
  return 1;
}

sub process_pdf {
  my ($page) = @_;
  my $ua     = LWP::UserAgent->new;

  print "processing $page\n";

  $ua->agent('indexer');
  my $response = $ua->get($page);

  if (!$response->is_success) {
    carp $response->status_line;
    return;
  }

  my $tmp      = File::Temp->new;
  my $filename = sprintf q[%s.pdf], $tmp->filename;
  eval {
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";
    1;
  } or do {
    carp $EVAL_ERROR;
  };

  my $pdf              = CAM::PDF->new($filename);
  my $npages           = $pdf->numPages();
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  for my $pagenum (1..$npages) {
    my $str = $pdf->getPageText($pagenum);
    process_text($page, \$str, \$title);
  }

  unlink $filename;

  return 1;
}

Please note: WordPress keeps eating various bits of the code, like script tags, greater-than and ampersand entities so if you see weird coding errors it’s probably down to that.
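
For what it’s worth, running the indexer and poking at the results looks something like this (hypothetical site URL and database name; the script prints a “processing …” line for each page and PDF as it goes):

./indexer -url=http://www.example.com/ -dbname=example-index.sqlite
sqlite3 example-index.sqlite 'SELECT word, page, score FROM idx ORDER BY score DESC LIMIT 10;'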
