ARISS Contact planned for school in India audible in Europe

With Discovery’s final trip earlier this week the ISS has been in the news again, so I was interested to see this earlier today, received via John G8DET. Hopefully I’ll remember it’s happening and attempt to tune in.

An ARISS school contact is scheduled for Saturday 12 March at 13:34 UTC (14:34 CEWT) with Shri Krishna Vithyaalayam Matric School, Pollachi (TK), India.

The school is located in a rural part of Tamil Nadu state in India, and has a global partnership program with Westborough School in the UK.

Mr M. K. Anantha Ganesan, the founder and Principal, visited Westborough School in September 2009. He is an amateur radio operator (VU3GPF). He has also visited the Chelmsford Amateur Radio Society and was made an honorary member. The Anamalai Amateur Radio Club operates from the school.

The contact will be a telebridge, operated by IK1SLD.

Downlink signals will be audible over Europe on 145.800 MHz FM.

The conversation will be conducted in English. Students will ask as many of the following questions as time allows.

  1. How many space walks are allowed and what purpose do space walks serve?
  2. How long do the astronauts have to undergo endurance tests?
  3. What type of food is taken by the astronauts?
  4. How far is the Chandra X-ray Observatory useful in studying exotic phenomena?
  5. What new construction is taking place in the ISS?
  6. What kind of scientific and technological experiments are undertaken by the astronauts?
  7. Please tell us about the emotional changes you undergo while spending months together in space.
  8. Could you please tell us if the space junk will hamper space travel?
  9. What were your first impressions on seeing mother Earth from space?
  10. What steps have been initiated to reduce the impact of radiation exposure to the astronauts?
  11. What functions have been assigned to the Robonaut?
  12. Can you explain the term microgravity and how astronauts cope with that?
  13. What is your message to humanity as a whole and the students like us from India?

ARISS is an international educational outreach program partnering the participating space agencies, NASA, Russian Space Agency, ESA, CNES, JAXA, and CSA, with the AMSAT and IARU organizations from participating countries.

ARISS offers an opportunity for students to experience the excitement of Amateur Radio by talking directly with crewmembers onboard the International Space Station. Teachers, parents and communities see, first hand, how Amateur Radio and crewmembers on ISS can energize youngsters’ interest in science, technology and learning.

Gaston Bertels, ON4WF
ARISS Chairman

Thanks to SEARS Canvey Island for the co-operation with India.

John G8DET CARS.

A Site Indexer for Small Websites

So recently with one of my projects I needed a super simple website search facility. I didn’t want to go with Google because the integration is a bit ugly (and then there are the adverts). I didn’t want to go with Lucene or htDig because they were too heavy for my needs. I also didn’t want to use KinoSearch or Plucene because whilst they’re both great projects they were still over-the-top for what I needed to do.

So I decided to do what all bad programmers do – write my own. How hard can it be?

Well, the first thing you need is an indexer. The indexer downloads pages from the target website, processes each page for links, which it adds to its queue, and processes the textual content for keywords, which it flattens out and stores in an inverted index – that is, one which makes it fast to go from a word to the website URLs it appears on. Along the way it can also store some context about each word, such as where it sits on the page, whether it’s bold or in a heading, and which words are near it or link to it.
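To be concrete, this is roughly the shape of the thing I mean – purely illustrative, and in practice it gets flattened into database rows rather than held in memory like this:

my %inverted_index = (
  'radio'  => { '/contact.html' => 3, '/about.html' => 1 },  # word => { page => occurrence count }
  'search' => { '/blog/indexer.html' => 5 },
);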

So the first port of call was http://search.cpan.org/ where I found Scrappy from Al Newkirk: an automated crawler with callback handlers for different HTML tags. Awesome! This would sort out all of the web fetching and HTML processing I had to do.

So I started with this:

crawl $starting_url, {
		      'a'  => \&process_url,
		      '/*' => \&process_page,
		     };

The function process_url() takes care of only processing links we like – the right sort of protocols (no mailto, file, ftp, gopher, wais, archie etc.) and the right sorts of files (don’t want to do much with images, css, javascript etc).

sub process_url {
  my $url = shift->href;
  $url    =~ s{\#.*?$}{}smx;

  if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
    #########
    # unsupported protocol
    #
    return;
  }

  if($url !~ m{^$starting_url}) {
    #########
    # external website
    #
    return;
  }

  if($url =~ m{pdf}smix) {
    #########
    # special handling for PDFs
    #
    push @{$pdfs}, $url;
    return;
  }

  if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
    #########
    # unsupported filetype
    # todo: queue for independent processing
    #
    return;
  }

  queue $url;
}

You’ll notice that if we find a PDF we save it out for later, but generally “normal” HTML links to HTML pages will be added to Scrappy’s queue.

process_page() then looks a little like this:

sub process_page {
  my $page    = page;

  print "processing $page\n";
  my $html    = shift->html;
  my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
  my $text    = $html;
  $text       =~ s{<script.*?</script>}{}smxig;
  $text       =~ s{<[^>]+>}{ }smxg;

  return process_text($page, \$text, \$title);
}

This does some pretty ugly stuff – pull out the title, remove all script tags because they’re particularly unpleasant, then attempt to strip out all tags using a pattern match. I suppose this could be compared to brain surgery with a shovel – highly delicate!
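If the shovel ever feels too blunt, a proper HTML stripper is the kinder instrument. This isn’t what the indexer below uses, but as a sketch, something like HTML::Strip from CPAN would do the same job rather more carefully:

use HTML::Strip;

my $hs    = HTML::Strip->new;   # removes tags, and script/style content with them
my $clean = $hs->parse($html);  # $html being the raw page, as above
$hs->eof;                       # reset parser state before the next document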

The bulk of the text processing is broken out into a separate process_text() function so it can be reused for the PDF processing I want to do later on. It looks something like this:

sub process_text {
  my ($page, $text_ref, $title_ref) = @_;

  $page =~ s{$starting_url}{}smx;

  if($page !~ m{^/}smx) {
    $page = "/$page";
  }

  cleanup_entities($text_ref);
  cleanup_entities($title_ref);

  my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
              grep { length $_ > 3 }     # longer than three characters
	      map  { s{\s+}{}smxg; $_ }  # trim spaces
	      map  { lc }                # store in lowercase
	      split /\b/smx,             # split on word boundary
	      ${$text_ref};

  for my $word (@words) {
    my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];

    if(!defined $score) {
      my ($match) = ${$text_ref} =~ /($word)/smix;
      my $before  = substr ${$text_ref}, 0, $-[0];
      my $after   = substr ${$text_ref}, $+[0];
      $after      =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
      $before     = reverse $before; # reverse the string to limit backtracking.
      $before     =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
      $before     = reverse $before;

      my $context = "$before $match $after"; # use $match here instead of $word to retain case
      $context    =~ s/\s+/ /smxg; # switch many spaces for one
      $context    =~ s/^\s+//smxg; # strip leading space
      $context    =~ s/\s+$//smxg; # strip trailing space
      $dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);

    } else {
      $dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
    }
  }

  $dbh->commit;
  return 1;
}

Firstly this routine makes the incoming URL relative as it’s only indexing one site. Next it attempts to munge high characters into html entities using the same sort of shovel-brute-force approach as before. Next comes something a little smarter – split all words, lowercase them, trim them up and filter them to anything alphanumeric and longer than 3 letters.

Then for each one of these words, generate the context string of 10 words before and after and store them to a SQLite (oh, I didn’t mention that, did I?) database together with the count for that word’s presence in this URL/document.
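As a taste of how that index gets used later, the search side boils down to a query along these lines – just a sketch for now, assuming the idx table created in the full listing below and a $query_word supplied by the search form:

my $results = $dbh->selectall_arrayref(
  q[SELECT page, title, context, score
      FROM idx
     WHERE word = ?
  ORDER BY score DESC
     LIMIT 20],
  {}, lc $query_word,   # lowercase to match how words were stored at index time
);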

Cool! So I now have a SQLite database containing URL, word, wordcount (score) and context. That’s pretty much all I need for a small website search… except, oh. Those dratted PDFs… Okay, here’s process_pdf().

sub process_pdf {
  my ($page) = @_;
  my $ua     = LWP::UserAgent->new;

  print "processing $page\n";

  $ua->agent('indexer');
  my $response = $ua->get($page);

  if (!$response->is_success) {
    carp $response->status_line;
    return;
  }

  my $tmp      = File::Temp->new;
  my $filename = sprintf q[%s.pdf], $tmp->filename;
  eval {
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";
    1;
  } or do {
    carp $EVAL_ERROR;
  };

  my $pdf              = CAM::PDF->new($filename);
  my $npages           = $pdf->numPages();
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  for my $pagenum (1..$npages) {
    my $str = $pdf->getPageText($pagenum);
    process_text($page, \$str, \$title);
  }

  unlink $filename;

  return 1;
}

As you can see, we’re fetching the URL with a new LWP::UserAgent and saving it to a temporary file – fairly run-of-the-mill stuff. Then the magic happens: we ask CAM::PDF to open the file and tell us how many pages it has, iterate over those pages ripping out the textual content, and throw it back over to our old process_text() routine from earlier. Bingo! As a good friend of mine says, “Job’s a good ‘un”.

We’ll see how the site search script works in another post. Here’s the whole code (no, I know it’s not particularly good, especially that atrocious entity handling). Of course it’s fairly easy to extend for other document types too – I’d like to add the various (Open)Office formats, but it depends what’s supported out of the box on CPAN (there’s a rough sketch of an ODT handler after the listing).

#!/usr/local/bin/perl -T
#########
# Author: rmp@psyphi.net
# Created: 2011-02-04
# Perl Artistic License
#
use strict;
use warnings;
use Scrappy qw(:syntax);
use HTML::Entities;
use DBI;
use Carp;
use Getopt::Long;
use English qw(-no_match_vars);
use LWP::UserAgent;
use File::Temp qw(tempfile);
use CAM::PDF;
use CGI qw(unescape);

my $opts = {};
GetOptions($opts, qw(url=s dbname=s help)) or croak $ERRNO;

if(!$opts->{url} || !$opts->{dbname} || $opts->{help}) {
  print <<EOT;
indexer -url=http://website.to.crawl/ -dbname=database.sqlite [-help]
EOT
  exit;
}

my $pdfs         = [];
my $starting_url = $opts->{url};
my ($dbname)     = $opts->{dbname} =~ m{([a-z\d_/.\-]+)}smix;
my $dbh          = DBI->connect(qq[DBI:SQLite:dbname=$dbname],q[],q[],{
								       RaiseError => 1,
								       AutoCommit => 0,
								      });
eval {
  $dbh->do(q[drop table idx]); # todo: build tmp index and rename
};
$dbh->do(q[create table idx(word char(32), page char(255), title char(64), context char(64), score int)]);

crawl $starting_url, {
		      'a'  => \&process_url,
		      '/*' => \&process_page,
		     };

process_pdfs($pdfs);

sub process_pdfs {
  my ($pdfs) = @_;

  for my $pdf (@{$pdfs}) {
    eval {
      process_pdf($pdf);
      1;
    } or do {
      carp $EVAL_ERROR;
    };
  }
  return 1;
}

sub cleanup_entities {
  my ($str) = @_;

  if(!defined ${$str}) {
    ${$str} = q[];
  }

  encode_entities(${$str});
  ${$str} =~ s/&nbsp;/ /smxg;  # swap non-breaking spaces for plain ones
  decode_entities(${$str});    # decode twice to catch double-encoded entities
  decode_entities(${$str});

  ${$str} =~ s{[^\x20-\xff]}{ }smxig;
  ${$str} =~ s{\s+}{ }smxg;

  return 1;
}

sub process_url {
  my $url = shift->href;
  $url    =~ s{\#.*?$}{}smx;

  if($url =~ m{^(file|mailto|ftp|gopher|wais|archie|nntp|irc|news|telnet|svn)}) {
    #########
    # unsupported protocol
    #
    return;
  }

  if($url !~ m{^$starting_url}) {
    #########
    # external website
    #
    return;
  }

  if($url =~ m{pdf}smix) {
    #########
    # special handling for PDFs
    #
    push @{$pdfs}, $url;
    return;
  }

  if($url =~ m{(png|jpe?g|gif|zip|css|js|docx?|pptx?|xlsx?|odt|odp|ods)$}smix) {
    #########
    # unsupported filetype
    # todo: queue for independent processing
    #
    return;
  }

  queue $url;
}

sub process_page {
  my $page    = page;

  print "processing $page\n";
  my $html    = shift->html;
  my ($title) = $html =~ m{<title>(.*?)</title>}smxi;
  my $text    = $html;
  $text       =~ s{<script.*?</script>}{}smxig;
  $text       =~ s{<[^>]+>}{ }smxg;

  return process_text($page, \$text, \$title);
}

sub process_text {
  my ($page, $text_ref, $title_ref) = @_;

  $page =~ s{$starting_url}{}smx;

  if($page !~ m{^/}smx) {
    $page = "/$page";
  }

  cleanup_entities($text_ref);
  cleanup_entities($title_ref);

  my @words = grep { /[a-z\d]{3,}/smix } # at least three alphanumerics
              grep { length $_ > 3 }     # longer than three characters
	      map  { s{\s+}{}smxg; $_ }  # trim spaces
	      map  { lc }                # store in lowercase
	      split /\b/smx,             # split on word boundary
	      ${$text_ref};

  for my $word (@words) {
    my $score = $dbh->selectall_arrayref(q[SELECT score from idx WHERE word=? AND page=?], {}, $word, $page)->[0]->[0];

    if(!defined $score) {
      my ($match) = ${$text_ref} =~ /($word)/smix;
      my $before  = substr ${$text_ref}, 0, $-[0];
      my $after   = substr ${$text_ref}, $+[0];
      $after      =~ s/((?:(?:\w+)(?:\W+)){10}).*/$1/;
      $before     = reverse $before; # reverse the string to limit backtracking.
      $before     =~ s/((?:(?:\W+)(?:\w+)){10}).*/$1/;
      $before     = reverse $before;

      my $context = "$before $match $after"; # use $match here instead of $word to retain case
      $context    =~ s/\s+/ /smxg;
      $context    =~ s/^\s+//smxg;
      $context    =~ s/\s+$//smxg;
      $dbh->do(q[INSERT INTO idx (word,page,title,score,context) values(?,?,?,1,?)], {}, $word, $page, ${$title_ref}, $context);

    } else {
      $dbh->do(q[UPDATE idx SET score=score+1 WHERE word=? AND page=?], {}, $word, $page);
    }
  }

  $dbh->commit;
  return 1;
}

sub process_pdf {
  my ($page) = @_;
  my $ua     = LWP::UserAgent->new;

  print "processing $page\n";

  $ua->agent('indexer');
  my $response = $ua->get($page);

  if (!$response->is_success) {
    carp $response->status_line;
    return;
  }

  my $tmp      = File::Temp->new;
  my $filename = sprintf q[%s.pdf], $tmp->filename;
  eval {
    open my $fh, q[>], $filename or croak "Error opening $filename: $ERRNO";
    binmode $fh;
    print {$fh} $response->decoded_content or croak "Error writing to $filename: $ERRNO";
    close $fh or croak "Error closing $filename: $ERRNO";
    1;
  } or do {
    carp $EVAL_ERROR;
  };

  my $pdf              = CAM::PDF->new($filename);
  my $npages           = $pdf->numPages();
  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  for my $pagenum (1..$npages) {
    my $str = $pdf->getPageText($pagenum);
    process_text($page, \$str, \$title);
  }

  unlink $filename;

  return 1;
}
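And since I mentioned the (Open)Office formats: an ODT file is just a zip archive with the body text in content.xml, so a handler might take roughly this shape. It’s an untested sketch rather than part of the working code – it assumes Archive::Zip is added to the imports, that the document has already been fetched to a local file (as process_pdf() does), and that process_url() is taught to queue .odt links the way it queues PDFs:

sub process_odt {
  my ($page, $filename) = @_;   # URL plus a saved local copy, as for PDFs

  my $zip = Archive::Zip->new($filename) or return;
  my $xml = $zip->contents('content.xml');   # ODT body text lives in content.xml
  return if !defined $xml;

  $xml =~ s{<[^>]+>}{ }smxg;    # same shovel-grade tag stripping as before

  my ($short_filename) = $page =~ m{([^/]+)$}smix;
  my $title            = unescape($short_filename);

  return process_text($page, \$xml, \$title);
}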


Please note: WordPress keeps eating various bits of the code, like script tags, greater-than and ampersand entities so if you see weird coding errors it’s probably down to that.

Bookmarks for January 21st through February 5th

These are my links for January 21st through February 5th:

Bookmarks for December 7th through January 21st

These are my links for December 7th through January 21st:

Automated basic Xen snapshots

With a little bit of shell-scripting I made this. It runs on a dom0 and scans through virtual machines running on a cluster, requesting snapshots of each one. You may want to use vm-snapshot-with-quiesce if it’s supported by xen-tools on your domU machines.

for i in `xe vm-list params=name-label | grep name | awk '{print $NF}' | xargs echo`; do echo $i; xe vm-snapshot vm=$i new-name-label="$i-`date +'%Y-%m-%dT%H:%M:%S'`"; done

If you cron the above script every day then after a few days you may want to start deleting the oldest snapshots:

for i in `xe vm-list params=name-label | grep name | awk '{print $NF}'`; do snapshot=`xe vm-list name-label=$i params=snapshots | awk '{print $NF}'`; echo "$i earliest snapshot is $snapshot"; if [ "$snapshot" ]; then xe vm-uninstall vm=$snapshot force=true; fi; done

Bookmarks for November 25th through December 6th

These are my links for November 25th through December 6th:

Bookmarks for November 15th through November 25th

These are my links for November 15th through November 25th:

Bookmarks for October 27th through November 15th

These are my links for October 27th through November 15th:

Bookmarks for October 14th through October 27th

These are my links for October 14th through October 27th:

Bookmarks for October 7th through October 13th

These are my links for October 7th through October 13th: