Random Sequence Mutator

Here’s a handy one(ish)-liner to mutate an input sequence using Perl’s RegEx engine:

epiphyte:~ rmp$ perl -e '$seq="ACTAGCTACGACTAGCATCGACT"; $mutants = [qw(A C T G)];
  print "$seq\n";
  $seq =~ s{([ATGC])}{ rand() < 0.5 ? $mutants->[int rand 4] : $1 }smiexg;
  print "$seq\n";'
ACTAGCTACGACTAGCATCGACT
ACAATCGCGGACCAGAATCTCTT

This gives each base in $seq a 50% chance (rand() < 0.5) of mutating, but because the original base is still present in the $mutants array, each mutation has a further 25% chance of landing on the same base again, making the effective per-base mutation rate 37.5% (0.5 × 3/4). If you wanted to improve it by excluding the original base from each mutation you might do something like:

epiphyte:~ rmp$ perl -e '$seq="ACTAGCTACGACTAGCATCGACT"; $mutants = [qw(A C T G)];
  $mutsize=scalar @{$mutants}; print "$seq\n";
  $seq =~ s{([ATGC])}{ rand() < 0.5 ? [grep { $_ ne $1 } @{$mutants}]->[int rand $mutsize-1] : $1 }smiexg;
  print "$seq\n";'
ACTAGCTACGACTAGCATCGACT
TGTAGATAATGTGATACGAGACT

This (quite inefficiently) builds a fresh array of the available options from $mutants, excluding $1, the matched base, at every position.

Unrolling it and tidying it up a little for readability looks like this:

my $seq     = 'ACTAGCTACGACTAGCATCGACT';
my $mutants = [qw(A C T G)];
my $mutsize = scalar @{$mutants};

print "$seq\n";

$seq =~ s{([ATGC])}{
   rand() < 0.5
    ?
   [grep { $_ ne $1 } @{$mutants}]->[int rand $mutsize-1]
    :
   $1
 }smiexg;

print "$seq\n";'

unique, overlapping kmer strings

Tinkering today I wrote a quick toy to generate strings of unique, overlapping kmers. Not particularly efficient, but possibly handy nonetheless.

It takes a given k size, a configurable overlap and optionally the bases to use. First it generates a list of all the kmers, then it recursively scans for matching overlapping kmers and extends a seed, terminating the recursion and printing whenever all kmers have been used.

Run it like so:

 ./kmer-overlap -k=3 -overlap=2 ACTG
#!/usr/local/bin/perl
#########
# Author:        rmp
# Created:       2013-05-15
# Last Modified: $Date$
# Id:            $Id$
# HeadURL:       $HeadURL$
#
use strict;
use warnings;
use Getopt::Long;
use Readonly;
use English qw(-no_match_vars);

Readonly::Scalar our $DEFAULT_K => 3;
Readonly::Scalar our $DEFAULT_BASES => [qw(A C T G)];

my $opts = {};
GetOptions($opts, qw(k=s overlap=s rand help));

if($opts->{help}) {
  print <<"EOT";
$PROGRAM_NAME - rmp 2013-05-15

Usage:
 $PROGRAM_NAME -k=3 -overlap=2 -rand ACTG
EOT
  exit;
}

my $k       = $opts->{k}       || $DEFAULT_K;
my $overlap = $opts->{overlap} || $k-1;
my $bases   = $DEFAULT_BASES;

if(scalar @ARGV) {
  $bases = [grep { $_ } map {split //smx} @ARGV];
}

#########
# Build all available kmers
#
my $kmers = [];

for my $base1 (@{$bases}) {
  build($base1, $bases, $kmers);
}

#########
# optionally randomise the seeds
#
if($opts->{rand}) {
  shuffle($kmers);
}

#########
# start with a seed
#
for my $seed (@{$kmers}) {
  my $seen = {
	      $seed => 1,
	     };
  solve($seed, $seen);
}

sub build {
  my ($seq, $bases, $kmers) = @_;
  if(length $seq == $k) {
    #########
    # reached target k - store & terminate
    #
    push @{$kmers}, $seq;
    return 1;
  }

  for my $base (@{$bases}) {
    ########
    # extend and descend
    #
    build("$seq$base", $bases, $kmers);
  }

  return;
}

sub solve {
  my ($seq_in, $seen) = @_;

  if(scalar keys %{$seen} == scalar @{$kmers}) {
    #########
    # exhausted all kmers - completed!
    #
    print $seq_in, "\n";
    return 1;
  }

  my $seq_tail     = substr $seq_in, -$overlap, $overlap;

  my $overlapping  = [grep { $_ =~ /^$seq_tail/smx } # filter in only seqs which overlap the seed tail
		      grep { !$seen->{$_} }          # filter out kmers we've seen already
		      @{$kmers}];
  if(!scalar @{$overlapping}) {
    #########
    # no available overlapping kmers - terminate!
    #
    return;
  }

  if($opts->{rand}) {
    shuffle($overlapping);
  }

  my $overhang = $k-$overlap;
  for my $overlap_seq (@{$overlapping}) {
    #########
    # extend and descend
    #
    my $seq_out = $seq_in . substr $overlap_seq, -$overhang, $overhang;
    solve($seq_out, {%{$seen}, $overlap_seq => 1});
  }

  return;
}

sub shuffle {
  my ($arr_in) = @_;
  #########
  # in-place Fisher-Yates shuffle
  #
  for my $i (reverse 1..scalar @{$arr_in}-1) {
    my $j = int rand($i+1);
    ($arr_in->[$i], $arr_in->[$j]) = ($arr_in->[$j], $arr_in->[$i]);
  }
  return;
}

Output looks like this:

epiphyte:~ rmp$ ./kmer-overlap -k=2 AC
AACCA
ACCAA
CAACC
CCAAC

restart a script when a new version is deployed

I have a lot of scripts running in a lot of places, doing various little jobs, mostly shuffling data files around and feeding them into pipelines and suchlike. I also use Jenkins CI to automatically run my tests and build deb packages for Debian/Ubuntu Linux. Unfortunately, being a lazy programmer, I haven’t read up on all the great things deb and apt can do, so I don’t know how to fire shell commands like “service x reload” or “/etc/init.d/x restart” once a package has been deployed. Kicking a script to pick up changes is quite a common thing to do.

Instead I have a little trick that makes use of the build process changing the timestamps on files when it rolls up the package. When the script wakes up and starts the next iteration of its event loop, the first thing it does is check its own timestamp; if that differs from the previous iteration’s, it execs itself, replacing the running process with a fresh copy.

One added gotcha is that if you want to run in taint mode you need to satisfy a bunch of extra requirements, such as detainting $ENV{PATH} and all command-line arguments, before any re-execing occurs.
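
By way of illustration, here’s a minimal detainting sketch for a script running under -T; the allow-listed character class is an assumption, so widen it to suit your own arguments:

#!/usr/local/bin/perl -T
use strict;
use warnings;

#########
# replace the tainted PATH with a known-safe value and remove other
# variables the shell might honour
#
$ENV{PATH} = '/bin:/usr/bin';
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

#########
# untaint command-line arguments by capturing against an allow-list
# of characters (illustrative - permit only what your arguments need)
#
my @original_argv;
for my $arg (@ARGV) {
  if($arg =~ m{\A([[:word:].=/-]+)\z}smx) {
    push @original_argv, $1; # capturing through a regex untaints the value
  } else {
    die qq[refusing tainted-looking argument: $arg\n];
  }
}

With that out of the way, the script itself looks like this: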

#!/usr/local/bin/perl
# -*- mode: cperl; tab-width: 8; indent-tabs-mode: nil; basic-offset: 2 -*-
# vim:ts=8:sw=2:et:sta:sts=2
#########
# Author: rpettett
# Last Modified: $Date$
# Id: $Id$
# $HeadURL$
#
use strict;
use warnings;
use Readonly;
use Carp;
use English qw(-no_match_vars);
our $VERSION = q[1.0];

Readonly::Scalar our $SLEEP_LONG  => 600;
Readonly::Scalar our $SLEEP_SHORT => 30;

$OUTPUT_AUTOFLUSH++;

my @original_argv = @ARGV;

#########
# handle SIGHUP restarts
#
local $SIG{HUP} = sub {
  carp q[caught SIGHUP];
  exec $PROGRAM_NAME, @original_argv;
};

my $last_modtime;

while(1) {
  #########
  # handle software-deployment restarts
  #
  my $modtime = -M $PROGRAM_NAME;

  if($last_modtime && $last_modtime ne $modtime) {
    carp q[re-execing];
    exec $PROGRAM_NAME, @original_argv;
  }
  $last_modtime = $modtime;

  my $did_work_flag;
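  #########
  # do_stuff() is the site-specific payload (not defined here); it is
  # assumed to return true when it processed something this iteration
  #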
  eval {
    $did_work_flag = do_stuff();
    1;
  } or do {
    $did_work_flag = 0;
  };

  local $SIG{ALRM} = sub {
    carp q[rudely awoken by SIGALRM];
  };

  my $sleep = $did_work_flag ? $SLEEP_SHORT : $SLEEP_LONG;
  carp qq[sleeping for $sleep];
  sleep $sleep;
}

Middleware and Monorails

It’s two hours in to the three hour e-commerce process mapping meeting with three sets of consultants in the heart and heat of London with the traffic noise, police sirens and vaguely pleasant Hare Krishna chanting wafting in through the open windows of the meeting room. I’m minding my own business quietly in the corner, designing pipelines, data flows and object models, coding prolifically and generally doing much more useful things in the safe confines of my head when it happens… What’s your middleware? Hmm? What? Was that directed at me? Of course it was. Nuts.

Middleware, middleware, I’d better think fast. Not a term I use but I’m sure I can remember what it actually means. It must sit between things… Oh yes, it’s coming back to me now. Middleware, the point-and-click programmer’s equivalent of Perl. Middleware, yet another layer of largely unnecessary abstraction. An API for APIs. A tool for fooling developers into thinking that they’re not tightly coupling their applications together when instead they’re tightly coupling to a third-party system they have even less control over because they’re incapable of agreeing direct service communication specs with that other application. “It’s ok, it’s standards-based”. Sure it is. Whatever you say… Middleware is the consultants’ friend though – a clever sounding service that does the integration for you but generally requires little direct development and provides easy resale of the same work to multiple clients.

I’ve been developing large-scale, high traffic websites for a few years now and I’ve never had need for a component that specifically markets itself as middleware. Not once. I’ve used plenty of APIs, web services, object brokers, message queues, key-value stores and any other number of components with easily identifiable purposes but I think using middleware for middleware’s sake is still just a little too meta for me. Yessir! It’s a genuine, bona fide, API’d middleware. I see it absolutely as a Springfield Monorail application, but those Shelbyville folk are so much smarter, maybe I should be more like them.

What’s my middleware? None, I don’t have one. I don’t want one, I’m pretty sure I don’t need one. Cue surprised looks, amusement and disbelief.

Ok, twist my arm, I suppose I quite like Zapier. (Ahh, ok, he’s one of us after all)

Update: I tried to find a pretty picture to augment the post with but I couldn’t find anything on Google Images or Flickr that didn’t make me want to punch the screen.

Systems & Security Tools du jour

I’ve been to two events in the past two weeks which have started me thinking harder about the way we protect and measure our enterprise systems.

The first of the two events was the fourth Splunk Live in St. Paul’s, London last week. I’ve been a big fan of Splunk for a few years but I’ve never really tried it out in production. The second was InfoSec at Earl’s Court. More about that one later.

What is Splunk?

To be honest, Splunk is different things to different people. Since inception it’s had great value as a log collation and event alerting tool for systems administrators, as that was what it was originally designed to do. However, as both DJ Skillman and Godfrey Sullivan pointed out, Splunk has grown into a lot more than that. It solved a lot of “Big Data” (how I hate that phrase) problems before Big Data was trendy, taking arbitrary unstructured data sources, structuring them in useful ways, indexing the hell out of them and adding friendly, near-real-time reporting and alerting on top. Nowadays, given the right data sources, Splunk is capable of providing across-the-board Operational Intelligence, yielding tremendous opportunities for measuring the value of processes and events.

How does it work?

In order to make the most out of a Splunk installation you require at least three basic things:

  1. A data source - anything from a basic syslog or Apache web server log to a live high level ERP logistics event feed or even entire code commits
  2. An enrichment process – something to tag packets, essentially assigning value to indexed fields and allowing fields from different feeds to be associated, e.g. tallying new orders against a customer database and stock-keeping, perhaps.
  3. A report – a canned report presented on a dashboard for your CFO, for example, or an email alert to tell your IT manager that someone squirting five-day experiments in at the head of the analysis pipeline is going to push your AWS costs over budget in three days’ time.

How far can you go with it?

Well, here’s a few from the pick ‘n’ mix selection of things I’d like to start indexing as soon as we sort out a) the restricted data limits of our so-far-free Splunk installation and b) what’s legal to do:

  • Door id access (physical site presence)
  • VPN logins (virtual site presence)
  • Wifi device registrations (guest, internal, whatever)
  • VoIP + PSTN call logs (number, duration)
  • Environmentals – temperature, humidity of labs, offices, server rooms
  • System logs for everything (syslog, authentication, Apache, FTPd, MySQL connections, Samba, the works)
  • SGE job logs with user & project accounting
  • Application logs for anything we’ve written in house
  • Experimental metadata (who ran what when, where, why)
  • Domains for all incoming + outgoing mail, plus mail/attachment weights (useful for spotting outliers exfiltrating data)
  • Firewall: accepted incoming connections
  • Continuous Integration test results (software project, timings, memory, cpu footprints)
  • SVN/Git code commits (yes, it’s possible to log the entire change set)
  • JIRA tickets (who, what, when, project, component, priority)
  • ERP logs (supply chain, logistics, stock control, manufacturing lead times)
  • CRM + online store logs (customer info, helpdesk cases, orders)
  • anything and everything else with vaguely any business value

I think it’s pretty obvious that all this stuff taken together constitutes what most people call Big Data these days. There’s quite a distinction between that sort of mixed relational data and the plainer “lots of data” I deal with day to day, experimental data in the order of a terabyte-plus per device per day.

Charts from Tables with D3js and jQuery

I’ve been tinkering with D3js on and off for a couple of months now, purely for generating simple, inline charts in web pages, made from data already dumped into HTML tables. Doing this is easier than building, caching and referencing external bitmap (PNG, GIF or whatever) images with Gnuplot or GD::Graph, and also simpler than building bitmap images and serving them base64-encoded inline with <img alt="" src="data:…" />.

Using jQuery (or similar) to extract data from an already-present HTML table means there’s almost no code required whenever you want to add and plot a new column that someone might want to report on. Pushing all the work to the client should also mean slightly lighter server loads, though granted it’s already done the heavy lifting during the query to generate the table.

I’ve used examples from a number of sources, mostly from over on the d3js.org website itself and Mike Bostock’s inspiring example gallery. Plus the ever useful jQuery and jQueryUI libraries.

The result is a tabbed (with a jqueryui-themed unordered list) report based on the data table below. Clicking either a tab or a table heading (any except the date) will animate and redraw the chart above. The data are collected using a jQuery selector on the column classes in each row.

Feel free to take and reuse it – just pinch the frame source.