An Interview Question

I’d like to share a basic interview question I’ve used in a number of different guises over the years, both at Sanger and at ONT, but the (very small!) core remains the same. It still seems to trip up a lot of people who sell themselves as senior developers on their CVs and demand £35k+ salaries.

You have a list of characters.

  1. Remove duplicates

The time the interviewee spends scratching their head determines whether they’re a Perl programmer, or at least think like one – this is an idiomatic question in Perl. The solution is fairly standard for anyone who uses hashes, maps or associative arrays in any language; it’s certainly a lot harder without them.

The answer I would expect to see would run something like this:

#########
# pass in an array ref of characters, e.g.
# remove_dupes([qw(a e r o i g n o s e w f e r g e r i g e o n k)]);
#
sub remove_dupes {
  my $chars_in  = shift;
  my $seen      = {};
  my $chars_out = [];

  for my $char (@{$chars_in}) {
    if(!$seen->{$char}++) { # post-increment: true only on the first sighting of $char
      push @{$chars_out}, $char;
    }
  }

  return $chars_out;
}

Or for the more adventurous, using a string rather than an array:

#########
# pass in a string of characters, e.g.
# remove_dupes(q[uyavubnopwemgnisudhjopwenfbuihrpgbwogpnskbjugisjb]);
#
sub remove_dupes {
  my $str  = shift;
  my $seen = {};
  # keep the first occurrence of each character, drop the rest
  $str     =~ s/(.)/( !$seen->{$1}++ ) ? $1 : q[]/smegx;
  return $str;
}

The natural progression from Q1 then follows. If the interviewee answered Q1 inappropriately, it should become immediately obvious to them here.

  2. List duplicates

#########
# pass in an array ref of characters, e.g.
# list_dupes([qw(a e r o i g n o s e w f e r g e r i g e o n k)]);
#
sub list_dupes {
  my $chars_in  = shift;
  my $seen      = {};
  my $chars_out = [];

  for my $char (@{$chars_in}) {
    $seen->{$char}++;
  }

  return [ grep { $seen->{$_} > 1 } keys %{$seen} ];
}

and with a string:

#########
# pass in a string of characters, e.g.
# list_dupes(q[uyavubnopwemgnisudhjopwenfbuihrpgbwogpnskbjugisjb]);
#
sub list_dupes {
  my $str  = shift;
  my $seen = {};
  # keep each character only on its second occurrence, so each duplicate is listed once
  $str     =~ s/(.)/( $seen->{$1}++ == 1 ) ? $1 : q[]/smegx;
  return $str;
}

The standard follow-up is then “Given more time, what would you do to improve this?”. Well? What would you do? I know what I would do before I even started – WRITE SOME TESTS!
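
Something like this, using Test::More against the array-ref remove_dupes() above, would do for starters (the test cases are illustrative, not part of the question):

#!/usr/bin/env perl
use strict;
use warnings;
use Test::More tests => 3;

# assumes remove_dupes() from above is defined in the same file
is_deeply(remove_dupes([qw(a b a c b)]), [qw(a b c)],
          'duplicates removed, first-seen order preserved');
is_deeply(remove_dupes([]), [], 'empty input gives empty output');
is_deeply(remove_dupes([qw(x x x)]), ['x'],
          'all-identical input collapses to a single element');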

It’s pretty safe to assume that any communicative, personable candidate who starts off writing a test on the board will probably be head and shoulders above any other.

If I’m interviewing you tomorrow and you’re reading this now, it’s also safe to mention it. Interest in the subject and a working knowledge of the intertubes generally come in handy for a web developer. I’m hiring you as an independent thinker!

Configuring Hudson Continuous Integration Slaves

Roughly this time last year I set up Hudson at the office to do Oxford Nanopore’s continuous integration builds. It’s been a pleasure to roll out the automated-testing work ethic for developers and (perhaps surprisingly) there’s been less resistance than I expected. Hudson is a great weapon in the arsenal and is dead easy to set up as a master server. This week though I’ve had to do something new – configure one of our product simulators as my first Hudson slave server. Again this proved to be a doddle – thanks, Hudson!

My slave server is called “device2” – here’s what I needed to set up on the master (ci.example.net).

“Manage Hudson” => “Manage Nodes” => “New Node” => node name = “device2”, type = “dumb slave” => number of executors = 1, remote FS root = “/home/hudson/ci”, usage = “leave for tied jobs only”, launch method = JNLP.

Then on device2:

adduser hudson
su hudson
cd
mkdir ci
cd ci
wget http://ci.example.net/jnlpJars/slave.jar

Then I made a new file /etc/init.d/hudson-slave with these contents:

#!/bin/bash
#
# Init file for hudson server daemon
#
# chkconfig: 2345 65 15
# description: Hudson slave

. /etc/rc.d/init.d/functions

RETVAL=0
NODE_NAME="device2"
HUDSON_HOST=ci.example.net
USER=hudson
LOG=/home/hudson/ci/hudson.log
SLAVE_JAR=/home/hudson/ci/slave.jar

pid_of_hudson() {
    ps auxwww | grep java | grep hudson | grep -v grep | awk '{ print $2 }'
}

start() {
    echo -n $"Starting hudson: "
    COMMAND="java -jar $SLAVE_JAR -jnlpUrl \
        http://${HUDSON_HOST}/computer/${NODE_NAME}/slave-agent.jnlp \
        2>$LOG.err \
        >$LOG.out"
    su ${USER} -c "$COMMAND" &
    sleep 1
    pid_of_hudson > /dev/null
    RETVAL=$?
    [ $RETVAL = 0 ] && success || failure
    echo
}

stop() {
    echo -n "Stopping hudson slave: "
    pid=`pid_of_hudson`
    [ -n "$pid" ] && kill $pid
    RETVAL=$?
    cnt=10
    while [ $RETVAL = 0 -a $cnt -gt 0 ] &&
        { pid_of_hudson > /dev/null ; } ; do
        sleep 1
        ((cnt--))
    done

    [ $RETVAL = 0 ] && success || failure
    echo
}

status() {
    pid=`pid_of_hudson`
    if [ -n "$pid" ]; then
        echo "hudson (pid $pid) is running..."
        return 0
    fi
    echo "hudson is stopped"
    return 3
}

# dispatch on the requested action
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status
        ;;
    restart)
        stop
        start
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart|status}"
        exit 1
esac

exit $RETVAL

Then make the script executable and register the service:

chmod 755 /etc/init.d/hudson-slave
chkconfig --add hudson-slave
chkconfig hudson-slave on
service hudson-slave start
and that’s pretty much all there was to it – refreshing the node-status page on the Hudson master showed the slave had registered itself, then reconfiguring one of the existing jobs to be tied to the “device2” slave immediately started assigning builds to it. Tremendous!

Exa-, Peta-, Tera-scale Informatics: Are *YOU* in the cloud yet?

(Image: http://www.flickr.com/photos/pagedooley/2511369048/)

One of the aspects of my job over the last few years, both at Sanger and now at Oxford Nanopore Technologies, has been the management of tera-, verging on peta-scale data on a daily basis.

Various methods of handling filesystems this large have been around for a while now and I won’t go into them here. Building these filesystems is actually fairly straightforward as most of them are implemented as regular, repeatable units – great for horizontal scale-out.

No, what makes this a difficult problem isn’t the sheer volume of data, it’s the amount of churn. Churn can be defined as the rate at which new files are added and old files are removed.

To illustrate – when I left Sanger, if memory serves, we were generally recording around ten terabytes of new data a day. The staging area there was around 0.5 petabytes (using the Lustre filesystem) but didn’t balance correctly across the many disks. This meant we had to keep the utilised space below around 90% for fear of filling up an individual storage unit (and triggering unexpected errors). Ok, so that’s 450TB of usable space – at ten terabytes a day, that left 45 days of storage: one and a half months, assuming no slack.

Fair enough. Sort of. Collect the data onto the staging area, analyse it there and shift it off. Well, that’s easier said than done – you can shift it off onto slower, cheaper storage, but that’s generally archival space, so ideally you keep only raw data there. If the raw data are too big then you keep the primary analysis and ditch the raw. But there are problems with that:

  • Lots of clever people want to squeeze as much interesting stuff out of the raw data as possible using new algorithms.
  • They also keep finding things wrong with the primary analyses and so want to go back and reanalyse.
  • Added to that, there are often problems with the primary analysis pipeline itself (bleeding-edge software bugs etc.).
  • And that’s not mentioning the fact that nobody ever wants to delete anything.

As there’s little or no slack in the system, very often people are too busy to look at their own data as soon as they’re analysed, so a run might sit there broken for a week or four. What happens then is a scrum for compute resources so everything can be analysed before the remaining two weeks of staging storage run out. Even if problems are found, it can be too late to go back and reanalyse because there’s a shortage of space for new runs – and stopping the instruments because you’re out of space is a definite no-no!

What the heck? Organisationally this isn’t cool at all. Situations like this are only going to worsen! The technologies are improving all the time – run-times are increasing, read-lengths are increasing, base-quality is increasing, analysis is becoming better and more instruments are becoming available to more people who are using them for more things. That’s a many, many-fold increase in storage requirements.

So how to fix it? Well, I can think of at least one pretty good way: don’t invest in on-site long-term staging or scratch storage. If you’re worried, by all means sort out an awesome backup system, but keep it nearline or offline on a decent tape archive or similar, and absolutely do not allow user access. Instead of long-term staging storage, buy your company the fattest Internet pipe it can handle. Invest in connectivity, then simply invest in cloud storage. There are enough providers out there now to make this a competitive and interesting marketplace, with opportunities for economies of scale.

What does this give you? Well, many benefits – here are a few:

  • virtually unlimited storage
  • only pay for what you use
  • accountable costs – know exactly how much each project needs to invest
  • managed by storage experts
  • flexible computing attached to storage on-demand
  • no additional power overheads
  • no additional space overheads

Most of those I more-or-less take for granted these days. The one I find interesting at the moment is the costing issue. It can be pretty hard to apportion the cost of one centralised storage area between different groups – they’ll often pitch in for a proportion of the whole based on their estimated use compared to everyone else’s. With accountable storage offered by the cloud, each group can manage and pay for its own space. The costs are transparent to them and the responsibility has been delegated away from central management. I think that’s an extremely attractive prospect!

The biggest argument I hear against cloud storage & computing is that your top-secret, private data is in someone else’s hands. Aside from my general dislike of secret data, I still don’t believe this is a good argument. There are enough methods for handling encryption and private networking that it pretty much becomes a non-issue: encrypt the data on-site, store the keys in your own internal database, ship the data to the cloud, and when you need to run analysis, fetch the appropriate keys over an encrypted link, decode the data on demand, re-encrypt the results and ship them back. Sure, the encryption overheads add expense to the operation, but I think the costs are far outweighed by the benefits.
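
For the avoidance of doubt, here’s a minimal sketch of that round trip in Perl, shelling out to the openssl command-line tool – the key database and the cloud upload are deliberately left as stand-ins, as any real deployment would use your provider’s API and a proper key-management system:

#!/usr/bin/env perl
use strict;
use warnings;

my $file = shift @ARGV or die "usage: $0 <file>\n";

# generate a random 256-bit key, hex-encoded
open my $rnd, '<', '/dev/urandom' or die "urandom: $!";
read $rnd, my $raw, 32 or die 'short read from urandom';
close $rnd;
my $key = unpack 'H*', $raw;

# encrypt on-site before anything leaves the building
system('openssl', 'enc', '-aes-256-cbc', '-salt',
       '-in', $file, '-out', "$file.enc",
       '-pass', "pass:$key") == 0 or die 'encryption failed';

# stand-ins: store $key in your internal key database and
# ship "$file.enc" off to your cloud provider of choice
print "encrypted $file -> $file.enc (key held locally)\n";

# later, fetch the key back over an encrypted link and decrypt on demand
system('openssl', 'enc', '-d', '-aes-256-cbc',
       '-in', "$file.enc", '-out', "$file.dec",
       '-pass', "pass:$key") == 0 or die 'decryption failed';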

The Importance of Being Open

(Image: “Openness => Leadership” – http://www.flickr.com/photos/psd/3717444865/sizes/m/)

Those of you who know me know I’m pretty keen on openness. The hypocrisy of using a PowerBook for my daily work and keeping an iPhone in my pocket continues to not be lost on me, but I want to spend a few lines on the power of openness.

Imagine, if you will, a company developing a new laboratory device. The device is amazing and, for early-access partners, exceeds expectations in terms of performance, but the feeling in the early-access community is that it’s capable of much more.

The company is keen to work with the community but wants to retain control for various reasons – most importantly to restrict its support matrix and reduce its long-term headaches. Fair enough. To this end all changes (and we’re talking principally about software changes here, but the same applies equally to hardware, chemistry etc.) have to be emailed to the principal developer before they’re considered for incorporation into the product.

Let’s say the device was released to 20 sites worldwide for early-access. Now each one of those institutes is going to want to twiddle and tweak all the settings in order to make the device perform for their experiments.

So now you have community contributions from 20 different sites. Great! No.. wait.. what’s that? The principal developer’s on a tight schedule for the next major release? Oh.. the release is only incorporating changes the company wants to focus on for the mass market (e.g. packaging, polish, robustness). Hmm. Okay, so the community patches are put back until the next release… then the next…

Then something cool happens. The community sets up their own online exchanges to provide peer-to-peer support. Warranties and support be damned, they’ve managed to make the devices work twice as fast for half the cost of reagents. They’re also bypassing all the artificial delays imposed by the company, receiving updates as quickly as they’re released, and left with the choice of whether to deploy a particular patch or not.

So where’s the company left after all this? Well, the longer this inadvertent exclusion of the community continues, the further out in the cold the company will be left – and the worse its reputation will be as portrayed by its early-access clients.

Surely this doesn’t happen? Of course it does. Every day. Companies still fear engaging with Open Source as a valid business model, and enterprising hackers bypass arbitrary, mostly pointless restrictions on all sorts of devices (*cough* DRM!) to make them work in ways the original manufacturer never intended.

So.. what’s the moral of this story? If you’re developing devices, be they physical or virtual, make them open. Give them simple, open APIs and good examples. Give them a public revision-control system like git or subversion. Self-documentation may sound like a cliché, but there’s nothing like a good usage example (or unit/functional/integration test) to define how to utilise a service.
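
To labour the tests-as-documentation point, here’s a minimal Perl sketch of a functional test doubling as a usage example – the host, endpoint and JSON shape are all invented for illustration:

#!/usr/bin/env perl
use strict;
use warnings;
use Test::More tests => 2;
use LWP::UserAgent;

# hypothetical device API: a device reporting its status as JSON
my $ua  = LWP::UserAgent->new(timeout => 10);
my $res = $ua->get('http://device.example.net/api/status');

ok($res->is_success, 'status endpoint responds');
like($res->decoded_content, qr/"state"\s*:\s*"(?:idle|running)"/,
     'device reports a sane state');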

If your devices are good at what they do (caveat!) then because they’re open, they’ll proceed to dominate the market until something better comes along. The community will adopt them, support them and extend them and you’ll still have sales and core support. New users will want to make the investment simply because of what others have done with the platform.

Ok, so perhaps this is too naïve a standpoint. Big companies can use the weight of their “industry standards”, large existing customer base, or even ease of use to bulldoze widespread adoption, but a lack of openness doesn’t suit everybody, and my feeling is that a lot of users, or would-be users, don’t appreciate being straitjacketed into a one-size-fits-all solution. Open FTW!