Interactivity Experiments

For a few months now I’ve been watching utterly compelling and inspirational HCI things like these:

I know most of them are a bit dated now, in fact from as far back as 2006, but they’re still jaw-droppingly awesome.

So, in a fit of inspiration, weekend-project madness and frustration at the clumsiness of a regular touch-screen LCD, I’ve been picking things up from eBay and fishing around in my boxes of knackered electronics for components suitable for assembling one or two of these sorts of devices.

There are two types of these interactive interfaces. The first is the JCL-style, Wiimote-based kind, which uses bright sources of infrared, either transmitted or reflected, together with the Bluetooth Nintendo controller. The second is the Jeff Han / Perceptive Pixel style of frustrated total internal reflection, or FTIR, where infrared is reflected out of a planar surface and picked up by a camera similar to the one in the Wiimote.

Anyway, costs so far:

Wiimote: ~£28; old infrared remote control for filters & LEDs: free; Philips bSure XG2 projector: ~£180; Philips SPC900NC webcam: ~£30; 4.3mm CCTV lens (no IR filter): ~$12.

I’ve been having trouble making the Bluetooth pairing for the Wiimote work correctly under OS X 10.3.9 – I think it’s about time I had the laptop upgraded (it’s work’s, after all) and that should fix things on the OS X side. I have had some success, though: this evening, under Ubuntu with the BlueZ stack and libwiimote, I was able to capture events from the Wiimote, including spots from the IR camera. I’ve also been successful using CamStream with the SPC900NC and CCTV lens to capture spots from working TV remotes, both directly and reflected from a wall – it’s surprisingly effective!

More to come. Next, for the Wiimote interface, I need to build my whiteboard-marker, battery-driven IR LED pen. For the FTIR display I need to experiment with a few different types of perspex and rear-projection material. I *really* want to be able to perform pattern recognition similar to the reactable, and I don’t think tracing paper will work for rear-projection. Knowing next to nothing about plastics technology, I think I’d like to try frosted acrylic first, or maybe just finely-sanded regular acrylic. eBay, here I come again!

Development Communications

For a while now – more or less since I switched teams (from Core Web to Sequencing Informatics) – I’ve wanted to write more about the work we do at Sanger. So much of it is absolutely cutting-edge research, and a very large proportion of that is poorly communicated both inside and outside the institute. Most of it’s biology of course, which I know little about and couldn’t discuss in detail, GCSE being the furthest I took things in that direction.

However some of the great advances have been in big IT. We’re in the same ballpark as CERN’s high-energy physics and NASA’s astronomical data. Technology is something I understand and can talk about here.

So… I run the new sequencing technology pipeline development team. This means my team and I are responsible for ensuring efficient use of the Sanger’s heavy investment in massively parallel sequencing instruments, primarily 28 Illumina Genome Analyzers. To do this we have a farm of 608 cores – a mix of 4- and 8-core Opteron blades with 8GB RAM – and a 320TB shared Lustre filesystem. It seems to be becoming easy for users and administrators at Sanger to toss these figures around, but the truth of the matter is that whilst this kit fits in only a handful of racks, it’s still a pretty big deal.

The blades run Linux – Debian Etch, to be precise. The Illumina-distributed analysis pipeline (itself a mix of Perl, Python and C++) is held together with Perl applications (web and batch) which also cooperate RESTfully with a series of Rails LIMS applications developed by the Production Software team.

Roughly a terabyte of image data is spun off each of the 28 instruments every 2-3 days. The images are stacked and aligned, and sequences are basecalled from spot intensities. These short reads are then packaged up with quality values for each base and dropped into compressed result files of approximately 100MB, ready for further secondary analysis (e.g. SNP calling).

More to come later but for now the take-home message is that the setup we’re using is in my opinion a fair triumph, and definitely one to be proud of. It’s been a (fairly) harmonious marriage of tremendous hardware savvy from the systems group and the rapid turnaround of agile software development from Sequencing Informatics, of which I’m pleased to be a part.

The Importance of Profiling

I’ve worked as a software developer, and with teams of software developers, for around 10 years now. Many of those I’ve worked with have earned my trust and respect in relation to development and testing techniques. Frustratingly, however, I still hear, with irritating regularity, throw-away comments born of uncertainty and ignorance.

A couple of times now I’ve specifically been told that “GD makes my code go slow”. For those of you not in the know, GD (specifically Lincoln Stein’s GD.pm in Perl) is a wrapper around Tom Boutell’s most marvellous libgd graphics library. The combination of the two has always performed excellently for me and has never been the bottleneck in any of my applications. The applications in question are usually database-backed web applications with graphics components for plotting genomic features, or charts of one sort or another.

As any database-application developer will tell you, the database, or network connection to the database is almost always the bottleneck in an application or service. Great efforts are made to ensure database services scale well and perform as efficiently as possible, but even after these improvements are made they usually simply delay the inevitable.

Hence my frustration when I hear that “GD is making my (database) application go slow”. How? Where? Why? Where’s the proof? It’s no use blaming something, a library in this case, that’s out of your control. It’s hard to believe a claim like that without some sort of measurement.

So, before pointing the finger, profile the code and make an effort to understand what the profiler is telling you. In database applications, profile your queries: use EXPLAIN, add indices, record SQL transcripts and time the results. Then profile the code which is manipulating those results.
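For the code side, Perl’s core Benchmark module is often enough to settle a “which part is slow” argument before reaching for a full profiler. A minimal sketch – the two competing subs here are illustrative stand-ins, not code from any real application:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two hypothetical ways of building the same string. Substitute your
# real suspects (the GD calls, the result-set loop) to see where the
# time actually goes, instead of guessing.
cmpthese(50_000, {
  concat => sub { my $s = q(); $s .= $_ for 1..100; return $s; },
  join   => sub { return join q(), 1..100; },
});
```

cmpthese() runs each sub the given number of times and prints a rate-comparison table, which is usually all the evidence a “library X is slow” argument needs.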

Once the results are in, concentrate in the first instance on the parts with the most impact – the low-hanging fruit (e.g. shaving 0.1 seconds off each iteration of a 1000x loop rather than 1 second from /int main/). Good programmers should be relatively lazy, and speeding up code with the least amount of effort should be common sense.

Great pieces of code

A lot of what I do day-to-day is related to optimisation. Be it Perl code, SQL queries, Javascript or HTML there are usually at least a couple of cracking examples I find every week. On Friday I came across this:

SELECT cycle FROM goldcrest WHERE id_run = ?

This query is being used to find the latest cycle number (between 1 and 37 for each id_run) in a near-real-time tracking system, and is run several times whenever a run report is viewed.

EXPLAIN SELECT cycle FROM goldcrest WHERE id_run = 231;
  
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+
| id | select_type | table     | type | possible_keys | key     | key_len | ref   | rows   | Extra       |
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+
|  1 | SIMPLE      | goldcrest | ref  | g_idrun       | g_idrun |       8 | const | 262792 | Using where |
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+

In itself this would be fine, but the goldcrest table in this instance contains several hundred thousand rows for each id_run. So for id_run 231, say, this query happens to return approximately 588,000 rows just to determine that the latest cycle for run 231 is number 34.

To clean this up we first try something like this:

SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = ?

which still scans the 588,000 rows (keyed on id_run, incidentally) but doesn’t actually return them to the client – only one row containing both values we’re interested in. Fair enough: the CPU and disk-access penalties are similar, but the data-transfer penalty is significantly improved.

Next I try adding an index against the id_run and cycle columns:

ALTER TABLE goldcrest ADD INDEX(id_run,cycle);
Query OK, 37589514 rows affected (23 min 6.17 sec)
Records: 37589514  Duplicates: 0  Warnings: 0

Now this of course takes a long time and, because the tuples are fairly redundant, creates a relatively inefficient index, also penalising future INSERTs. However, casually ignoring those facts, our query performance is now radically different:

EXPLAIN SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = 231;
  
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL |    NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
  
SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = 231;
+------------+------------+
| MIN(cycle) | MAX(cycle) |
+------------+------------+
|          1 |         37 |
+------------+------------+
  
1 row in set (0.01 sec)

That looks a lot better to me now!

Generally I try to steer clear of the mysterious internal workings of database engines, but I come across examples like the following with much greater frequency:

sub clone_type {
  my ($self, $clone_type, $clone) = @_;
  my %clone_type;

  if($clone and $clone_type) {
    $clone_type{$clone} = $clone_type;
    return $clone_type{$clone};
  }

  return;
}

Thankfully this one’s pretty quick to figure out – they’re usually *much* more convoluted – but still… huh?

Pass in a clone_type scalar, create a local hash with the same name (Argh!), store the clone_type scalar in the hash keyed at position $clone, then return the same value we just stored.

I don’t get it… maybe a global hash or something else would make sense, but this works out the same:

sub clone_type {
  my ($self, $clone_type, $clone) = @_;

  if($clone and $clone_type) {
    return $clone_type;
  }
  return;
}

and I’m still not sure why you’d want to do that if you have the values on the way in already.

Programmers really need to think around the problem, not just through it. Thinking through may result in functionality, but thinking around results in both function and performance, which means a whole lot more in my book – and, incidentally, is why it seems so hard to hire good programmers.

OpenWRT WDS Bridging

I’ve had a pile of kit to configure recently for an office I’ve been setting up. Amongst the units I specified was the second Linksys WRT54GL I’ve had the opportunity to play with.

My own one runs White Russian, but this time I took the plunge and went with the latest Kamikaze 7.09 release. It’s a little different to what I’d fiddled with before, but probably more intuitive to configure, with files rather than nvram variables. I’m briefly going to describe how to configure a wired switch bridged to the wireless network, running WDS to the main site router (which serves DHCP and DNS).

From a freshly unpacked WRT54GL, connect the ethernet WAN uplink to your internet connection and one of the LAN downlinks to a usable computer. By default the WRT DHCPs the WAN connection and serves DHCP on the 192.168.1 subnet to its LAN.

Download the firmware to the computer, then log in to the WRT on 192.168.1.1 (default account admin/admin) and upload the image to the firmware-upgrade form. Wait for the upload to finish and the router to reboot.

Once it’s rebooted you may need to refresh the DHCP lease on the computer, but the default subnet range is the same, IIRC. Telnet to the router on the same address and log in as root, with no password. Changing the password enables the SSH service and disables telnet.

I personally prefer the X-Wrt interface with the Zephyr theme, so I install X-Wrt by editing /etc/ipkg.conf and appending “src X-Wrt http://downloads.x-wrt.org/xwrt/kamikaze/7.09/brcm-2.4/packages”. Back in the shell, run ipkg update ; ipkg install webif . Once that completes you should be able to browse to the router’s address (hopefully still 192.168.1.1) and continue the configuration there. You may also wish to install matrixtunnel for SSL support in the web administration interface.
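For reference, the whole X-Wrt step condensed into one fragment (the feed URL is as above; matrixtunnel is optional):

```
# append to /etc/ipkg.conf:
src X-Wrt http://downloads.x-wrt.org/xwrt/kamikaze/7.09/brcm-2.4/packages

# then, in the shell on the router:
ipkg update
ipkg install webif
ipkg install matrixtunnel   # optional: SSL for the web interface
```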

I want to use this WRT both to extend the coverage of my client’s office wireless network and to connect a handful of wired devices (1 PC, 1 Edgestore NAS and a NSLU2).

So step one is to assign the router a LAN address on my existing network. The WAN port is going to be ignored (although bridging that in as well is probably possible). In X-Wrt, under Networks, I set a static IP of 192.168.1.253, a netmask of 255.255.255.0 and a gateway (router) of 192.168.1.254 – the existing main router, a BT Home Hub serving the LAN, whose wireless we’ll be bridging to. The LAN connection type is bridged. DNS in this case is the same as the main router. I’ve left the WAN as DHCP for convenience, though the plan is not to use it. Save the settings and apply.

Under Wireless, turn the radio on and set the channel to match the main router. Choose lan as the network to bridge the wireless to, set the mode to Access Point, WDS on, ESSID broadcast to your personal preference (I set it on) and AP isolation off. The ESSID itself needs to be the existing name of your network, with encryption set appropriately to match. Save and apply.

Now the magic bit: the WRT needs to know which existing AP to bridge to. I’m told the main AP’s MAC address should go in the BSSID box, which only seems to be present when the mode is set to WDS. Under the hood it’s done using the command wlc wds <main-ap-mac-address>, so in the absence of an appropriate text box it’s always possible to fiddle with the startup file instead. It’s a hack for sure, but it seems to work OK for me!

Lo! A WDS bridge.

Update 2007-01-07: After installing the bridge on-site I had to reconfigure it in Client mode using the regular WDS settings, as that seemed to be the only way to make it communicate with the Home Hub. A pity – that way it doesn’t extend the wireless range, it just hooks up anything wired to it. It worked fine when I set it up talking to my own WRT.

What can Bioinformatics learn from YouTube?

I caught Matt’s talk this morning at the weekly informatics group meeting.

There were general murmurings of agreement amongst the audience, but nobody asked the probing questions I’d hoped for as a measure of interest.

Matt touched upon microformats in all but name – I was really expecting a pitch for http://bioformats.org/ , websites-as-APIs and RESTful web services in particular.

Whilst I’m inclined to agree that standardised, discoverable, reusable web services are largely the way forward (especially as it keeps me in work) I’m not wholly convinced they remove the problems associated with, for example, database connections, database-engine specific SQL, hostnames, ports, accounts etc.

My feeling is that all the problems associated with keeping track of your database credentials are replaced by a different set of problems, albeit ones more standardised in terms of network protocols, i.e. HTTP and REST/CRUD. We now run the risk that what’s fixed at the network-protocol level is pushed higher up the stack, manifesting as myriad web services, all different. All these new websites and services use different XML structures and different URL schemes; the XML structures are analogous to database table schemata, and the URL schemes akin to table or object names.

At least these entities are now discoverable by the end user or developer simply by using the web application – and there’s the big win: transparency and discoverability. There’s also the whole microformat affair – once those really start to take off there’ll be all sorts of arguments about what goes into them, especially in domains like Bio and Chem which aren’t covered by core formats like hCard. But that’s something for another day.

More over at Green Is Good

7 utilities for improving application quality in Perl

I’d like to share with you a list of what are probably my top utilities for improving code quality (style, documentation, testing) with a largely Perl flavour. In loosely important-but-dull to exciting-and-weird order…

Test::More. Billed as yet another framework for writing test scripts, Test::More extends Test::Simple and provides a bunch of useful methods beyond Simple’s ok(). The ones I use most are use_ok() for testing compilation, is() for testing equality and like() for testing similarity with regexes.
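A minimal script exercising those three, using only core modules so it should run anywhere:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 3;       # declare the plan up front
use List::Util ();               # core module, used as the test subject

use_ok('List::Util');            # does the module compile & load?
is(List::Util::sum(2, 2), 4,     # exact equality
   'sum() adds up');
like(sprintf('%.2f', 22/7),      # regex similarity
     qr/^3[.]14/, 'pi-ish');
```

Run it with prove and you get the familiar ok/not ok TAP output plus a plan check at the end.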

ExtUtils::MakeMaker. Another of Mike Schwern’s babies, MakeMaker is used to set up a folder structure and the associated ‘make’ paraphernalia when first embarking on writing a module or application. Although developers these days tend to favour Module::Build over MakeMaker, I prefer the latter for some reason (probably fear of change) and still get regular mileage out of it.

Test::Pod::Coverage – what a great module! It checks how good your documentation coverage is with respect to the code. No, just a subroutine header won’t do! I tend to use Test::Pod::Coverage as part of…

Test::Distribution. Automatically runs a battery of standard tests, including POD coverage, manifest integrity, straight compilation and a load of other important things.

perlcritic, Test::Perl::Critic. The Perl::Critic set of tools is amazing. It’s built on PPI and implements the Perl Best Practices book by Damian Conway. Now, I realise that not everyone agrees with a lot of what Damian says, but the point is that it represents a standard to work to (and it’s not that bad once you’re used to it). Since I discovered perlcritic I’ve been developing all my code as close to perlcritic -1 (the most severe level) as I can. It has almost instantly made my applications more readable through consistent appearance, and made faults easier to spot even before Test::Perl::Critic comes in.
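Wiring this into the test suite is a one-file affair – something like the following, per the Test::Perl::Critic documentation (severity 1 being the most severe):

```perl
#!/usr/bin/perl
# t/00-critic.t – fail the build on any Perl::Critic policy violation
use strict;
use warnings;
use Test::Perl::Critic (-severity => 1);

# With no arguments this criticises everything under blib/ (or lib/),
# emitting one TAP test per file.
all_critic_ok();
```

Drop it in t/ alongside the rest of the suite and every ‘make test’ becomes a style gate as well.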

Devel::Cover. I’m almost ashamed to say I only discovered this last week, after dipping into Ian Langworth and chromatic’s book ‘Perl Testing’. Devel::Cover gives code-exercise metrics, i.e. how much of your module or application was actually executed by a given test. It collates stats from all modules matching a user-specified pattern and dumps them out in a natty coloured table, very suitable for tying into your CI system.
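A typical invocation, following the Devel::Cover documentation (your test layout may differ):

```
cover -delete                       # clear results from any previous run
PERL5OPT=-MDevel::Cover prove t/    # run the test suite with coverage enabled
cover                               # collate and report the coverage table
```

The final cover run is what produces the coloured statement/branch/sub/POD table, and its HTML output is the bit worth wiring into CI.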

Selenium. OK, not strictly speaking a tool I’m using right this minute, but it’s next on my list of integration tools. Selenium is a non-interactive, automated browser-testing framework written in JavaScript. This tool definitely has legs, and it seems to have come a long way since I first found it in the middle of 2006. I’m hoping to have automated interface testing up and running before the end of the year as part of the Perl CI system I’m planning to put together for the new sequencing pipeline.

Hiring Perl Developers – how hard can it be?

All the roles I’ve had during my time at Sanger have more or less required the development of production quality Perl code, usually OO and increasingly using MVC patterns. Why is it then that very nearly every Perl developer I’ve interviewed in the past 8 years is woefully lacking, specifically in OO Perl but more generally in half-decent programming skills?

It’s been astonishing, not in a good way, how many have been unable to demonstrate use of hashes. Some have been too scared of them (their words, not mine) and some have never felt the need. For those of you who aren’t Perl programmers, hashes (aka associative arrays) are a pretty crucial feature of the language and fundamental to its OO implementation.

Now I program in Perl sometimes more than 7-8 hours a day. For many years this also involved reworking other people’s code. I can very easily say that if you claim to be a Perl programmer and have never used hashes then you’re not going to get a Perl-related job because of your technical skills. With a good, interactive and engaging personality and a desire for self-improvement you might get away with it, but certainly not on technical merit.

It’s also quite worrying how many of these interviewees are unable to describe the basics of object-oriented programming yet have, for example, developed and sold a commercial ERP system, presumably for big bucks. Man, these people must have awesome marketing!

Frankly, a number of the bioinformaticians already working here have similar skills to the interviewees, and often worse communication skills, so maybe I’m simply setting my standards too high.

I really hope this situation improves when Perl 6 goes public though I’m sure it’ll take longer to become common parlance. As long as it happens before those smug RoR types take over the world I’ll be happy ;)

DECIPHERing Large-scale Copy-Number Variations

It’s strange: since moving from the core Web Team at Sanger to Sequencing Informatics I’ve been able to reduce my working hours from ~70-80/week all the way down to the 48.5 hours which are actually in my contract.

In theory this means I have more spare time, but in reality it means I’ve been able to secure sensible contract work outside Rentacoder, which I’ve relied on in the past.

The work in question is optimisation and refactoring for the DECIPHER project, the technical side of which I used to manage whilst in the web team.

DECIPHER is a database of large-scale copy number variations (CNVs) from patient arrayCGH data curated by clinicians and cytogeneticists around the world. DECIPHER represents one of the first clinical applications to come out of the HGP data from Sanger.

What’s exciting apart from the medical implications of DECIPHER’s joined-up thinking is that it also represents a valuable model for social, clinical applications in the Web 2.0 world. The application draws in data from various external sources as well as its own curated database. It primarily uses DAS via Bio::Das::Lite and Bio::Das::ProServer and I’m now working on improving interfaces, interactivity and speed by leveraging MVC and SOA techniques with ClearPress and Prototype.

It’s a great opportunity for me to keep contributing to one of my favourite projects and hopefully implement a load of really neat features I’ve wanted to add for a long time. Stay tuned…

VoIP peering & profits

So… shortly – from February next year, I believe, though I’m probably mistaken – prices in the UK go up for calling “Lo-Call” 0845 numbers. As I understand it they’ll be similar to, or the same as, 0870 rates at around 20p/min.

Now I wonder if the regulator has missed a trick here. It so happens that the nation is converting to broadband, be it ADSL or cable-based, and that very many of those broadband packages now come with VoIP offerings as standard.

My point is that these bundled broadband VoIP packages invariably come with 0845 dial-in numbers and no other choice. Dialing out via your broadband ISP may well be cheap for you but spare a thought for those calling in at much higher rates.

Having been tinkering with VoIP for a good few years, I realised that actually this should be OK, because calling VoIP-to-VoIP should be free, right? Wrong. Most of these ISPs don’t peer with each others’ networks, for two main reasons as far as I can see:

  1. They’re competitors and have little business reason to peer, apart from keeping the small proportion of aware customers happy.
  2. These ISPs make profits from users dialling in – 0845 is a revenue-sharing prefix in which both BT and the ISP in question have a stake. This old story is of course also true of many telephone help-desks and the like. Keeping the customer on the line longer means more profits for the company and its shareholders.

It seems to me that the world could be a better, more communicative place through more thorough VoIP network peering but I simply can’t see it becoming widespread whilst profits stand in the way.