One concept, or data-structure paradigm if you will, which I’ve seen many, many times is the tree. Whatever sort of tree it is, binary or otherwise, it’s a hierarchy of some sort.
Tree structures are very common in everyday life. Unfortunately in most programming languages they’re pretty clunky to code up. This, coupled with their inherent inflexibility, is why I think they don’t belong as a core part of a system which is required to scale horizontally.
I believe this because history, experience and current best practice dictate that to scale well you define your scale-units to be entities which are as distinct from one another as possible. This separation allows the system to treat them largely as separate units in an embarrassingly-parallel fashion, especially for functions like the popular MapReduce.
Tree structures simply don’t fit into this paradigm. With any sort of hierarchy you end up binding each entity tightly to its parent, which in turn is bound tightly to each of its children. Think about what needs to happen if your tree is spread across many servers and you want to change a property of the tree, like the name of a node, or worse, delete a node and shuffle some of its children around. The complexity of this sort of manoeuvre is why trees don’t belong as primary keys of scalable systems. It’s also one of the reasons why document-store databases like CouchDB are linear, key:value stores.
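To make the renaming problem concrete, here’s a rough Python sketch. The two stores, their keys and the helper names are all made up for illustration; this is not how CouchDB or any particular system lays out its data. It simply counts how many records have to change when a node is renamed under each scheme.

```python
# Hypothetical illustration: the same three documents stored two ways.

# 1) Hierarchical keys: each key embeds the full path of its ancestors.
tree_store = {
    "/projects": {"title": "Projects"},
    "/projects/blast": {"title": "BLAST server"},
    "/projects/blast/docs": {"title": "BLAST docs"},
}

# 2) Flat keys: opaque ids; the parent link is just another property.
flat_store = {
    "n1": {"name": "projects", "parent": None},
    "n2": {"name": "blast", "parent": "n1"},
    "n3": {"name": "docs", "parent": "n2"},
}


def rename_in_tree(store, old_prefix, new_prefix):
    """Rename a node when paths are keys: every descendant key changes too."""
    renamed = {}
    touched = 0
    for key, value in store.items():
        if key == old_prefix or key.startswith(old_prefix + "/"):
            key = new_prefix + key[len(old_prefix):]
            touched += 1
        renamed[key] = value
    return renamed, touched


def rename_in_flat(store, node_id, new_name):
    """Rename a node when keys are opaque ids: exactly one record changes."""
    store[node_id]["name"] = new_name
    return 1


tree_store, tree_touched = rename_in_tree(
    tree_store, "/projects/blast", "/projects/wublast")
flat_touched = rename_in_flat(flat_store, "n2", "wublast")
print(tree_touched, flat_touched)  # 2 1
```

With three records the difference is trivial. With a deep subtree spread over many servers, every one of those touched keys may live on a different machine.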
I’ve wanted to write this post for a long time but only recently have I been made frustrated enough to do so.
So.. some background.
When I worked at the Sanger Institute I ran the web team there. This was a team with three main roles –
Make sure the website was up
Internal software development for projects without dedicated informatics support
Support for projects with dedicated informatics
When I started, back in 1999, things were pretty disorganised but in terms of user requirements actually a little easier – projects had the odd CGI script but most data were shipped out as file dumps on the FTP site. Back then, and for a few years before, it was the dawn of the world-wide-web and web users were much happier being faced with an FTP/gopher file listing of .gz (or more likely, uncompressed .fasta) files to download.
Back then we had a couple of small DEC servers which ran the external- and internal- (intranet) websites. Fine. Well, fine that is, until you want to make a change.
Revision Control: Manual
Ok. You want to make a change. You take your nph-blast_server.cgi and make a copy, nph-blast_server2.cgi. You make your changes and test them on the external website. Great! It works! You mail a collaborator across the pond to try it out for bugs. Fab! Nothing found. Ok, so copy it back over nph-blast_server.cgi and everyone’s happy.
What’s wrong with this picture? Well, you remember that development copy? Firstly, it’s still there. You just multiplied your attack-vectors by two (assuming there are bugs in the script capable of being exploited). Secondly, and this is more harmful to long-term maintenance, that development copy is the URL you mailed your collaborator. It’s also the URL your collaborator mailed around to his 20-strong informatics team and they posted on bulletin boards and USENET groups for the rest of the world.
Luckily you have a dedicated and talented web-team who sort out this chaos using a pile of server redirects. Phew! Saved.
Now multiply this problem by the 150-or-so dedicated informatics developers on campus serving content through the core servers. Take that number and multiply it by the number of CGI scripts each developer produces a month.
That is then the number of server redirects which every incoming web request has to be checked against before it reaches its target page. Things can become pretty slow.
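As a back-of-an-envelope sketch of why that hurts, here’s some Python. The figure of 10 scripts per developer and the URL pattern are invented for the example, and a real web server’s rule matching is more involved than a string comparison. An ordered rule list means a request for a perfectly ordinary page is compared against every redirect before it is served, whereas a keyed lookup costs the same however many rules there are.

```python
# Hypothetical redirect table: 150 developers x 10 scripts each.
redirects = [("/cgi-bin/dev%d/script%d_v2.cgi" % (d, s),
              "/cgi-bin/dev%d/script%d.cgi" % (d, s))
             for d in range(150) for s in range(10)]


def resolve_linear(path):
    """Check the path against each rule in turn, as an ordered rule list does."""
    checks = 0
    for old, new in redirects:
        checks += 1
        if path == old:
            return new, checks
    return path, checks


# The same rules keyed by the old URL.
redirect_map = dict(redirects)


def resolve_hashed(path):
    """Single hash lookup, regardless of how many rules exist."""
    return redirect_map.get(path, path)


target, checks = resolve_linear("/index.html")
print(target, checks)  # /index.html 1500
```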
Enter the development (staging) service
What happens next is that the web support guys do something radical. They persuade all the web developers on site by hook or by crook that they shouldn’t be editing content on the live, production, public servers. Instead they should use an internal (and for special cases, IP-restricted-external-access) development service, test their content before pushing it live, then use a special command, let’s call it webpublish, to push everything live.
Now to the enlightened developer of today that doesn’t sound radical, it just sounds like common sense. You should have heard the wailing and gnashing of teeth!
Shared development
At this point I could, should go into the whys and wherefores of using revision control, but I’ll save that for another post. Instead I want to focus on the drawbacks of sharing. My feeling is that the scenario above is a fairly common one where there are many authors working on the same site. It works really well for static content, even when a CMS is used. Unfortunately it’s not so great for software development. The simple fact is that requirements diverge – both for the project and for the software stack. These disparate teams only converge in that they’re running on the same hardware, so why should the support team expect their software requirements to converge also?
Allow me to illustrate one of the problems.
Projects A and B are hosted on the same server. They use the same centrally-supported library L. A, B and L each have a version. They all work happily together at version A1B1L1. Now B needs a new feature, but to add it requires an upgrade to L2. Unfortunately the L2 upgrade breaks A1. Project A therefore is obliged to undertake additional (usually unforeseen) work just to retain current functionality.
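The same situation can be written down as a tiny compatibility table. This is only a sketch: the B2 release name and the WORKS_WITH table are hypothetical, standing in for whatever the projects have actually been tested against.

```python
# Hypothetical compatibility table: which versions of the shared library L
# each project release is known to work with.
WORKS_WITH = {
    "A1": {"L1"},
    "B1": {"L1"},
    "B2": {"L2"},  # B's new feature needs L2
}


def shared_library_choices(releases):
    """Return the L versions that every listed release can run against."""
    choices = None
    for release in releases:
        supported = WORKS_WITH[release]
        choices = supported if choices is None else choices & supported
    return choices or set()


print(shared_library_choices(["A1", "B1"]))  # {'L1'}
print(shared_library_choices(["A1", "B2"]))  # set()
```

An empty result means no single installed copy of L can serve both projects on that server: either A does unplanned work to support L2, or B waits.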
Another situation is less subtle and involves shared-user access. For developers this is most likely the root superuser although in my opinion any shared account is equally bad. When using a common user it’s very difficult to know who made a change in the past, let alone who’s making a change right now. I observed a situation recently where two developers were simultaneously trying to build RPMs with rpmbuild which, by default, builds in a system location under /usr/src (/usr/src/redhat on Red Hat-style systems). Simultaneously trying to access the same folders leads to very unpredictable, unrepeatable results. Arguably the worst situation is when no errors are thrown during the build and neither developer notices!
Naturally a lot of the same arguments against shared development go for shared production too. The support matrix simply explodes with a few tens of applications each with different prerequisites.
Other options
Back in the day there were fewer options – one was left with always having to use relative paths and often having to discard all but the core system prerequisites in fear of them changing unexpectedly over time. Using relative paths is still a fairly inexpensive way to do things but sometimes it’s just too restrictive. There is another way…
Virtualisation is now commonplace. You probably cross-paths with a virtual machine every day without knowing it. They’re ubiquitous because they’re really, really useful. For our development purposes one core support member can build a standard, supported virtual machine image and post it on the intranet somewhere. All the other developers can take it, start their own instances of it and do all of their own development on their own hardware without fighting for common resources. Upgrades can be tested independently of one another. Machines can be restarted from scratch and so on. Once development is complete and given sufficient core resources, each developer can even bundle up their working image and ship it into production as is. No further core support required!
What tools can you use to do this? Parallels? Too commercial. VMware? A bit lardy. Xen? Probably a bit too hard-core. KVM? Not quite mature enough yet. No, my current favourite in the virtualisation stakes is VirtualBox. Cross-platform and free. Works great with Ubuntu inside. A killer combination capable of solving many of these sorts of problems.
So.. Sorry as I am to say it, I’ve decided to take the plunge and switch this blog from the home-grown one over to WordPress, which is hard to beat for support, features and plugins.
With any luck it’ll mean this thing is updated much more frequently but I can’t say I’m happy using a PHP-based solution. Still – best tool for the job and all that. It’s a pity Krang or MT weren’t as easy & simple to set up.
I still have all the old entries deep in the bowels of my database but I need to spend a little time to port them over to WP. I figure it’s better to have the new system up and running first rather than hold it back until everything’s ready – iterative agile development and all that. Yum!
I had an interesting problem this morning with the Apache forward-proxy supporting the WTSI sequencing farm.
It would be useful for the intranet service for tracking runs to know which (GA2) sequencer is requesting pages but because they’re on a dedicated subnet they have to use a forward-proxy for fetching pages (and then only from intranet services).
Now I’m very familiar with using the X-Forwarded-For header and HTTP_X_FORWARDED_FOR environment variable (and their friends), which do something very similar for reverse-proxies, but forward-proxies usually want to disguise the fact there’s an arbitrary number of clients behind them, usually with irrelevant RFC1918 private IP addresses too.
So what I want to do is slightly unusual – take the remote_addr of the client and stuff it into a different header. I could use X-Forwarded-For but it doesn’t feel right. Proxy-Via is also not right here as that’s really for the proxy servers themselves. So, I figured mod_headers on the proxy would allow me to add additional headers to the request, even though it’s forwarded on. Also following a tip I saw here using my favourite mod_rewrite, and after a bit of fiddling, I came up with this:
#########
# copy remote addr to an internal variable
#
RewriteEngine On
RewriteCond %{REMOTE_ADDR} (.*)
RewriteRule .* - [E=SEQ_ADDR:%1]
#########
# set X-Sequencer header from the internal variable
#
RequestHeader set X-Sequencer %{SEQ_ADDR}e
These rules sit in the container managing my proxy, after ProxyRequests and ProxyVia and before a small set of ProxyMatch restrictions.
The RewriteCond traps the contents of the REMOTE_ADDR environment variable (it’s not an HTTP header – it comes from the end of the network socket as determined by the server). The RewriteRule unconditionally copies the last RewriteCond match %1 into a new environment variable SEQ_ADDR. After this mod_headers sets the X-Sequencer request header (for the proxied request) to the value of the SEQ_ADDR environment variable.
This works very nicely though I’d have hoped a more elegant solution would be this:
RequestHeader set X-Sequencer %{REMOTE_ADDR}e
but this doesn’t seem to work and I’m not sure why. Anyway, by comparing $ENV{HTTP_X_SEQUENCER} to a shared lookup table, the sequencing apps running on the intranet can now track which sequencer is making requests. Yay!
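For the receiving end, here’s a minimal sketch of that lookup. Our apps do it in Perl via $ENV{HTTP_X_SEQUENCER}; this version is Python/WSGI purely for illustration, and the addresses, instrument names and function names are all invented. It also assumes the app is only reachable through the proxy, since any client could otherwise set the header itself.

```python
# Hypothetical lookup table mapping sequencer IP addresses to instrument names.
SEQUENCERS = {
    "10.0.0.11": "GA2-01",
    "10.0.0.12": "GA2-02",
}


def sequencer_for(environ):
    """Return the instrument name for a request, or None if unknown.

    CGI and WSGI both expose the X-Sequencer request header as
    HTTP_X_SEQUENCER in the request environment.
    """
    address = environ.get("HTTP_X_SEQUENCER", "").strip()
    return SEQUENCERS.get(address)


def app(environ, start_response):
    """Minimal WSGI application that reports which sequencer is calling."""
    name = sequencer_for(environ) or "unknown"
    body = ("request from sequencer: %s\n" % name).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```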
For some time now at Sanger we’ve been looking at the problems and solutions involved with building services supporting what are likely to become some of the biggest databases on the planet. The biggest problem is that there aren’t many people doing this kind of thing who are also willing to talk about it.
The data we’re storing fall into two categories: Short Read Format (SRF) files containing sequence, quality and trace data (~10GB per lane), and FastQ files containing sequence and quality (~1GB per lane).
Our requirements for these data are fundamentally for two different systems. One is a long-term archival system for SRF, the responsibility for which will eventually be shifted to the EBI. The second is, for me at least, the more interesting system.
The short-term storage of reads and qualities (and possibly also for selected alignments) isn’t the biggest problem – that honour is left to the fast, parallel retrieval of the same. The underlying data store needs to grow at a respectable 12TB per year and serve maybe a hundred simultaneous users requesting up to 1000 sequences per second.
Transfer times for reads are small but as a result are disproportionately affected by artefacts like TCP setup times, HTTP header payloads and certainly index seek times.
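A quick bit of arithmetic shows why the overheads matter more than the raw bandwidth. This is a sketch only: I’m assuming decimal terabytes and a made-up figure of 100 bytes per read record (sequence plus qualities), neither of which is a measured number.

```python
TB = 10**12  # decimal terabyte; an assumption (TB rather than TiB)

growth_per_year = 12 * TB
seconds_per_year = 365 * 24 * 3600

ingest_rate = growth_per_year / seconds_per_year  # bytes per second, averaged
daily_growth = growth_per_year / 365              # bytes per day

# Hypothetical record size: sequence plus qualities for one short read.
ASSUMED_BYTES_PER_READ = 100
peak_reads_per_second = 1000
retrieval_rate = peak_reads_per_second * ASSUMED_BYTES_PER_READ

print("average ingest: %.0f kB/s" % (ingest_rate / 1e3))
print("daily growth:   %.1f GB" % (daily_growth / 1e9))
print("peak retrieval: %.0f kB/s" % (retrieval_rate / 1e3))
```

Under those assumptions both rates come out at a few hundred kilobytes per second, which is nothing for a modern network. The cost is in the thousand separate lookups per second, each paying for connection setup, headers and an index seek.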
We’re looking at a few horizontally-scaling solutions for performing these kinds of jobs – the most obvious are tools like MapReduce and equivalents like Hadoop running with Nutch. My personal favourite and the one I’m holding out for is MogileFS from the same people who brought you Memcached. Time to get benchmarking!
So, this evening, not wanting to spend more time on the computer (having been on it all day for day 2 of DB’s Rails course) I spent my time honing my long-unused soldering skills and constructing the first revision of my infrared marker pen for the JCL-special Wiimote Whiteboard.
[Photos: the raw materials; close-up of the LEDs I’m removing; the finished article; close-up of the switch detail; activated under the IR-sensitive digital camera]
I must say it’s turned out ok. I didn’t have any spare small switches so went for a bit of wire with enough springiness in it. On the opposite side of the makeshift switch is a retaining screw for holding the batteries in. I’m using two old AAA batteries (actually running about 2.4V according to the meter) and no resistor in series. The LED hasn’t burnt out yet!
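For the MkII it would be more sensible to work out a series resistor first. Here’s the sum as a sketch in Python: the 2.4V is the meter reading, but the forward voltage and target current are assumed typical-looking values, not taken from any datasheet, so check the real part before trusting the answer.

```python
def series_resistor(supply_volts, forward_volts, current_amps):
    """Ohm's law across the resistor: R = (V_supply - V_forward) / I."""
    if supply_volts <= forward_volts:
        raise ValueError("supply voltage must exceed the LED forward voltage")
    return (supply_volts - forward_volts) / current_amps


MEASURED_SUPPLY = 2.4    # volts, two tired AAA cells as measured
ASSUMED_FORWARD = 1.3    # volts, hypothetical IR LED forward voltage
ASSUMED_CURRENT = 0.050  # amps, hypothetical target current

ohms = series_resistor(MEASURED_SUPPLY, ASSUMED_FORWARD, ASSUMED_CURRENT)
print("series resistor: about %.0f ohms" % ohms)
```

Fresh cells would push the supply voltage up and the resistor value with it, which is probably why the LED is surviving on old ones.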
To stop the pen switching on when not in use I slip a bit of electrical tape between the contacts. Obviously you can’t tell when it’s on unless you put in another, perhaps miniature, indicator visible LED.
It all fits together quite nicely though the retaining screw is too close for the batteries and has forced the back end out a bit – that’s easy to fix.
As I’m of course after multitouch I’ll be building the MkII pen soon with the other recovered LED!
It seems to be the wrong time to be reading such things, but over on InfoQ there’s a nice article introducing web development of RESTful services using Erlang and the Yaws high performance web server.
I say “the wrong time” as this week has kicked off the “Advancing with Rails” course by David A. Black of Ruby Power and Light fame. The course is fairly advanced in terms of required rails knowledge so it’s a bit of a baptism by fire for me and a few others having never written any Ruby before.
Rails is proving moderately easy to pick up but as I’ve remarked to a couple of people, it doesn’t seem any easier coding with Rails than with Perl. Perhaps it’s because I’ve never done it before but I reckon it’s a lot harder spending my time working out what DHH intended something to do than it is just doing it myself.
Even though it’s nowhere near as mature, I do reckon my ClearPress framework has a lot going for it – it’s pretty feature-complete in terms of ORM, views and templating (TT2). It has similar convention-over-configuration features, meaning it’s not designed for plugging in other alternative layers, but it is absolutely possible to do (and I suspect without as much effort as is required in Rails). I still need to iron out some wrinkles in the autogenerated code from the application builder and provide some default authorisation and authentication mechanisms, some of which may come in the next release. But in the meantime it’s easy to add these features, which is exactly what we’ve done for the new sequencing run tracking app, NPG, to tie it to the WTSI website single sign-on (MySQL and LDAP under the hood).
For a few months now I’ve been watching utterly compelling and inspirational HCI things like these:
I know most of them are a bit dated now, in fact from as far back as 2006, but they’re still jaw-droppingly awesome.
So in a fit of inspiration and weekend project madness and frustration at the clumsiness of a regular touch-screen LCD I’ve been picking up things from Ebay and fishing around in my boxes of knackered electronics to find components suitable for assembling one or two of these sorts of devices.
There are two types of these interactive interfaces. The first is the JCL-style wiimote-based kind, which uses bright sources of infrared, either transmitted or reflected, and the bluetooth Nintendo controller. The second is the Jeff Han / Perceptive Pixel-style of frustrated total internal reflection, or FTIR, where infrared is reflected out of a planar surface and is picked up by a camera similar to the one in the wiimote.
Anyway, costs so far:
Wiimote: ~£28; old infrared remote control for filters & LEDs: free;
Philips bSure XG2 projector: ~£180; Philips SPC900NC: ~£30; 4.3mm CCTV lens (no IR filter): ~$12
I’ve been having trouble making the bluetooth pairing for the wiimote work correctly under OSX 10.3.9 – I think it’s about time I had the laptop upgraded – it’s work’s after all. I think that should fix it for OSX, but I have had some success – this evening under Ubuntu with the Bluez stack and libwiimote I’ve been able to capture events from the wiimote including spots using the IR camera. I’ve also been successful using camstream with the SPC900NC and CCTV lens to capture spots from working TV remotes, both directly and reflected from a wall – it’s surprisingly effective!
More to come – next with the wiimote interface I need to build my whiteboard-marker battery-driven IR LED pen. Next with the FTIR display I need to experiment with a few different types of perspex and rear-reflection material. I *really* want to be able to perform pattern recognition similar to the reactable and I don’t think tracing paper will work for rear-projection. Knowing next to nothing about plastics technology I think I’d like to try frosted acrylic first, or maybe just finely-sanded regular acrylic. Ebay here I come again!
For a while now, more or less since I switched teams (from Core Web to Sequencing Informatics) I’ve wanted to write more about the work we do at Sanger. There’s so much of it which is absolute cutting edge research and a very large proportion of that is poorly communicated both inside and outside the institute. Most of it’s biology of course, which I know little about, and couldn’t discuss in detail, GCSE being the furthest I took things in that direction.
However some of the great advances have been in big IT. We’re in the same ballpark as CERN’s high-energy physics and NASA’s astronomical data. Technology is something I understand and can talk about here.
So… I run the new sequencing technology pipeline development team. This means I and my team are responsible for ensuring efficient use of Sanger’s heavy investment in massively parallel sequencing instruments, primarily 28 Illumina Genome Analyzers. To do this we have a farm of 608 cores, a mix of 4- and 8-core Opteron blades with 8GB RAM and a 320TB shared Lustre filesystem. It seems to be becoming easy for users and administrators at Sanger to toss these figures around but the truth of the matter is that whilst this kit fits in only a handful of racks, it’s still a pretty big deal.
The blades run Linux, Debian Etch to be precise. The Illumina-distributed analysis pipeline (itself a mix of Perl, Python and C++) is held together with Perl applications (web and batch) which also cooperate RESTfully with a series of Rails LIMS applications developed by the Production Software team.
Roughly a terabyte of image data is spun off each of the 28 instruments every 2-3 days. The images are stacked and aligned and sequences are basecalled from spot intensities. These short reads are then packaged up with quality values for each base and dropped into compressed result files of approximately 100MB, ready for further secondary analysis (e.g. SNP-calling).
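To put those numbers side by side, here’s a rough calculation. It assumes every instrument is running all the time and that nothing is deleted from the 320TB filesystem, neither of which is how things really work, so treat the results as upper bounds rather than measurements.

```python
INSTRUMENTS = 28
TB_PER_RUN = 1.0       # "roughly a terabyte" of images per instrument per run
FAST_RUN_DAYS = 2
SLOW_RUN_DAYS = 3
LUSTRE_TB = 320        # the shared filesystem

# Aggregate image data per day if all instruments run back to back.
high_tb_per_day = INSTRUMENTS * TB_PER_RUN / FAST_RUN_DAYS
low_tb_per_day = INSTRUMENTS * TB_PER_RUN / SLOW_RUN_DAYS

# Days until the filesystem would fill if no images were ever removed.
days_to_fill_fast = LUSTRE_TB / high_tb_per_day
days_to_fill_slow = LUSTRE_TB / low_tb_per_day

print("image data: %.1f to %.1f TB per day" % (low_tb_per_day, high_tb_per_day))
print("filesystem full in %.0f to %.0f days" % (days_to_fill_fast, days_to_fill_slow))
```

On those assumptions the raw images alone would fill the filesystem within about a month, which is why the pipeline has to keep turning images into much smaller result files and moving on.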
More to come later but for now the take-home message is that the setup we’re using is in my opinion a fair triumph, and definitely one to be proud of. It’s been a (fairly) harmonious marriage of tremendous hardware savvy from the systems group and the rapid turnaround of agile software development from Sequencing Informatics, of which I’m pleased to be a part.
I’ve worked as a software developer and worked with teams of software developers for around 10 years now. Many of those whom I’ve worked with have earned my trust and respect in relation to development and testing techniques. Frustratingly however it’s still with irritating regularity that I hear throw-away comments born of uncertainty and ignorance.
A couple of times now I’ve specifically been told that “GD makes my code go slow”. Now for those of you not in the know GD (actually specifically Lincoln Stein’s GD.pm in perl) is a wrapper around Tom Boutell’s most marvellous libgd graphics library. The combination of these two has always performed excellently for me and never been the bottleneck in any of my applications. The applications in question are usually database-backed web applications with graphics components for plotting genomic features or charts of one sort or another.
As any database-application developer will tell you, the database, or network connection to the database is almost always the bottleneck in an application or service. Great efforts are made to ensure database services scale well and perform as efficiently as possible, but even after these improvements are made they usually simply delay the inevitable.
Hence my frustration when I hear that “GD is making my (database) application go slow”. How? Where? Why? Where’s the proof? It’s no use blaming something, a library in this case, that’s out of your control. It’s hard to believe a claim like that without some sort of measurement.
So.. before pointing the finger, profile the code and make an effort to understand what the profiler is doing. In database applications profile your queries – use EXPLAIN, add indices, record SQL transcripts and time the results. Then profile the code which is manipulating those results.
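As a sketch of what “profile the code” looks like in practice, here’s a toy in Python using the standard cProfile and pstats modules (in Perl you’d reach for one of the Devel:: profilers instead). The two functions are stand-ins I’ve made up: a sleep imitates a slow database round-trip and a sum imitates the drawing step.

```python
import cProfile
import io
import pstats
import time


def fetch_rows():
    """Stand-in for a database query; the sleep imitates network latency."""
    time.sleep(0.05)
    return list(range(1000))


def draw_chart(rows):
    """Stand-in for the graphics step (the part being blamed)."""
    return sum(r * r for r in rows)


def handle_request():
    return draw_chart(fetch_rows())


# Profile one request and collect per-function timings.
profiler = cProfile.Profile()
profiler.enable()
result = handle_request()
profiler.disable()

# Report the five entries with the largest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

In a toy like this the cumulative-time column should point straight at fetch_rows rather than draw_chart, which is the kind of evidence I’d want before anyone blames the graphics library.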
Once the results are in of course, concentrate in the first instance on the parts with the most impact (e.g. 0.1 second off each iteration of a 1000x loop rather than 1 second from /int main/ ) – the low hanging fruit. Good programmers should be relatively lazy and speeding up code with the least amount of effort should be commonsense.