Exa-, Peta-, Tera-scale Informatics: Are *YOU* in the cloud yet?


One of the aspects of my job over the last few years, both at Sanger and now at Oxford Nanopore Technologies has been the management of tera-, verging on peta- scale data on a daily basis.

Various methods of handling filesystems this large have been around for a while now and I won’t go into them here. Building these filesystems is actually fairly straightforward as most of them are implemented as regular, repeatable units – great for horizontal scale-out.

No, what makes this a difficult problem isn’t the sheer volume of data, it’s the amount of churn. Churn can be defined as the rate at which new files are added and old files are removed.

To illustrate – when I left Sanger, if memory serves, we were generally recording around a terabyte of new data a day. The staging area there was around 0.5 Petabytes (using the Lustre filesystem) but didn’t balance correctly across the many disks. This meant we had to keep the utilised space below around 90% for fear of filling up an individual storage unit (and leading to unexpected errors). Ok, so that’s 450TB. That left 45 days of storage – one and a half months assuming no slack.

Fair enough. Sort of. collect the data onto the staging area, analyse it there and shift it off. Well, that’s easier said than done – you can shift it off onto slower, cheaper storage but that’s generally archival space so ideally you only keep raw data. If the raw data are too big then you keep the primary analysis and ditch the raw. But there’s a problem with that:

  • lots of clever people want to squeeze as much interesting stuff out of the raw data as possible using new algorithms.
  • They also keep finding things wrong with the primary analyses and so want to go back and reanalyse.
  • Added to that there are often problems with the primary analysis pipeline (bleeding-edge software bugs etc.).
  • That’s not mentioning the fact that nobody ever wants to delete anything

As there’s little or no slack in the system, very often people are too busy to look at their own data as soon as it’s analysed so it might sit there broken for a week or four. What happens then is there’s a scrum for compute-resources so they can analyse everything before the remaining 2-weeks of staging storage is up. Then even if there are problems found it can be too late to go back and reanalyse because there’s a shortage of space for new runs and stopping the instruments running because you’re out of space is a definite no-no!

What the heck? Organisationally this isn’t cool at all. Situations like this are only going to worsen! The technologies are improving all the time – run-times are increasing, read-lengths are increasing, base-quality is increasing, analysis is becoming better and more instruments are becoming available to more people who are using them for more things. That’s a many, many-fold increase in storage requirements.

So how to fix it? Well I can think of at least one pretty good way. Don’t invest in on-site long-term staging- or scratch-storage. If you’re worried by all means sort out an awesome backup system but nearline it or offline to a decent tape archive or something and absolutely do not allow user-access. Instead of long-term staging storage buy your company the fattest Internet pipe it can handle. Invest in connectivity, then simply invest in cloud storage. There are enough providers out there now to make this a competitive and interesting marketplace with opportunities for economies of scale.

What does this give you? Well, many benefits – here are a few:

  • virtually unlimited storage
  • only pay for what you use
  • accountable costs – know exactly how much each project needs to invest
  • managed by storage experts
  • flexible computing attached to storage on-demand
  • no additional power overheads
  • no additional space overheads

Most of those I more-or-less take for granted these days. The one I find interesting at the moment is the costing issue. It can be pretty hard to hold one centralised storage area accountable for different groups – they’ll often pitch in for proportion of the whole based on their estimated use compared to everyone else. With accountable storage offered by the cloud each group can manage and pay for their own space. The costs are transparent to them and the responsibility has been delegated away from central management. I think that’s an extremely attractive prospect!

The biggest argument I hear against cloud storage & computing is that your top secret, private data is in someone else’s hands. Aside from my general dislike of secret data, these days I still don’t believe this is a good argument. There are enough methods for handling encryption and private networking that this pretty-much becomes a non-issue. Encrypt the data on-site, store the keys in your own internal database, ship the data to the cloud and when you need to run analysis fetch the appropriate keys over an encrypted link, decode the data on demand, re-encrypt the results and ship them back. Sure the encryption overheads add expense to the operation but I think the costs are far outweighed.

Bookmarks for January 26th through January 27th

These are my links for January 26th through January 27th:

Bookmarks for January 25th through January 26th

These are my links for January 25th through January 26th:

The Importance of Being Open

Openness => Leadership

Those of you who know me know I’m pretty keen on openness. The hypocrisy of using a Powerbook for my daily work and keeping an iPhone in my pocket continue to not be lost on me, but I want to spend a few lines on the power of openness.

Imagine if you will, a company developing a new laboratory device. The device is amazing and for early-access partners exceeds expectations in terms of performance but the early-access community feeling is that it’s capable of much more.

The company is keen to work with the community but wants to retain control for various reasons – most importantly in order to restrict their support matrix and reduce their long-term headaches. Fair enough. To this end all changes (and we’re talking principally about software changes here, but applies equally to hardware, chemistry etc. too) have to be emailed to the principal developer before they’re considered for incorporation into the product.

Let’s say the device was released to 20 sites worldwide for early-access. Now each one of those institutes is going to want to twiddle and tweak all the settings in order to make the device perform for their experiments.

So now you have community contributions from 20 different sites. Great! No.. wait.. what’s that? The principal developer’s on a tight schedule for the next major release? Oh.. the release is only incorporating changes the company wants to focus on for the mass market (e.g. packaging, polish, robustness). Hmm. Okay so the community patches are put back until next release… then the next…

Then something cool happens. The community sets up their own online exchanges to provide peer-to-peer support. Warranties and support be damned, they’ve managed to make the devices work twice as fast for half the cost of reagents. They’re also bypassing all the artificial delays imposed by the company, receiving updates as quickly as they’re released, and left with the choice of whether to deploy a particular patch or not.

So where’s the company left after all this? Well, the longer this inadvertent exclusion of community continues, the further the company will be left out in the cold. The worse their reputation will be as portrayed by their early-access clients too.

Surely this doesn’t happen? Of course it does. Every day. Companies still fear to engage Open Source as a valid business model and enterprising hackers bypass arbitrary, mostly pointless restrictions on all sorts of devices (*cough* DRM!) to make them work in ways the original manufacturer never intended.

So.. what’s the moral of this story? If you’re developing devices, be they physical or virtual, make them open. Give them simple, open APIs and good examples. Give them a public revision control system like git, or subversion. Self-documentation may sound like a cliché but there’s nothing like a good usage example (or unit/functional/integration test) to define how to utilise a service.

If your devices are good at what they do (caveat!) then because they’re open, they’ll proceed to dominate the market until something better comes along. The community will adopt them, support them and extend them and you’ll still have sales and core support. New users will want to make the investment simply because of what others have done with the platform.

Ok, so perhaps this is too naieve a standpoint. Big companies can use the weight of their “industry standards”, large existing customer-base, or even ease-of-use to bulldoze widespread adoption but the lack of openness doesn’t suit everybody and my feeling is that a lot of users, or would-be-users don’t appreciate being straight-jacketed into a one-size-fits-all solution. Open FTW!

On scaling data structures


One concept, or data-structure paradigm if you will, which I’ve seen many, many times is the tree. Whatever sort of tree it is – binary or otherwise, it’s a hierarchy of some sort.

Tree structures are very common in every day life. Unfortunately in most programming languages they’re pretty clunky to code up. This aspect, coupled with they’re inherent inflexibility is why I think they don’t belong as a core part of a system which is required to scale horizontally.

The reason I believe this to be the case is because history, experience and current best-practices dictate that to scale well you define your scale-units to be entities which are as distinct from one-another as possible. This separation allows the system to treat them largely as separate units in an embarrassingly-parallel fashion, especially for functions like the popular MapReduce.

Tree structures simply don’t fit in to this paradigm. With any sort of hierarchy you end up binding each entity tightly to its parent, which in turn is bound tightly to each of its children. Think about what needs to happen if your tree is spread across many servers and you want to change a property of the tree, like the name of a node, or worse, delete a node and shuffle some of its children around. The complexity of this sort of manoeuvre is why trees don’t belong as primary keys of scalable systems. It’s also one of the reasons why document-store-databases like CouchDB are linear, key:value stores.

Shared Development Environments; why they suck and what you can do about it


I’ve wanted to write this post for a long time but only recently have I been made frustrated enough to do so.

So.. some background.

When I worked at the Sanger Institute I ran the web team there. This was a team with three main roles –

  1. Make sure the website was up
  2. Internal software development for projects without dedicated informatics support
  3. Support for projects with dedicated informatics

When I started, back in 1999 things were pretty disorganised but in terms of user-requirements actually a little easier – projects had the odd CGI script but most data were shipped out using file dumps on the FTP site. You see back then and for the few years’ previous, it was the dawning of the world-wide-web and web-users were much happier being faced with an FTP/gopher file-listing of .gz (or more likely, uncompressed .fasta) files to download.

Back then we had a couple of small DEC servers which ran the external- and internal- (intranet) websites. Fine. Well, fine that is, until you want to make a change.

Revision Control: Manual

Ok. You want to make a change. You take your nph-blast_server.cgi and make a copy nph-blast_server2.cgi . You make your changes and test them on the external website. Great! It works! You mail a collaborator across the pond to try it out for bugs. Fab! Nothing found. Ok, so copy it back over nph-blast_server.cgi and everyone’s happy.

What’s wrong with this picture? Well, you remember that development copy? Firstly, it’s still there. You just multiplied your attack-vectors by two (assuming there are bugs in the script capable of being exploited). Secondly, and this is more harmful to long-term maintenance, that development copy is the URL you mailed your collaborator. It’s also the URL your collaborator mailed around to his 20-strong informatics team and they posted on bulletin boards and USENET groups for the rest of the world.

Luckily you have a dedicated and talented web-team who sort out this chaos using a pile of server redirects. Phew! Saved.

Now multiply this problem by the 150-or-so dedicated informatics developers on campus serving content through the core servers. Take that number and multiply it by the number of CGI scripts each developer produces a month.

That is then the number of server redirects which every incoming web request has to be checked against before it reaches its target page. Things can become pretty slow.

Enter the development (staging) service

What happens next is that the web support guys do something radical. They persuade all the web developers on site by hook or by crook that they shouldn’t be editing content on the live, production, public servers. Instead they should use an internal (and for special cases, IP-restricted-external-access) development service, test their content before pushing it live, then use a special command, let’s call it webpublish, to push everything live.

Now to the enlightened developer of today that doesn’t sound radical, it just sounds like common sense. You should have heard the wailing and gnashing of teeth!

Shared development

At this point I could, should go into the whys and wherefores of using revision control, but I’ll save that for another post. Instead I want to focus on the drawbacks of sharing. My feeling is that the scenario above is a fairly common one where there are many authors working on the same site. It works really well for static content, even when a CMS is used. Unfortunately it’s not so great for software development. The simple fact is that requirements diverge – both for the project and for the software stack. These disparate teams only converge in that they’re running on the same hardware, so why should the support team expect their software requirements to converge also?

Allow me to illustrate one of the problems.

Projects A and B are hosted on the same server. They use the same centrally-supported library L. A, B and L each have a version. They all work happily together at version A1B1L1. Now B needs a new feature, but to add it requires an upgrade to L2. Unfortunately the L2 upgrade breaks A1. Project A therefore is obliged to undertake additional (usually unforeseen) work just to retain current functionality.

Another situation is less subtle and involves shared-user access. For developers this is most likely the root superuser although in my opinion any shared account is equally bad. When using a common user it’s very difficult to know who made a change in the past, let alone who’s making a change right now. I observed a situation recently where two developers were simultaneously trying to build RPMs with rpmbuild which, by default, builds in a system location like /usr/share . Simultaneously trying to access the same folders leads to very unpredictable, unrepeatable results. Arguably the worst situation is when no errors are thrown during the build and neither developer notices!

Naturally a lot of the same arguments against shared development go for shared production too. The support matrix simply explodes with a few tens of applications each with different prerequisites.

Other options

Back in the day there were fewer options – one was left with always having to use relative paths and often having to discard all but the core system prerequisites in fear of them changing unexpectedly over time. Using relative paths is still a fairly inexpensive way to do things but sometimes it’s just too restrictive. There is another way…

Virtualisation is now commonplace. You probably cross-paths with a virtual machine every day without knowing it. They’re ubiquitous because they’re really, really useful. For our development purposes one core support member can build a standard, supported virtual machine image and post it on the intranet somewhere. All the other developers can take it, start their own instances of it and do all of their own development on their own hardware without fighting for common resources. Upgrades can be tested independently of one another. Machines can be restarted from scratch and so on. Once development is complete and given sufficient core resources, each developer can even bundle up their working image and ship it into production as is. No further core support required!

What tools can you use to do this? Parallels? Too commercial. VMWare? A bit lardy. Xen? Probably a bit too hard-core. KVM? Not quite mature enough yet. No, my current favourite in the virtualisation stakes is VirtualBox. Cross platform and free. Works great with Ubuntu inside. A killer combination capable of solving many of these sorts of problems.

Switching to WordPress

So.. Sorry as I am to say it, I’ve decided to take the plunge and switched this blog from the home-grown one over to wordpress, which is hard to beat for support – features & plugins.

With any luck it’ll mean this thing is updated much more frequently but I can’t say I’m happy using a PHP-based solution. Still – best tool for the job and all that. It’s a pity Krang or MT weren’t as easy & simple to set up.

I still have all the old entries deep in the bowels of my database but I need to spend a little time to port them over to WP. I figure it’s better to have the new system up and running first rather than hold it back until everything’s ready – iterative agile development and all that. Yum!