One of the aspects of my job over the last few years, both at Sanger and now at Oxford Nanopore Technologies, has been the management of tera-, verging on peta-, scale data on a daily basis.
Various methods of handling filesystems this large have been around for a while now and I won’t go into them here. Building these filesystems is actually fairly straightforward as most of them are implemented as regular, repeatable units – great for horizontal scale-out.
No, what makes this a difficult problem isn’t the sheer volume of data, it’s the amount of churn. Churn can be defined as the rate at which new files are added and old files are removed.
To illustrate – when I left Sanger, if memory serves, we were generally recording around a terabyte of new data a day. The staging area there was around 0.5 Petabytes (using the Lustre filesystem) but didn’t balance correctly across the many disks. This meant we had to keep the utilised space below around 90% for fear of filling up an individual storage unit (and leading to unexpected errors). Ok, so that’s 450TB. That left 45 days of storage – one and a half months assuming no slack.
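The headroom arithmetic above can be sketched as a small calculation. This is only an illustration: the function name and the intake rates are assumptions, and net intake (new data minus deletions) is the figure to adjust. Under this simple model the 45-day figure corresponds to a net intake of roughly 10 TB a day against 450 TB of usable space.

```python
def days_of_headroom(capacity_tb: float, fill_ceiling: float, net_intake_tb_per_day: float) -> float:
    """Days until the staging area reaches its safe fill ceiling.

    capacity_tb            total size of the staging filesystem
    fill_ceiling           fraction of capacity considered safe to use (e.g. 0.9)
    net_intake_tb_per_day  new data added minus old data removed, per day (assumed value)
    """
    if net_intake_tb_per_day <= 0:
        raise ValueError("net intake must be positive for the space to run out")
    usable_tb = capacity_tb * fill_ceiling
    return usable_tb / net_intake_tb_per_day


if __name__ == "__main__":
    # 0.5 PB staging area kept below 90% full, at a few illustrative intake rates.
    for rate in (1, 5, 10):
        print(f"{rate:>2} TB/day -> {days_of_headroom(500, 0.9, rate):.0f} days")
```

The point of the sketch is that headroom depends on churn, not on total volume: any deletion that lowers the net intake buys time, and any delay in deleting eats it.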
Fair enough. Sort of. Collect the data onto the staging area, analyse it there and shift it off. Well, that’s easier said than done – you can shift it off onto slower, cheaper storage but that’s generally archival space so ideally you only keep raw data. If the raw data are too big then you keep the primary analysis and ditch the raw. But there’s a problem with that:
- Lots of clever people want to squeeze as much interesting stuff out of the raw data as possible using new algorithms.
- They also keep finding things wrong with the primary analyses and so want to go back and reanalyse.
- Added to that, there are often problems with the primary analysis pipeline (bleeding-edge software bugs etc.).
- That’s not mentioning the fact that nobody ever wants to delete anything.
There’s little or no slack in the system, and very often people are too busy to look at their own data as soon as it’s analysed, so it might sit there broken for a week or four. What happens then is a scrum for compute resources so they can analyse everything before the remaining two weeks of staging storage are up. Then, even if problems are found, it can be too late to go back and reanalyse because there’s a shortage of space for new runs, and stopping the instruments because you’re out of space is a definite no-no!
What the heck? Organisationally this isn’t cool at all. Situations like this are only going to worsen! The technologies are improving all the time – run-times are increasing, read-lengths are increasing, base-quality is increasing, analysis is becoming better and more instruments are becoming available to more people who are using them for more things. That’s a many, many-fold increase in storage requirements.
So how to fix it? Well, I can think of at least one pretty good way. Don’t invest in on-site long-term staging or scratch storage. If you’re worried, by all means sort out an awesome backup system, but nearline it or offline it to a decent tape archive or something and absolutely do not allow user access. Instead of long-term staging storage, buy your company the fattest Internet pipe it can handle. Invest in connectivity, then simply invest in cloud storage. There are enough providers out there now to make this a competitive and interesting marketplace with opportunities for economies of scale.
What does this give you? Well, many benefits – here are a few:
- virtually unlimited storage
- only pay for what you use
- accountable costs – know exactly how much each project needs to invest
- managed by storage experts
- flexible computing attached to storage on-demand
- no additional power overheads
- no additional space overheads
Most of those I more-or-less take for granted these days. The one I find interesting at the moment is the costing issue. It can be pretty hard to account for the use of one centralised storage area by different groups – they’ll often pitch in for a proportion of the whole based on their estimated use compared to everyone else. With accountable storage offered by the cloud each group can manage and pay for their own space. The costs are transparent to them and the responsibility has been delegated away from central management. I think that’s an extremely attractive prospect!
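To make the costing difference concrete, here is a rough sketch contrasting the two models. The group names, usage figures and the price per terabyte-month are all made up for illustration and do not reflect any real provider's pricing.

```python
def proportional_split(total_cost: float, estimated_tb_by_group: dict) -> dict:
    """Central model: one shared bill divided by each group's *estimated* share."""
    total_estimate = sum(estimated_tb_by_group.values())
    return {
        group: round(total_cost * tb / total_estimate, 2)
        for group, tb in estimated_tb_by_group.items()
    }


def metered_charges(used_tb_by_group: dict, price_per_tb_month: float) -> dict:
    """Cloud model: each group pays for the space it actually used this month."""
    return {
        group: round(tb * price_per_tb_month, 2)
        for group, tb in used_tb_by_group.items()
    }


if __name__ == "__main__":
    # Hypothetical numbers: both groups estimated equal use, but one used far more.
    estimates = {"group_a": 75, "group_b": 75}
    actual = {"group_a": 120, "group_b": 30}
    price = 20.0  # assumed price per TB per month

    print(proportional_split(sum(actual.values()) * price, estimates))
    print(metered_charges(actual, price))
```

With estimates, both groups pay the same even though their real usage differs by a factor of four; with metering, each bill follows actual use.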
The biggest argument I hear against cloud storage & computing is that your top secret, private data is in someone else’s hands. Aside from my general dislike of secret data, these days I still don’t believe this is a good argument. There are enough methods for handling encryption and private networking that this pretty much becomes a non-issue. Encrypt the data on-site, store the keys in your own internal database, ship the data to the cloud and, when you need to run analysis, fetch the appropriate keys over an encrypted link, decrypt the data on demand, re-encrypt the results and ship them back. Sure, the encryption overheads add expense to the operation, but I think those costs are far outweighed by the benefits.
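The encrypt-on-site, keys-stay-home workflow can be sketched as below. Everything here is an assumption for illustration: `KEY_STORE` is a dictionary standing in for the internal key database, the function names are hypothetical, and the cipher is a toy construction built from the standard library (SHA-256 keystream plus an HMAC integrity tag) so the example runs without extra packages. It is not suitable for real data; a real deployment would use a vetted authenticated cipher such as AES-GCM from an established library.

```python
import hashlib
import hmac
import secrets

# Hypothetical stand-in for the internal key database; keys never leave the site.
KEY_STORE = {}


def _subkeys(key: bytes):
    """Derive separate encryption and integrity keys from one master key."""
    return hashlib.sha256(key + b"enc").digest(), hashlib.sha256(key + b"mac").digest()


def _keystream(enc_key: bytes, nonce: bytes, length: int) -> bytes:
    """TOY keystream (illustration only): SHA-256 over key, nonce and a counter."""
    blocks = [
        hashlib.sha256(enc_key + nonce + counter.to_bytes(8, "big")).digest()
        for counter in range((length + 31) // 32)
    ]
    return b"".join(blocks)[:length]


def encrypt_on_site(dataset_id: str, plaintext: bytes) -> bytes:
    """Generate a per-dataset key, keep it locally, return the blob to ship out."""
    key = secrets.token_bytes(32)
    KEY_STORE[dataset_id] = key
    enc_key, mac_key = _subkeys(key)
    nonce = secrets.token_bytes(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + body, hashlib.sha256).digest()
    return nonce + tag + body


def decrypt_on_demand(dataset_id: str, blob: bytes) -> bytes:
    """Look up the key for this dataset, verify integrity, then decrypt."""
    enc_key, mac_key = _subkeys(KEY_STORE[dataset_id])
    nonce, tag, body = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(mac_key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("ciphertext failed its integrity check")
    stream = _keystream(enc_key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


if __name__ == "__main__":
    blob = encrypt_on_site("run-001", b"ACGTACGT")   # this is what goes to the cloud
    print(decrypt_on_demand("run-001", blob))        # analysis side, after fetching the key
```

The design point is that only the opaque blob is ever stored remotely; anyone holding the blob without access to the key database learns nothing useful, and tampering is detected before decryption.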
Hi Roger,
Interesting post and I think it makes sense, especially for small shops that can’t afford the hardware and personnel a research centre such as the Sanger has, but I fear it might still be science fiction at the moment.
A very naive way to look at it would be to say you need Internet connectivity of slightly over 100 Mb/s to be able to sync 1 TB in a day (8*1024^4/(24*60^2) ≈ 1.02*10^8 bits per second).
However, if you refer to the excellent interview with Guy Coates that you pointed your Twitter followers at (http://www.bio-itworld.com/issues/2009/nov-dec/wellcome.html), you will notice that they seem to get only 5-10% of their maximum bandwidth (though 5-10% of 2 Gb/s should still be enough to reach around 100 Mb/s or more), and that it gets worse the farther away they go.
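The back-of-the-envelope numbers can be checked with a short script. It assumes 1 TB means 1024^4 bytes and that "Mb/s" means 10^6 bits per second; the function names are just for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 ** 2


def required_mbps(bytes_per_day: int) -> float:
    """Sustained rate, in megabits per second, needed to move this much per day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


def effective_mbps(link_mbps: float, utilisation: float) -> float:
    """Throughput actually achieved when only a fraction of the link is usable."""
    return link_mbps * utilisation


if __name__ == "__main__":
    need = required_mbps(1024 ** 4)  # 1 TB per day, binary terabyte assumed
    print(f"needed for 1 TB/day: {need:.1f} Mb/s")
    for utilisation in (0.05, 0.10):
        have = effective_mbps(2000, utilisation)  # 2 Gb/s link
        print(f"{utilisation:.0%} of 2 Gb/s: {have:.0f} Mb/s")
```

Under these assumptions the low end of the 5-10% range lands right at the threshold, so there is very little margin if achieved throughput drops any further.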
Guy mentioned how difficult it is to say where the bottleneck lies, and I fear it might actually be the I/O or network bandwidth of the cloud storage provider. Alas, I don’t have more info on this. Maybe you will be able to use your contacts to retrieve some more up-to-date information and write a follow-up post?
On a different topic, I am not sure after reading your post whether you advise retrieving the dataset and processing it locally in in-house facilities, or whether you foresee researchers working on the latest dataset internally (i.e. while it’s hot data) and then using the cloud to work on a copy of the cold data?
I think it would actually make a lot of sense, since in that case you can tune your on-site HPC/infrastructure for the work you have to do on a daily basis on the hot dataset. One could argue that work on cold data is probably exceptional, and that reprocessing it on the in-house facilities would disrupt the normal flow of operations, so it might make sense to process it in the cloud. There are, however, probably cases (say a pandemic or a global crisis) when you might want to download the whole set and process it internally, as I suspect it might be a factor or two faster than in-cloud processing, but that will probably be more the exception than the norm.
Cheers
Gildas