Bookmarks for January 21st through February 5th

These are my links for January 21st through February 5th:

Development Communications

For a while now, more or less since I switched teams (from Core Web to Sequencing Informatics), I’ve wanted to write more about the work we do at Sanger. So much of it is absolutely cutting-edge research, and a very large proportion of that is poorly communicated both inside and outside the institute. Most of it is biology of course, which I know little about and couldn’t discuss in detail, GCSE being as far as I took things in that direction.

However, some of the great advances have been in big IT. We’re in the same ballpark as CERN’s high-energy physics and NASA’s astronomical data. Technology is something I understand and can talk about here.

So… I run the new sequencing technology pipeline development team. This means my team and I are responsible for ensuring efficient use of Sanger’s heavy investment in massively parallel sequencing instruments, primarily 28 Illumina Genome Analyzers. To do this we have a farm of 608 cores, a mix of 4- and 8-core Opteron blades with 8GB RAM, and a 320TB shared Lustre filesystem. It seems to be becoming easy for users and administrators at Sanger to toss these figures around, but the truth of the matter is that whilst this kit fits in only a handful of racks, it’s still a pretty big deal.

The blades run Linux, Debian Etch to be precise. The Illumina-distributed analysis pipeline (itself a mix of Perl, Python and C++) is held together with Perl applications (web and batch), which also cooperate RESTfully with a series of Rails LIMS applications developed by the Production Software team.

Roughly a terabyte of image data is spun off each of the 28 instruments every 2-3 days. The images are stacked and aligned, and sequences are basecalled from spot intensities. These short reads are then packaged up with quality values for each base and dropped into compressed result files of approximately 100MB, ready for secondary analysis (e.g. SNP-calling).
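To put those figures in perspective, here is a quick back-of-envelope sketch in Python using only the numbers quoted above. The function names are mine, and the "days to fill" figure rests on a deliberately unrealistic assumption: that the whole shared filesystem is given over to raw images and nothing is ever deleted. It's an illustration of scale, not a description of how the storage is actually managed.

```python
# Back-of-envelope data rates using only the figures quoted in the post.
INSTRUMENTS = 28       # Illumina Genome Analyzers
TB_PER_RUN = 1.0       # "roughly a terabyte" of images per instrument per run
RUN_DAYS = (2, 3)      # each run takes 2-3 days
FILESYSTEM_TB = 320    # shared Lustre capacity


def daily_image_tb(run_days):
    """Aggregate image data produced per day across all instruments, in TB."""
    return INSTRUMENTS * TB_PER_RUN / run_days


def days_to_fill(run_days):
    """Days until the filesystem is full, ASSUMING images are never deleted."""
    return FILESYSTEM_TB / daily_image_tb(run_days)


for d in RUN_DAYS:
    print(f"{d}-day runs: {daily_image_tb(d):.1f} TB/day, "
          f"filesystem full in about {days_to_fill(d):.0f} days")
```

In other words, somewhere between 9 and 14 terabytes of images a day, enough to fill the entire filesystem in roughly three to five weeks if left alone. That goes some way to explaining why the images are reduced to small compressed result files.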

More to come later, but for now the take-home message is that the setup we’re using is, in my opinion, a fair triumph and definitely one to be proud of. It’s been a (fairly) harmonious marriage of tremendous hardware savvy from the systems group and the rapid turnaround of agile software development from Sequencing Informatics, of which I’m pleased to be a part.