bigdata - psyphi.net blog

I’ve been to two events in the past two weeks which have started me thinking harder about the way we protect and measure our enterprise systems.

The first of the two events was the fourth Splunk Live in St. Paul’s, London last week. I’ve been a big fan of Splunk for a few years but I’ve never really tried it out in production. The second was InfoSec at Earl’s Court. More about that one later.

What is Splunk?

To be honest, splunk is different things to different people. Since inception it’s had great value as a log collation and event alerting tool for systems administrators as that was what it was originally designed to do. However as both DJ Skillman and Godfrey Sullivan pointed out, Splunk has grown into a lot more than that. It solved a lot of “Big Data” (how I hate that phrase) problems before Big Data was trendy, taking arbitrary unstructured data sources structuring them in useful ways, indexing the hell out of them and adding friendly, near-real-time reporting and alerting on top. Nowadays, given the right data sources, Splunk is capable of providing across-the-board Operational Intelligence, yielding tremendous opportunities in measuring value of processes and events.

How does it work?

In order to make the most out of a Splunk installation you require at least three basic things :-

A data source -Â anything from a basic syslog or Apache web server log to a live high level ERP logistics event feed or even entire code commits
An enrichment process – something to tag packets, essentially to assign value to indexed fields, allowing the association of fields from different feeds, e.g. tallying new orders with a customer database with stock keeping perhaps.
A report – a canned report, presented on a dashboard for your CFO for example, or an email alert to tell your IT manager that someone squirting 5 day experiments in at the head of the analysis pipeline is going to go over-budget on your AWS analysis pipeline in three days’ time.

How far can you go with it?

Well, here’s a few of the pick ‘n’ mix selection of things I’d like to start indexing as soon as we sort out a) the restricted data limits of our so-far-free Splunk installation and b) what’s legal to do

Door id access (physical site presence)
VPN logins (virtual site presence)
Wifi device registrations (guest, internal, whatever)
VoIP + PSTN call logs (number, duration)
Environmentals – temperature, humidity of labs, offices, server rooms
System logs for everything (syslog, authentication, Apache, FTPd, MySQL connections, Samba, the works)
SGE job logs with user & project accounting
Application logs for anything we’ve written in house
Experimental metadata (who ran what when, where, why)
Domains for all incoming + outgoing mail, plus mail/attachment weights (useful for spotting outliers exfiltrating data)
Firewall: accepted incoming connections
Continuous Integration test results (software project, timings, memory, cpu footprints)
SVN/Git code commits (yes, it’s possible to log the entire change set)
JIRA tickets (who, what, when, project, component, priority)
ERP logs (supply chain, logistics, stock control, manufacturing lead times)
CRM + online store logs (customer info, helpdesk cases, orders)
anything and everything else with vaguely any business value

I think it’s pretty obvious that all this stuff taken together constitutes what most people call Big Data these days. There’s quite a distinction between that sort of mixed relational data and the plainer “lots of data” I deal with day to day, experimental data in the order of a terabyte-plus per device per day.

For some time now at Sanger we’ve been looking at the problems and solutions involved with building services supporting what are likely to become some of the biggest databases on the planet. The biggest problem is there aren't too many people doing this kind of thing and who are willing to talk about it.

The data we’re storing falls into two categories. Short Read Format (SRF) files containing sequence, quality and trace (~10Gb per lane) data and FastQ containing sequence and quality (~1Gb per lane).

Our requirements for these data are fundamentally for two different systems. One is a long-term archival system for SRF, the responsibility for which will eventually be shifted to the EBI . The second is, for me at least, the more interesting system –

The short-term storage of reads and qualities (and possibly also for selected alignments) isn’t the biggest problem – that honour is left to the fast, parallel retrieval of the same. The underlying data store needs to grow at a respectable 12TB per year and serve maybe a hundred simultaneous users requesting up to 1000 sequences per second.

Transfer times for reads are small but as a result are disproportionately affected by artefacts like TCP setup times, HTTP header payloads and certainly index seek times.

We’re looking at a few horizontally-scaling solutions for performing these kinds of jobs – the most obvious are tools like MapReduce and equivalents like Hadoop running with Nutch . My personal favourite and the one I’m holding out for is MogileFS from the same people who brought you Memcached . Time to get benchmarking!

Updated: Loved this via Brad

Category: bigdata

Systems & Security Tools du jour

What is Splunk?

How does it work?

How far can you go with it?

Massively Parallel Sequence Archive