I’ve been to two events in the past two weeks that have started me thinking harder about the way we protect and measure our enterprise systems.
The first of the two events was the fourth Splunk Live in St. Paul’s, London last week. I’ve been a big fan of Splunk for a few years but I’ve never really tried it out in production. The second was InfoSec at Earl’s Court. More about that one later.
What is Splunk?
To be honest, Splunk is different things to different people. Since inception it’s had great value as a log collation and event alerting tool for systems administrators, as that was what it was originally designed to do. However, as both DJ Skillman and Godfrey Sullivan pointed out, Splunk has grown into a lot more than that. It solved a lot of “Big Data” (how I hate that phrase) problems before Big Data was trendy: taking arbitrary unstructured data sources, structuring them in useful ways, indexing the hell out of them and adding friendly, near-real-time reporting and alerting on top. Nowadays, given the right data sources, Splunk is capable of providing across-the-board Operational Intelligence, yielding tremendous opportunities for measuring the value of processes and events.
How does it work?
In order to make the most of a Splunk installation you require at least three basic things:
- A data source – anything from a basic syslog or Apache web server log to a live high level ERP logistics event feed or even entire code commits
- An enrichment process – something to tag events, essentially assigning meaning to indexed fields so that fields from different feeds can be associated, e.g. tallying new orders against a customer database and stock-keeping records
- A report – a canned report presented on a dashboard for your CFO, for example, or an email alert telling your IT manager that someone squirting five-day experiments into the head of the analysis pipeline is going to blow your AWS budget in three days’ time.
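The three stages above can be sketched in plain Python. This isn’t Splunk itself – just the shape of the pipeline: parse raw log lines into fields, enrich each event against a second feed (here a made-up IP-to-customer lookup), then run a canned report over the result. The Apache-style log lines, the lookup table and the 5xx-error alert are all invented for illustration.

```python
import re
from collections import Counter

# 1. Data source: raw Apache-style access-log lines.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def parse(line):
    """Turn one raw log line into a dict of indexed fields."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

# 2. Enrichment: tag each event with a value from another feed.
#    (Hypothetical lookup table; in real life this might be a CRM export.)
CUSTOMERS = {"10.0.0.1": "acme-corp"}

def enrich(event):
    event["customer"] = CUSTOMERS.get(event["ip"], "unknown")
    return event

# 3. Report/alert: flag any customer racking up too many 5xx responses.
def alert(events, threshold=2):
    errors = Counter(e["customer"] for e in events if e["status"].startswith("5"))
    return [cust for cust, n in errors.items() if n >= threshold]

lines = [
    '10.0.0.1 - - [01/May/2012:10:00:00 +0000] "GET /a HTTP/1.1" 500',
    '10.0.0.1 - - [01/May/2012:10:00:01 +0000] "GET /b HTTP/1.1" 503',
    '10.0.0.2 - - [01/May/2012:10:00:02 +0000] "GET /c HTTP/1.1" 200',
]
events = [enrich(e) for e in map(parse, lines) if e]
print(alert(events))  # ['acme-corp']
```

In Splunk the first step is handled by its index-time parsing, the second by lookups and field extractions, and the third by saved searches and dashboards – but the division of labour is the same.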
How far can you go with it?
Well, here are a few of the pick ‘n’ mix selection of things I’d like to start indexing as soon as we sort out a) the restricted data limits of our so-far-free Splunk installation and b) what’s legal to do:
- Door ID access (physical site presence)
- VPN logins (virtual site presence)
- Wifi device registrations (guest, internal, whatever)
- VoIP + PSTN call logs (number, duration)
- Environmentals – temperature and humidity of labs, offices and server rooms
- System logs for everything (syslog, authentication, Apache, FTPd, MySQL connections, Samba, the works)
- SGE job logs with user & project accounting
- Application logs for anything we’ve written in house
- Experimental metadata (who ran what when, where, why)
- Domains for all incoming + outgoing mail, plus mail/attachment weights (useful for spotting outliers exfiltrating data)
- Firewall: accepted incoming connections
- Continuous Integration test results (software project, timings, memory and CPU footprints)
- SVN/Git code commits (yes, it’s possible to log the entire change set)
- JIRA tickets (who, what, when, project, component, priority)
- ERP logs (supply chain, logistics, stock control, manufacturing lead times)
- CRM + online store logs (customer info, helpdesk cases, orders)
- anything and everything else with vaguely any business value
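The mail-weight idea above, for instance, boils down to simple outlier detection: once attachment sizes are indexed per sender, anyone whose total sits far outside the population norm stands out. A minimal sketch, with invented sample figures and an illustrative z-score cutoff (tiny samples like this one cap how large a z-score can get, so the cutoff is deliberately low):

```python
import statistics

def outliers(weights, z_cutoff=1.5):
    """weights: {sender: total attachment bytes}. Return senders whose
    total is more than z_cutoff population standard deviations above the mean."""
    values = list(weights.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [s for s, w in weights.items() if (w - mean) / stdev > z_cutoff]

# Hypothetical per-sender attachment totals for one day, in bytes.
mail = {
    "alice": 2_000_000,
    "bob": 1_500_000,
    "carol": 1_800_000,
    "dave": 2_100_000,
    "mallory": 90_000_000,
}
print(outliers(mail))  # ['mallory']
```

Splunk’s own stats commands can do this sort of aggregation directly over the indexed mail logs; the point is just that the alert logic is trivial once the data is in one place.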
I think it’s pretty obvious that all this stuff taken together constitutes what most people call Big Data these days. There’s quite a distinction between that sort of mixed relational data and the plainer “lots of data” I deal with day to day – experimental data on the order of a terabyte-plus per device per day.