Content Delivery Network (CDN) using Linode VPS

This month one of the neat things I’ve done was to set up a small content delivery network (CDN) for speedy downloading of files across the globe. For one reason and another (mostly the difficulty in doing this purely with DNS and the desire not to use AWS), I opted to do this using my favourite VPS provider, Linode. All in all (and give or take DNS propagation time) I reckon it’s possible to deploy a multi-site CDN in under 30 minutes given a bit of practice. Not too shabby!

For this recipe you will need:

  1. Linode account
  2. A domain name and DNS management

What you’ll end up with:

  1. 3x Ubuntu 12.04 LTS VPS, one each in London, Tokyo and California
  2. 3x NodeBalancers, one each in London, Tokyo and California
  3. 1x user-facing general web address
  4. 3x continent-facing web addresses

I’m going to use “mycdn.com” wherever I refer to my DNS / domain. You should substitute your domain name wherever you see it.

So, firstly log in to Linode.

Create three new Linode 1024 small VPSes (or whatever size you think you’ll need). I set mine up as Ubuntu 12.04 LTS with 512MB swap but otherwise nothing special. Set one each to be in London, Tokyo and Fremont. Set the root password on each. Under “Settings”, give each VPS a label. I called mine vps-<city>-01. Under “Remote Settings”, give each a private IP and note them down together with the VPS/data centre they’re in.

At this point it’s also useful (but not strictly necessary) to give each node a DNS name (an A record pointing at its public IP), just so you can log in to them easily by name later.
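
For example, and these names and addresses are only placeholders for illustration, the entries in your DNS manager might look like:

vps-london-01.mycdn.com.    IN  A  203.0.113.10
vps-tokyo-01.mycdn.com.     IN  A  203.0.113.20
vps-fremont-01.mycdn.com.   IN  A  203.0.113.30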

Boot all three machines and check you can log in to each of them. I find it useful at this point to run:

apt-get update ; apt-get dist-upgrade

You can also now install Apache and mod_geoip on each node:

apt-get install apache2 libapache2-mod-geoip
a2enmod include
a2enmod rewrite

You should now be able to point a web browser at each VPS (public IP or the DNS name you just set up) in turn and see the default Apache “It works!” page.
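
If you prefer the command line, a quick check from your own machine does the same job; the hostnames here are the placeholder labels from earlier:

curl -I http://vps-london-01.mycdn.com/
curl -I http://vps-tokyo-01.mycdn.com/
curl -I http://vps-fremont-01.mycdn.com/

Each should come back with an HTTP/1.1 200 OK and the default page headers.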

Ok.. still with me? Next we’ll ask Linode to fire up three NodeBalancers, again one in each of the data centres, one per VPS. I labelled mine cdn-lb-<city>-01. Configure each with a single port, 80, leaving the default settings for now. Add a host to each NodeBalancer with the private IP of its VPS and the port, e.g. 192.168.128.123:80. Note that each VPS hasn’t yet been configured to listen on its private interface, so each NodeBalancer won’t yet recognise its host as being up.

Ok. Let’s fix those private interfaces. SSH into each VPS using the root account and the password you set earlier. Edit /etc/network/interfaces and add:

auto eth0:1
iface eth0:1 inet static
	address <VPS private address here>
	netmask <VPS private netmask here>

Note that your private netmask is very unlikely to be the 255.255.255.0 of your home network, and yes, this does make a difference. Once that configuration is in, you can:

ifup eth0:1
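
It’s worth a quick sanity check that the alias interface has come up and that Apache is reachable on it; assuming Apache is still listening on all interfaces (the Ubuntu default), something like:

ifconfig eth0:1
netstat -plnt | grep ':80'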

Now we can add DNS names for each NodeBalancer. Take the public IP of each NodeBalancer over to your DNS manager and add a meaningfully named A record for each one. I used continental regions americas, apac and europe, but you might prefer to be more specific than that (e.g. us-west, eu-west, …). Once the DNS propagates you should be able to see each of your Apache “It works!” pages again in your browser, but this time the traffic is running through the NodeBalancer (you might need to wait a few seconds before the NodeBalancer notices the VPS is now up).
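
A quick way to confirm the new names have gone in and propagated (the continent names here are the ones I chose; substitute your own):

dig +short europe.mycdn.com
dig +short apac.mycdn.com
dig +short americas.mycdn.com

Each should return the public IP of the matching NodeBalancer.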

Ok so let’s take stock. We have three VPSes, each behind a NodeBalancer and each running a web server. We could stop here and just present a homepage to each user telling them to manually select their local mirror – and some sites do that, but we can do a bit better.

Earlier we installed libapache2-mod-geoip. This includes a (free) database from MaxMind which maps IP address blocks to the continents they’re allocated to (via the ISP who’s bought them). The Apache module takes the database and sets a series of environment variables for each and every visitor IP. We can use this to have a good guess at roughly where a visitor is and bounce them out to the nearest of our NodeBalancers – magic!
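
If you’d like to see what the module is actually reporting before wiring it into any rewrite rules, one low-effort sketch (not part of the setup proper) is to log the continent code alongside each request. Once GeoIPEnable is on, as it will be in the virtual host we’re about to create, Apache’s standard %{VAR}e log escape can pull the variable out; dropped into that virtual host it looks like:

LogFormat "%h %l %u %t \"%r\" %>s %b continent=%{GEOIP_CONTINENT_CODE}e" geoip
CustomLog /var/log/apache2/geoip.log geoip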

So, let’s poke the Apache configuration a bit. rm /etc/apache2/sites-enabled/000-default. Create a new file /etc/apache2/sites-available/mirror.mycdn.com and give it the following contents:

<VirtualHost *:80>
	ServerName mirror.mycdn.com
	ServerAlias *.mycdn.com
	ServerAdmin webmaster@mycdn.com

	DocumentRoot /mirror/htdocs

	DirectoryIndex index.shtml index.html

	GeoIPEnable     On
	GeoIPScanProxyHeaders     On

	RewriteEngine     On

	RewriteCond %{HTTP_HOST} !americas.mycdn.com
	RewriteCond %{ENV:GEOIP_CONTINENT_CODE} NA|SA
	RewriteRule (.*) http://americas.mycdn.com$1 [R=permanent,L]

	RewriteCond %{HTTP_HOST} !apac.mycdn.com
	RewriteCond %{ENV:GEOIP_CONTINENT_CODE} AS|OC
	RewriteRule (.*) http://apac.mycdn.com$1 [R=permanent,L]

	RewriteCond %{HTTP_HOST} !europe.mycdn.com
	RewriteCond %{ENV:GEOIP_CONTINENT_CODE} EU|AF
	RewriteRule (.*) http://europe.mycdn.com$1 [R=permanent,L]

	<Directory />
		Order deny,allow
		Deny from all
		Options None
	</Directory>

	<Directory /mirror/htdocs>
		Order allow,deny
		Allow from all
		Options IncludesNoExec
	</Directory>
</VirtualHost>

Now ln -s /etc/apache2/sites-available/mirror.mycdn.com /etc/apache2/sites-enabled/ .
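
Before going further it doesn’t hurt to let Apache check the syntax; expect it to grumble that the DocumentRoot doesn’t exist yet, which we fix next:

apachectl configtest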

mkdir -p /mirror/htdocs to make your new document root and add a file called index.shtml there. The contents should look something like:

<html>
 <body>
  <h1>MyCDN Test Page</h1>
  <h2><!--#echo var="HTTP_HOST" --></h2>
<!--#set var="mirror_eu"       value="http://europe.mycdn.com/" -->
<!--#set var="mirror_apac"     value="http://apac.mycdn.com/" -->
<!--#set var="mirror_americas" value="http://americas.mycdn.com/" -->

<!--#if expr="${GEOIP_CONTINENT_CODE} == AF"-->
 <!--#set var="continent" value="Africa"-->
 <!--#set var="mirror" value="${mirror_eu}"-->

<!--#elif expr="${GEOIP_CONTINENT_CODE} == AS"-->
 <!--#set var="continent" value="Asia"-->
 <!--#set var="mirror" value="${mirror_apac}"-->

<!--#elif expr="${GEOIP_CONTINENT_CODE} == EU"-->
 <!--#set var="continent" value="Europe"-->
 <!--#set var="mirror" value="${mirror_eu}"-->

<!--#elif expr="${GEOIP_CONTINENT_CODE} == NA"-->
 <!--#set var="continent" value="North America"-->
 <!--#set var="mirror" value="${mirror_americas}"-->

<!--#elif expr="${GEOIP_CONTINENT_CODE} == OC"-->
 <!--#set var="continent" value="Oceania"-->
 <!--#set var="mirror" value="${mirror_apac}"-->

<!--#elif expr="${GEOIP_CONTINENT_CODE} == SA"-->
 <!--#set var="continent" value="South America"-->
 <!--#set var="mirror" value="${mirror_americas}"-->
<!--#endif -->
<!--#if expr="${GEOIP_CONTINENT_CODE}"-->
 <p>
  You appear to be in <!--#echo var="continent"-->.
  Your nearest mirror is <a href="<!--#echo var="mirror" -->"><!--#echo var="mirror" --></a>.
 </p>
 <p>
  Or choose from one of the following:
 </p>
<!--#else -->
 <p>
  Please choose your nearest mirror:
 </p>
<!--#endif -->

<ul>
 <li><a href="<!--#echo var="mirror_eu"       -->"><!--#echo var="mirror_eu"        --></a> Europe (London)</li>
 <li><a href="<!--#echo var="mirror_apac"     -->"><!--#echo var="mirror_apac"      --></a> Asia/Pacific (Tokyo)</li>
 <li><a href="<!--#echo var="mirror_americas" -->"><!--#echo var="mirror_americas"  --></a> USA (Fremont, CA)</li>
</ul>

<pre style="color:#ccc;font-size:smaller">
http-x-forwarded-for=<!--#echo var="HTTP_X_FORWARDED_FOR" -->
GEOIP_CONTINENT_CODE=<!--#echo var="GEOIP_CONTINENT_CODE" -->
</pre>
 </body>
</html>

Then run apachectl restart to pick up the new virtual host and visit each of your NodeBalancer names in turn. The ones which aren’t local to you should redirect you out to your nearest server.
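
Because GeoIPScanProxyHeaders is on, you can also exercise the redirects from a single machine by spoofing X-Forwarded-For; a rough sketch, where you substitute a real address from the continent you want to simulate:

curl -sI -H 'X-Forwarded-For: <an IP address in North America>' http://europe.mycdn.com/ | grep -i '^Location'

That should come back with Location: http://americas.mycdn.com/ even though you asked the European NodeBalancer.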

Pretty neat! The last step is to add a user-facing name, mirror.mycdn.com in my case, with round-robin (DNS-RR) A records pointing at the addresses of the three NodeBalancers. Now set up a cron job to rsync your content to the three target VPSes, or a script to push content on demand, as sketched below. Job done!
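
As a minimal sketch of that push, assuming the per-node names from earlier and that you’re happy syncing as root over SSH (a dedicated sync user with keys would be better), the crontab entries on whichever machine holds the master copy might look like:

# /etc/crontab on the master copy of /mirror/htdocs (hostnames are placeholders)
0 * * * *  root  rsync -az --delete /mirror/htdocs/ root@vps-london-01.mycdn.com:/mirror/htdocs/
0 * * * *  root  rsync -az --delete /mirror/htdocs/ root@vps-tokyo-01.mycdn.com:/mirror/htdocs/
0 * * * *  root  rsync -az --delete /mirror/htdocs/ root@vps-fremont-01.mycdn.com:/mirror/htdocs/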

For extra points:

  1. Clone another VPS behind each NodeBalancer so that each continent is fault tolerant, meaning you can reboot one VPS in each pair without losing continental service.
  2. Explore whether it’s safe to add the public IP of one NodeBalancer to the Host configuration of a NodeBalancer on another continent, effectively making a resilient loop.

Exa-, Peta-, Tera-scale Informatics: Are *YOU* in the cloud yet?

http://www.flickr.com/photos/pagedooley/2511369048/

One of the aspects of my job over the last few years, both at Sanger and now at Oxford Nanopore Technologies, has been the management of tera-, verging on peta-, scale data on a daily basis.

Various methods of handling filesystems this large have been around for a while now and I won’t go into them here. Building these filesystems is actually fairly straightforward as most of them are implemented as regular, repeatable units – great for horizontal scale-out.

No, what makes this a difficult problem isn’t the sheer volume of data, it’s the amount of churn. Churn can be defined as the rate at which new files are added and old files are removed.

To illustrate – when I left Sanger, if memory serves, we were generally recording around a terabyte of new data a day. The staging area there was around 0.5 Petabytes (using the Lustre filesystem) but didn’t balance correctly across the many disks. This meant we had to keep the utilised space below around 90% for fear of filling up an individual storage unit (and leading to unexpected errors). Ok, so that’s 450TB. That left 45 days of storage – one and a half months assuming no slack.

Fair enough. Sort of. Collect the data onto the staging area, analyse it there and shift it off. Well, that’s easier said than done – you can shift it off onto slower, cheaper storage, but that’s generally archival space so ideally you only keep raw data there. If the raw data are too big then you keep the primary analysis and ditch the raw. But there’s a problem with that:

  • Lots of clever people want to squeeze as much interesting stuff out of the raw data as possible using new algorithms.
  • They also keep finding things wrong with the primary analyses and so want to go back and reanalyse.
  • Added to that there are often problems with the primary analysis pipeline (bleeding-edge software bugs etc.).
  • That’s not even mentioning the fact that nobody ever wants to delete anything.

As there’s little or no slack in the system, very often people are too busy to look at their own data as soon as it’s analysed, so it might sit there broken for a week or four. What happens then is a scrum for compute resources so everyone can analyse everything before the remaining two weeks of staging storage runs out. Then even if problems are found it can be too late to go back and reanalyse, because there’s a shortage of space for new runs, and stopping the instruments because you’re out of space is a definite no-no!

What the heck? Organisationally this isn’t cool at all. Situations like this are only going to worsen! The technologies are improving all the time – run-times are increasing, read-lengths are increasing, base-quality is increasing, analysis is becoming better and more instruments are becoming available to more people who are using them for more things. That’s a many, many-fold increase in storage requirements.

So how to fix it? Well, I can think of at least one pretty good way. Don’t invest in on-site long-term staging or scratch storage. If you’re worried, by all means sort out an awesome backup system, but keep it nearline or offline on a decent tape archive or something, and absolutely do not allow user access. Instead of long-term staging storage, buy your company the fattest Internet pipe it can handle. Invest in connectivity, then simply invest in cloud storage. There are enough providers out there now to make this a competitive and interesting marketplace with opportunities for economies of scale.

What does this give you? Well, many benefits – here are a few:

  • virtually unlimited storage
  • only pay for what you use
  • accountable costs – know exactly how much each project needs to invest
  • managed by storage experts
  • flexible computing attached to storage on-demand
  • no additional power overheads
  • no additional space overheads

Most of those I more or less take for granted these days. The one I find interesting at the moment is the costing issue. It can be pretty hard to hold one centralised storage area accountable for different groups – they’ll often pitch in for a proportion of the whole based on their estimated use compared to everyone else’s. With accountable storage offered by the cloud, each group can manage and pay for its own space. The costs are transparent to them and the responsibility is delegated away from central management. I think that’s an extremely attractive prospect!

The biggest argument I hear against cloud storage & computing is that your top secret, private data is in someone else’s hands. Aside from my general dislike of secret data, I still don’t believe this is a good argument these days. There are enough methods for handling encryption and private networking that it pretty much becomes a non-issue. Encrypt the data on-site, store the keys in your own internal database, ship the data to the cloud, and when you need to run analysis fetch the appropriate keys over an encrypted link, decode the data on demand, re-encrypt the results and ship them back. Sure, the encryption overheads add expense to the operation, but I think the costs are far outweighed by the benefits.
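
As a rough sketch of that round trip, using GnuPG for the symmetric encryption and with cloud-put, cloud-get and fetch-key standing in as hypothetical wrappers around whichever provider and key database you actually use:

# on-site: encrypt with a per-dataset key that never leaves your own database
gpg --symmetric --cipher-algo AES256 --batch \
    --passphrase-file /secure/keys/run1234.key \
    --output rawdata.tar.gpg rawdata.tar
cloud-put rawdata.tar.gpg                    # hypothetical upload wrapper; ship ciphertext only

# in the cloud: fetch the key over an encrypted link, decode, analyse, re-encrypt results
cloud-get rawdata.tar.gpg                    # hypothetical download wrapper
fetch-key run1234 > /dev/shm/run1234.key     # hypothetical key fetch over TLS
gpg --decrypt --batch --passphrase-file /dev/shm/run1234.key \
    --output rawdata.tar rawdata.tar.gpg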

Massively Parallel Sequence Archive

For some time now at Sanger we’ve been looking at the problems and solutions involved with building services supporting what are likely to become some of the biggest databases on the planet. The biggest problem is there aren’t too many people doing this kind of thing who are willing to talk about it.

The data we’re storing fall into two categories: Short Read Format (SRF) files containing sequence, quality and trace data (~10Gb per lane), and FastQ files containing sequence and quality (~1Gb per lane).

Our requirements for these data are fundamentally for two different systems. One is a long-term archival system for SRF, responsibility for which will eventually shift to the EBI. The second is, for me at least, the more interesting system.

The short-term storage of reads and qualities (and possibly also of selected alignments) isn’t the biggest problem – that honour goes to the fast, parallel retrieval of the same. The underlying data store needs to grow at a respectable 12TB per year and serve maybe a hundred simultaneous users requesting up to 1000 sequences per second.

Transfer times for individual reads are small, and as a result are disproportionately affected by overheads like TCP setup times, HTTP header payloads and, not least, index seek times.
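
Those overheads are easy to eyeball with curl’s timing variables; a quick sketch against a purely hypothetical read-retrieval URL:

curl -s -o /dev/null \
     -w 'connect=%{time_connect}s  first-byte=%{time_starttransfer}s  total=%{time_total}s\n' \
     http://sequence-archive.example.org/read/IL12_0123:3:45:678:910

For a payload of a few hundred bytes the connect and first-byte times dominate the total, which is exactly the effect described above.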

We’re looking at a few horizontally-scaling solutions for performing these kinds of jobs – the most obvious are tools like MapReduce and equivalents like Hadoop running with Nutch. My personal favourite, and the one I’m holding out for, is MogileFS from the same people who brought you Memcached. Time to get benchmarking!

Updated: Loved this via Brad