programming - psyphi.net blog

Make sure the website was up
Internal software development for projects without dedicated informatics support
Support for projects with dedicated informatics

When I started, back in 1999 things were pretty disorganised but in terms of user-requirements actually a little easier – projects had the odd CGI script but most data were shipped out using file dumps on the FTP site. You see back then and for the few years’ previous, it was the dawning of the world-wide-web and web-users were much happier being faced with an FTP/gopher file-listing of .gz (or more likely, uncompressed .fasta) files to download.

Back then we had a couple of small DEC servers which ran the external- and internal- (intranet) websites. Fine. Well, fine that is, until you want to make a change.

Revision Control: Manual

Ok. You want to make a change. You take your nph-blast_server.cgi and make a copy nph-blast_server2.cgi . You make your changes and test them on the external website. Great! It works! You mail a collaborator across the pond to try it out for bugs. Fab! Nothing found. Ok, so copy it back over nph-blast_server.cgi and everyone’s happy.

What’s wrong with this picture? Well, you remember that development copy? Firstly, it’s still there. You just multiplied your attack-vectors by two (assuming there are bugs in the script capable of being exploited). Secondly, and this is more harmful to long-term maintenance, that development copy is the URL you mailed your collaborator. It’s also the URL your collaborator mailed around to his 20-strong informatics team and they posted on bulletin boards and USENET groups for the rest of the world.

Luckily you have a dedicated and talented web-team who sort out this chaos using a pile of server redirects. Phew! Saved.

Now multiply this problem by the 150-or-so dedicated informatics developers on campus serving content through the core servers. Take that number and multiply it by the number of CGI scripts each developer produces a month.

That is then the number of server redirects which every incoming web request has to be checked against before it reaches its target page. Things can become pretty slow.

Enter the development (staging) service

What happens next is that the web support guys do something radical. They persuade all the web developers on site by hook or by crook that they shouldn’t be editing content on the live, production, public servers. Instead they should use an internal (and for special cases, IP-restricted-external-access) development service, test their content before pushing it live, then use a special command, let’s call it webpublish, to push everything live.

Now to the enlightened developer of today that doesn’t sound radical, it just sounds like common sense. You should have heard the wailing and gnashing of teeth!

Shared development

At this point I could, should go into the whys and wherefores of using revision control, but I’ll save that for another post. Instead I want to focus on the drawbacks of sharing. My feeling is that the scenario above is a fairly common one where there are many authors working on the same site. It works really well for static content, even when a CMS is used. Unfortunately it’s not so great for software development. The simple fact is that requirements diverge – both for the project and for the software stack. These disparate teams only converge in that they’re running on the same hardware, so why should the support team expect their software requirements to converge also?

Allow me to illustrate one of the problems.

Projects A and B are hosted on the same server. They use the same centrally-supported library L. A, B and L each have a version. They all work happily together at version A1B1L1. Now B needs a new feature, but to add it requires an upgrade to L2. Unfortunately the L2 upgrade breaks A1. Project A therefore is obliged to undertake additional (usually unforeseen) work just to retain current functionality.

Another situation is less subtle and involves shared-user access. For developers this is most likely the root superuser although in my opinion any shared account is equally bad. When using a common user it’s very difficult to know who made a change in the past, let alone who’s making a change right now. I observed a situation recently where two developers were simultaneously trying to build RPMs with rpmbuild which, by default, builds in a system location like /usr/share . Simultaneously trying to access the same folders leads to very unpredictable, unrepeatable results. Arguably the worst situation is when no errors are thrown during the build and neither developer notices!

Naturally a lot of the same arguments against shared development go for shared production too. The support matrix simply explodes with a few tens of applications each with different prerequisites.

Other options

Back in the day there were fewer options – one was left with always having to use relative paths and often having to discard all but the core system prerequisites in fear of them changing unexpectedly over time. Using relative paths is still a fairly inexpensive way to do things but sometimes it’s just too restrictive. There is another way…

Virtualisation is now commonplace. You probably cross-paths with a virtual machine every day without knowing it. They’re ubiquitous because they’re really, really useful. For our development purposes one core support member can build a standard, supported virtual machine image and post it on the intranet somewhere. All the other developers can take it, start their own instances of it and do all of their own development on their own hardware without fighting for common resources. Upgrades can be tested independently of one another. Machines can be restarted from scratch and so on. Once development is complete and given sufficient core resources, each developer can even bundle up their working image and ship it into production as is. No further core support required!

What tools can you use to do this? Parallels? Too commercial. VMWare? A bit lardy. Xen? Probably a bit too hard-core. KVM? Not quite mature enough yet. No, my current favourite in the virtualisation stakes is VirtualBox. Cross platform and free. Works great with Ubuntu inside. A killer combination capable of solving many of these sorts of problems.

The Importance of Profiling

I’ve worked as a software developer and worked with teams of software developers for around 10 years now, Many of those whom I’ve worked with have earned my trust and respect in relation to development and testing techniques. Frustratingly however it’s still with irritating regularity that I hear throw-away comments bourne of uncertainty and ignorance.

A couple of times now I’ve specifically been told that “GD makes my code go slow”. Now for those of you not in the know GD (actually specifically Lincoln Stein’s GD.pm in perl) is a wrapper around Tom Boutell’s most marvellous libgd graphics library. The combination of these two has always performed excellently for me and never been the bottleneck in any of my applications. The applications in question are usually database-backed web applications with graphics components for plotting genomic features or charts of one sort or another.

As any database-application developer will tell you, the database, or network connection to the database is almost always the bottleneck in an application or service. Great efforts are made to ensure database services scale well and perform as efficiently as possible, but even after these improvements are made they usually simply delay the inevitable.

Hence my frustration when I hear that “GD is making my (database) application go slow”. How? Where? Why? Where’s the proof? It’s no use blaming something, a library in this case, that’s out of your control. It’s hard to believe a claim like that without some sort of measurement.

So.. before pointing the finger, profile the code and make an effort to understand what the profiler is doing. In database applications profile your queries – use EXPLAIN, add indices, record SQL transcripts and time the results. Then profile the code which is manipulating those results.

Once the results are in of course, concentrate in the first instance on the parts with the most impact (e.g. 0.1 second off each iteration of a 1000x loop rather than 1 second from /int main/ ) – the low hanging fruit. Good programmers should be relatively lazy and speeding up code with the least amount of effort should be commonsense.