programming - psyphi.net blog

A lot of what I do day-to-day is related to optimisation. Be it Perl code, SQL queries, Javascript or HTML there are usually at least a couple of cracking examples I find every week. On Friday I came across this:

SELECT cycle FROM goldcrest WHERE id_run = ?

This query is being used to find the number of the latest cycles (between 1 and 37 for each id_run) in a near-real-time tracking system and is used several times whenever a run report is viewed.

EXPLAIN SELECT cycle FROM goldcrest WHERE id_run = 231;
  
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+
| id | select_type | table Â  Â  | type | possible_keys | key Â  Â  | key_len | ref Â  | rows Â  | Extra Â  Â  Â  |
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+
| Â 1 | SIMPLE Â  Â  Â | goldcrest | ref Â | g_idrun Â  Â  Â  | g_idrun | Â  Â  Â  8 | const | 262792 | Using where |
+----+-------------+-----------+------+---------------+---------+---------+-------+--------+-------------+

In itself this would be fine but the goldcrest table in this instance contains several thousand rows for each id_run. So, for id_run, let’s say, 231 this query happens to return approximately 588,000 rows to determine that the latest cycle for run 231 is the number 34.

To clean this up we first try something like this:

SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = ?

which still scans the 588000 rows (keyed on id_run incidentally) but doesn’t actually return them to the user, only one row containing both values we’re interested in. Fair enough, the CPU and disk access penalties are similar but the data transfer penalty is significantly improved.

Next I try adding an index against the id_run and cycle columns:

ALTER TABLE goldcrest ADD INDEX(id_run,cycle);
Query OK, 37589514 rows affected (23 min 6.17 sec)
Records: 37589514 Â Duplicates: 0 Â Warnings: 0

Now this of course takes a long time and, because the tuples are fairly redundant, creates a relatively inefficient index, also penalising future INSERTs. However, casually ignoring those facts, our query performance is now radically different:

EXPLAIN SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = 231;
  
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key Â | key_len | ref Â | rows | Extra Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| Â 1 | SIMPLE Â  Â  Â | NULL Â | NULL | NULL Â  Â  Â  Â  Â | NULL | Â  Â NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
  
SELECT MIN(cycle),MAX(cycle) FROM goldcrest WHERE id_run = 231;
+------------+------------+
| MIN(cycle) | MAX(cycle) |
+------------+------------+
| Â  Â  Â  Â  Â 1 | Â  Â  Â  Â  37 |
+------------+------------+
  
1 row in set (0.01 sec)

That looks a lot better to me now!

Generally I try to steer clear of the mysterious internal workings of database engines, but with much greater frequency come across examples like this:

sub clone_type {
  my ($self, $clone_type, $clone) = @_;
  my %clone_type;

  if($clone and $clone_type) {
    $clone_type{$clone} = $clone_type;
    return $clone_type{$clone};
  }

  return;
}

Thankfully this one’s pretty quick to figure out – they’re usually *much* more convoluted, but still.. Huh??

Pass in a clone_type scalar, create a local hash with the same name (Argh!), store the clone_type scalar in the hash keyed at position $clone, then return the same value we just stored.

I don’t get it… maybe a global hash or something else would make sense, but this works out the same:

sub clone_type {
  my ($self, $clone_type, $clone) = @_;

  if($clone and $clone_type) {
    return $clone_type;
  }
  return;
}

and I’m still not sure why you’d want to do that if you have the values on the way in already.

Programmers really need to think around the problem, not just through it. Thinking through may result in functionality but thinking around results in both function and performance which means a whole lot more in my book, and incidentally, why it seems so hard to hire good programmers.

All the roles I’ve had during my time at Sanger have more or less required the development of production quality Perl code, usually OO and increasingly using MVC patterns. Why is it then that very nearly every Perl developer I’ve interviewed in the past 8 years is woefully lacking, specifically in OO Perl but more generally in half-decent programming skills?

It’s been astonishing, not in a good way, how many have been unable to demonstrate use of hashes. Some have been too scared of them (their words, not mine) and some have never felt the need. For those of you who aren’t Perl programmers, hashes (aka associative arrays) are a pretty crucial feature of the language and fundamental to its OO implementation.

Now I program in Perl sometimes more than 7-8 hours a day. For many years this also involved reworking other people’s code. I can very easily say that if you claim to be a Perl programmer and have never used hashes then you’re not going to get a Perl-related job because of your technical skills. With a good, interactive and engaging personality and a desire for self-improvement you might get away with it, but certainly not on technical merit.

It’s also quite worrying how many of these interviewees are unable to describe the basics of object-oriented programming yet have, for example, developed and sold a commercial ERP system, presumably for big bucks. Man, these people must have awesome marketing!

Frankly a number of the bioinformaticians already working there have similar skills to the interviewees and often worse communication skills, so maybe I’m simply setting my standards too high.

I really hope this situation improves when Perl 6 goes public though I’m sure it’ll take longer to become common parlance. As long as it happens before those smug RoR types take over the world I’ll be happy ;)

Tag: programming

Great pieces of code

Hiring Perl Developers – how hard can it be?