A Little SQL “EXPLAIN” example

http://www.flickr.com/photos/banyan_tree/2802823634/sizes/m/

Background: I have a big table called “channel” with a few hundred thousand rows in – nothing vast, but big enough to cause some queries to run slower than I want.

Today I was fixing something else and happened to run

show full processlist;

I noticed this taking too long:

SELECT payload FROM channel  LIMIT 398800,100;

This is a query used by some web-app paging code. Stupid really – payload isn’t indexed and there’s no use of any other keys in that query. Ok – how to improve it? First of all, see what EXPLAIN says:

mysql> explain SELECT payload FROM channel  LIMIT 398800,100;
+----+-------------+---------+------+---------------+------+---------+------+--------+-------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows   | Extra |
+----+-------------+---------+------+---------------+------+---------+------+--------+-------+
|  1 | SIMPLE      | channel | ALL  | NULL          | NULL | NULL    | NULL | 721303 |       |
+----+-------------+---------+------+---------------+------+---------+------+--------+-------+
1 row in set (0.00 sec)

Ok. So the simplest way would be to limit a selection of id_channel (the primary key) then select payloads in that set. First I tried this:

SELECT payload
FROM channel
WHERE id_channel IN (
    SELECT id_channel FROM channel LIMIT 398800,100
);

Seems straightforward, right? No, not really.

ERROR 1235 (42000): This version of MySQL doesn't yet support
  'LIMIT & IN/ALL/ANY/SOME subquery'

Next!

Second attempt, using a temporary table, selecting and saving the id_channels I’m interested in then using those in the actual query:

CREATE TEMPORARY TABLE channel_tmp(
  id_channel BIGINT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=innodb;

INSERT INTO channel_tmp(id_channel)
  SELECT id_channel
  FROM channel LIMIT 398800,100;

SELECT payload
  FROM channel
  WHERE id_channel IN (
    SELECT id_channel FROM channel_tmp
  );
mysql> explain select id_channel from channel limit 398800,100;
+----+-------------+---------+-------+---------------+---------+---------+------+--------+-------------+
| id | select_type | table   | type  | possible_keys | key     | key_len | ref  | rows   | Extra       |
+----+-------------+---------+-------+---------------+---------+---------+------+--------+-------------+
|  1 | SIMPLE      | channel | index | NULL          | PRIMARY | 8       | NULL | 722583 | Using index |
+----+-------------+---------+-------+---------------+---------+---------+------+--------+-------------+
1 row in set (0.00 sec)
mysql> explain select payload from channel where id_channel in (select id_channel from channel_tmp);
+----+--------------+-------------+-------------+---------------+---------+---------+------+--------+-------------+
| id | select_type  | table       | type            | poss_keys | key     | key_len | ref  | rows   | Extra       |
+----+--------------+-------------+-------------+---------------+---------+---------+------+--------+-------------+
|  1 | PRIMARY      | channel     | ALL             | NULL      | NULL    | NULL    | NULL | 722327 | Using where |
|  2 | DEP SUBQUERY | channel_tmp | unique_subquery | PRIMARY   | PRIMARY | 8       | func |      1 | Using index |
+----+--------------+-------------+-------------+---------------+---------+---------+------+--------+-------------+
2 rows in set (0.00 sec)

Let’s try a self-join doing all of the above without explicitly making a temporary table. Self-joins can be pretty powerful – neat in the right places..

mysql> explain SELECT payload
  FROM channel c1,
       (SELECT id_channel FROM channel limit 398800,100) c2
  WHERE c1.id_channel=c2.id_channel;
+----+-------------+------------+--------+---------------+---------+---------+---------------+--------+-------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref           | rows   | Extra       |
+----+-------------+------------+--------+---------------+---------+---------+---------------+--------+-------------+
|  1 | PRIMARY     | derived2   | ALL    | NULL          | NULL    | NULL    | NULL          |    100 |             |
|  1 | PRIMARY     | c1         | eq_ref | PRIMARY       | PRIMARY | 8       | c2.id_channel |      1 |             |
|  2 | DERIVED     | channel    | index  | NULL          | PRIMARY | 8       | NULL          | 721559 | Using index |
+----+-------------+------------+--------+---------------+---------+---------+---------------+--------+-------------+
3 rows in set (0.21 sec)

This pulls out the right rows and even works around the “no limit in subselect” unsupported mysql feature but that id_channel selection in c2 still isn’t quite doing the right thing – I don’t like all the rows being returned, even if they’re coming straight out of the primary key index.

A little bit of rudimentary benchmarking appears to suggest that the self-join is the fastest, followed by the original query at approximately one order of magnitude slower and trailing a long way behind at around another four-times slower than that, the temporary table. I’m not sure how or why the temporary table performance happens to be the slowest – perhaps down to storage access, or more likely my lack of understanding. Some time I might even try the in-memory table too for comparison.