Haplotype Consensus Clustering

Way back in the annals of history (2002) I wrote a bit of code to perform haplotype groupings for early Ensembl-linked data. Like my recent kmer scanner example, it used one of my favourite bits of Perl – the regex engine. I dug the old script out of an backup and it was, as you’d expect, completely horrible. So for fun I gave it a makeover this evening, in-between bits of Silent Witness.

This is what it looks like now. Down to 52 lines of code from 118 lines in the 2002 version. I guess the last 10 years have made me a little over twice as concise.

#!/usr/local/bin/perl -T
use strict;
use warnings;

#########
# read everything in
#
my $in = [];
while(my $line = <>) {
  chomp $line;
  if(!$line) {
    next;
  }

  #########
  # build regex pattern
  #
  $line =~ s{[X]}{.}smxg;

  #########
  # store
  #
  push @{$in}, uc $line;
}

my $consensii = {};

#########
# iterate over inputs
#
SEQ: for my $seq (sort { srt($a, $b) } @{$in}) {
  #########
  # iterate over consensus sequences so far
  #
  for my $con (sort { srt($a, $b) } keys %{$consensii}) {
    if($seq =~ /^$con$/smx ||
       $con =~ /^$seq$/smx) {
      #########
      # if input matches consensus, store & jump to next sequence
      #
      push @{$consensii->{$con}}, $seq;
      next SEQ;
    }
  }

  #########
  # if no match was found, create a new consensus container
  #
  $consensii->{$seq} = [$seq];
}

#########
# custom sort routine
# - firstly sort by sequence length
# - secondly sort by number of "."s (looseness)
#
sub srt {
  my ($x, $y) = @_;
  my $lx = length $x;
  my $ly = length $y;

  if($lx < $ly) {
    return -1;
  }
  if($ly > $lx) {
    return 1;
  }

  my $nx = $x =~ tr{.}{.};
  my $ny = $y =~ tr{.}{.};

  return $nx < => $ny;
}

#########
# tally and print everything out
#
while(my ($k, $v) = each %{$consensii}) {
  $k =~ s/[.]/X/sxmg;
  print $k, " [", (scalar @{$v}) ,"]\n";
  for my $m (@{$v}) {
    $m =~ s/[.]/X/sxmg;
    print "  $m\n";
  }
}

The input file looks something like this:

ACTGXTGC
ACTGATGC
ACTGTTGC
ACTGCTGC
ACTGGTGC
ACTXCTGC
ACXGCTGC
ACTGXTGC
CTGCTGC
CTGGTGC
CTXCTGC
CXGCTGC
CTGXTGC

ACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTG
ACTGXTGACTGACTG
ACTGACTGACTXACTG
ACXTGACTGACTGACTG

and the output looks a little like this – consensus [number of sequences] followed by an indented list of matching sequences:

elwood:~/dev rmp$ ./haplotype-sort < haplotype-in.txt 
ACTGXTGACTGACTG [1]
  ACTGXTGACTGACTG
ACTGATGC [1]
  ACTGATGC
CTGCTGC [4]
  CTGCTGC
  CTXCTGC
  CXGCTGC
  CTGXTGC
ACTGCTGC [5]
  ACTGCTGC
  ACTGXTGC
  ACTXCTGC
  ACXGCTGC
  ACTGXTGC
ACTGACTGACTGACTGACTG [2]
  ACTGACTGACTGACTGACTG
  ACTGACTGACTGACTGACTG
ACTGTTGC [1]
  ACTGTTGC
CTGGTGC [1]
  CTGGTGC
ACTGACTGACTXACTG [1]
  ACTGACTGACTXACTG
ACXTGACTGACTGACTG [1]
  ACXTGACTGACTGACTG
ACTGGTGC [1]
  ACTGGTGC