These are my links for December 12th through February 18th:
Tag: clustering
Haplotype Consensus Clustering
Way back in the annals of history (2002) I wrote a bit of code to perform haplotype groupings for early Ensembl-linked data. Like my recent kmer scanner example, it used one of my favourite bits of Perl – the regex engine. I dug the old script out of an backup and it was, as you’d expect, completely horrible. So for fun I gave it a makeover this evening, in-between bits of Silent Witness.
This is what it looks like now. Down to 52 lines of code from 118 lines in the 2002 version. I guess the last 10 years have made me a little over twice as concise.
#!/usr/local/bin/perl -T
use strict;
use warnings;
#########
# read everything in
#
my $in = [];
while(my $line = <>) {
chomp $line;
if(!$line) {
next;
}
#########
# build regex pattern
#
$line =~ s{[X]}{.}smxg;
#########
# store
#
push @{$in}, uc $line;
}
my $consensii = {};
#########
# iterate over inputs
#
SEQ: for my $seq (sort { srt($a, $b) } @{$in}) {
#########
# iterate over consensus sequences so far
#
for my $con (sort { srt($a, $b) } keys %{$consensii}) {
if($seq =~ /^$con$/smx ||
$con =~ /^$seq$/smx) {
#########
# if input matches consensus, store & jump to next sequence
#
push @{$consensii->{$con}}, $seq;
next SEQ;
}
}
#########
# if no match was found, create a new consensus container
#
$consensii->{$seq} = [$seq];
}
#########
# custom sort routine
# - firstly sort by sequence length
# - secondly sort by number of "."s (looseness)
#
sub srt {
my ($x, $y) = @_;
my $lx = length $x;
my $ly = length $y;
if($lx < $ly) {
return -1;
}
if($ly > $lx) {
return 1;
}
my $nx = $x =~ tr{.}{.};
my $ny = $y =~ tr{.}{.};
return $nx < => $ny;
}
#########
# tally and print everything out
#
while(my ($k, $v) = each %{$consensii}) {
$k =~ s/[.]/X/sxmg;
print $k, " [", (scalar @{$v}) ,"]\n";
for my $m (@{$v}) {
$m =~ s/[.]/X/sxmg;
print " $m\n";
}
}
The input file looks something like this:
ACTGXTGC
ACTGATGC
ACTGTTGC
ACTGCTGC
ACTGGTGC
ACTXCTGC
ACXGCTGC
ACTGXTGC
CTGCTGC
CTGGTGC
CTXCTGC
CXGCTGC
CTGXTGC
ACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTG
ACTGXTGACTGACTG
ACTGACTGACTXACTG
ACXTGACTGACTGACTG
and the output looks a little like this – consensus [number of sequences] followed by an indented list of matching sequences:
elwood:~/dev rmp$ ./haplotype-sort < haplotype-in.txt
ACTGXTGACTGACTG [1]
ACTGXTGACTGACTG
ACTGATGC [1]
ACTGATGC
CTGCTGC [4]
CTGCTGC
CTXCTGC
CXGCTGC
CTGXTGC
ACTGCTGC [5]
ACTGCTGC
ACTGXTGC
ACTXCTGC
ACXGCTGC
ACTGXTGC
ACTGACTGACTGACTGACTG [2]
ACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTG
ACTGTTGC [1]
ACTGTTGC
CTGGTGC [1]
CTGGTGC
ACTGACTGACTXACTG [1]
ACTGACTGACTXACTG
ACXTGACTGACTGACTG [1]
ACXTGACTGACTGACTG
ACTGGTGC [1]
ACTGGTGC
Bookmarks for October 7th through October 13th
These are my links for October 7th through October 13th:
- Sector/Sphere: High Performance Distributed Data Storage and Processing –
- pdf417_encode | freshmeat.net –
- London Perl Workshop –
- pwauth – Project Hosting on Google Code – use with mod_authnz_external for pam authentication e.g. svn+https+dav+authnz_external+pwauth+pam+winbind+active directory
- mod-auth-external – Project Hosting on Google Code –