{"id":584,"date":"2012-05-07T18:30:12","date_gmt":"2012-05-07T18:30:12","guid":{"rendered":"http:\/\/psyphi.net\/blog\/?p=584"},"modified":"2012-05-06T21:59:46","modified_gmt":"2012-05-06T21:59:46","slug":"haplotype-consensus-clustering","status":"publish","type":"post","link":"https:\/\/psyphi.net\/blog\/2012\/05\/haplotype-consensus-clustering\/","title":{"rendered":"Haplotype Consensus Clustering"},"content":{"rendered":"<p>Way back in the annals of history (2002) I wrote a bit of code to perform haplotype groupings for early Ensembl-linked data. Like my recent <a href=\"http:\/\/psyphi.net\/blog\/2012\/05\/naive-kmer-scanner\/\" title=\"na\u00c3\u00afve kmer scanner\">kmer scanner<\/a> example, it used one of my favourite bits of Perl &#8211; the regex engine. I dug the old script out of an backup and it was, as you&#8217;d expect, completely horrible. So for fun I gave it a makeover this evening, in-between bits of Silent Witness.<\/p>\n<p>This is what it looks like now. Down to 52 lines of code from 118 lines in the 2002 version. I guess the last 10 years have made me a little over twice as concise.<\/p>\n<pre><code>#!\/usr\/local\/bin\/perl -T\r\nuse strict;\r\nuse warnings;\r\n\r\n#########\r\n# read everything in\r\n#\r\nmy $in = [];\r\nwhile(my $line = &lt;&gt;) {\r\n  chomp $line;\r\n  if(!$line) {\r\n    next;\r\n  }\r\n\r\n  #########\r\n  # build regex pattern\r\n  #\r\n  $line =~ s{[X]}{.}smxg;\r\n\r\n  #########\r\n  # store\r\n  #\r\n  push @{$in}, uc $line;\r\n}\r\n\r\nmy $consensii = {};\r\n\r\n#########\r\n# iterate over inputs\r\n#\r\nSEQ: for my $seq (sort { srt($a, $b) } @{$in}) {\r\n  #########\r\n  # iterate over consensus sequences so far\r\n  #\r\n  for my $con (sort { srt($a, $b) } keys %{$consensii}) {\r\n    if($seq =~ \/^$con$\/smx ||\r\n       $con =~ \/^$seq$\/smx) {\r\n      #########\r\n      # if input matches consensus, store &amp; jump to next sequence\r\n      #\r\n      push @{$consensii-&gt;{$con}}, $seq;\r\n      next SEQ;\r\n    }\r\n  }\r\n\r\n  #########\r\n  # if no match was found, create a new consensus container\r\n  #\r\n  $consensii-&gt;{$seq} = [$seq];\r\n}\r\n\r\n#########\r\n# custom sort routine\r\n# - firstly sort by sequence length\r\n# - secondly sort by number of \".\"s (looseness)\r\n#\r\nsub srt {\r\n  my ($x, $y) = @_;\r\n  my $lx = length $x;\r\n  my $ly = length $y;\r\n\r\n  if($lx &lt; $ly) {\r\n    return -1;\r\n  }\r\n  if($ly &gt; $lx) {\r\n    return 1;\r\n  }\r\n\r\n  my $nx = $x =~ tr{.}{.};\r\n  my $ny = $y =~ tr{.}{.};\r\n\r\n  return $nx &lt; =&gt; $ny;\r\n}\r\n\r\n#########\r\n# tally and print everything out\r\n#\r\nwhile(my ($k, $v) = each %{$consensii}) {\r\n  $k =~ s\/[.]\/X\/sxmg;\r\n  print $k, \" [\", (scalar @{$v}) ,\"]\\n\";\r\n  for my $m (@{$v}) {\r\n    $m =~ s\/[.]\/X\/sxmg;\r\n    print \"  $m\\n\";\r\n  }\r\n}<\/code><\/pre>\n<p>The input file looks something like this:<\/p>\n<pre><code>ACTGXTGC\r\nACTGATGC\r\nACTGTTGC\r\nACTGCTGC\r\nACTGGTGC\r\nACTXCTGC\r\nACXGCTGC\r\nACTGXTGC\r\nCTGCTGC\r\nCTGGTGC\r\nCTXCTGC\r\nCXGCTGC\r\nCTGXTGC\r\n\r\nACTGACTGACTGACTGACTG\r\nACTGACTGACTGACTGACTG\r\nACTGXTGACTGACTG\r\nACTGACTGACTXACTG\r\nACXTGACTGACTGACTG<\/code><\/pre>\n<p>and the output looks a little like this &#8211; consensus [number of sequences] followed by an indented list of matching sequences:<\/p>\n<pre><code>elwood:~\/dev rmp$ .\/haplotype-sort &lt; haplotype-in.txt \r\nACTGXTGACTGACTG [1]\r\n  ACTGXTGACTGACTG\r\nACTGATGC [1]\r\n  ACTGATGC\r\nCTGCTGC [4]\r\n  CTGCTGC\r\n  CTXCTGC\r\n  CXGCTGC\r\n  CTGXTGC\r\nACTGCTGC [5]\r\n  ACTGCTGC\r\n  ACTGXTGC\r\n  ACTXCTGC\r\n  ACXGCTGC\r\n  ACTGXTGC\r\nACTGACTGACTGACTGACTG [2]\r\n  ACTGACTGACTGACTGACTG\r\n  ACTGACTGACTGACTGACTG\r\nACTGTTGC [1]\r\n  ACTGTTGC\r\nCTGGTGC [1]\r\n  CTGGTGC\r\nACTGACTGACTXACTG [1]\r\n  ACTGACTGACTXACTG\r\nACXTGACTGACTGACTG [1]\r\n  ACXTGACTGACTGACTG\r\nACTGGTGC [1]\r\n  ACTGGTGC<\/code><\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Way back in the annals of history (2002) I wrote a bit of code to perform haplotype groupings for early Ensembl-linked data. Like my recent kmer scanner example, it used one of my favourite bits of Perl &#8211; the regex engine. I dug the old script out of an backup and it was, as you&#8217;d &hellip; <a href=\"https:\/\/psyphi.net\/blog\/2012\/05\/haplotype-consensus-clustering\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Haplotype Consensus Clustering&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[11],"tags":[515,781,21],"class_list":["post-584","post","type-post","status-publish","format-standard","hentry","category-programming","tag-clustering","tag-haplotype","tag-perl"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/posts\/584","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/comments?post=584"}],"version-history":[{"count":2,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/posts\/584\/revisions"}],"predecessor-version":[{"id":586,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/posts\/584\/revisions\/586"}],"wp:attachment":[{"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/media?parent=584"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/categories?post=584"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/tags?post=584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}