{"id":581,"date":"2012-05-05T17:21:19","date_gmt":"2012-05-05T17:21:19","guid":{"rendered":"http:\/\/psyphi.net\/blog\/?p=581"},"modified":"2012-05-06T22:22:10","modified_gmt":"2012-05-06T22:22:10","slug":"naive-kmer-scanner","status":"publish","type":"post","link":"https:\/\/psyphi.net\/blog\/2012\/05\/naive-kmer-scanner\/","title":{"rendered":"na\u00efve kmer scanner"},"content":{"rendered":"<p>Another bit of fun, basically the opposite of <a href=\"http:\/\/psyphi.net\/blog\/2012\/05\/naive-kmer-sequence-generator\/\" title=\"na\u00efve kmer sequence generator\">yesterday&#8217;s post<\/a>, here we&#8217;re detecting the number of unique kmers present in a sequence. It&#8217;s easy to do this with an iterating substr approach but I like Perl&#8217;s regex engine a lot so I wanted to do it using that. Okay, I wanted to do it entirely in one \/e regex but it&#8217;s slightly trickier and a lot less clear manipulating pos inside a \/e substitution function.<\/p>\n<pre><code>#!\/usr\/local\/bin\/perl\r\nuse strict;\r\nuse warnings;\r\n\r\nmy $str   = q[AAACAATAAGAAGCACCATCAGTACTATTAGGACGATGAGGCCCTCCGCTTCTGCGTCGGTTTGTGGG];\r\nmy $k     = 3;\r\nmy $match = q[\\s*[ACTG]\\s*]x$k;\r\nmy $seen  = {};\r\n\r\nwhile($str =~ m{($match)}smxgi) {\r\n  my $m = $1;\r\n  $m    =~ s\/\\s*\/\/smxg;\r\n\r\n  $seen-&gt;{$m}++;\r\n\r\n  pos $str = (pos $str) - $k + 1;\r\n}\r\n\r\n{\r\n  local $, = \"\\n\";\r\n  print sort keys %{$seen};\r\n}\r\n\r\nprintf \"\\n%d unique ${k}mers\\n\", scalar keys %{$seen};<\/code><\/pre>\n<p>$k is the size of the kmers we&#8217;re looking for. In this case 3, as we were generating yesterday.<br \/>\n$match attempts to take care of matches across newlines, roughly what one might find inside a FASTA. 
YMMV.<br \/>\n$seen keeps track of uniques we&#8217;ve encountered so far in $str.<\/p>\n<p>The while loop iterates through matches found by the regex engine. pos, a function you don&#8217;t see too often, then resets the start position for the next match, in this case to one character past the start of the current match (pos &#8211; k + 1), so that consecutive kmers overlap. This assumes the match contains no whitespace; a match spanning a newline is longer than k, so the rewind falls short and overlapping kmers can be skipped.<\/p>\n<p>The output looks something like this:<\/p>\n<pre><code>\r\nelwood:~\/dev rmp$ .\/kmers \r\nAAA\r\nAAC\r\nAAG\r\nAAT\r\nACA\r\nACC\r\nACG\r\nACT\r\nAGA\r\nAGC\r\nAGG\r\nAGT\r\nATA\r\nATC\r\nATG\r\nATT\r\nCAA\r\nCAC\r\nCAG\r\nCAT\r\nCCA\r\nCCC\r\nCCG\r\nCCT\r\nCGA\r\nCGC\r\nCGG\r\nCGT\r\nCTA\r\nCTC\r\nCTG\r\nCTT\r\nGAA\r\nGAC\r\nGAG\r\nGAT\r\nGCA\r\nGCC\r\nGCG\r\nGCT\r\nGGA\r\nGGC\r\nGGG\r\nGGT\r\nGTA\r\nGTC\r\nGTG\r\nGTT\r\nTAA\r\nTAC\r\nTAG\r\nTAT\r\nTCA\r\nTCC\r\nTCG\r\nTCT\r\nTGA\r\nTGC\r\nTGG\r\nTGT\r\nTTA\r\nTTC\r\nTTG\r\nTTT\r\n64 unique 3mers<\/code><\/pre>\n<p>If I were really keen I&#8217;d make use of this in a regression test for yesterday&#8217;s toy.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Another bit of fun, basically the opposite of yesterday&#8217;s post, here we&#8217;re detecting the number of unique kmers present in a sequence. It&#8217;s easy to do this with an iterating substr approach but I like Perl&#8217;s regex engine a lot so I wanted to do it using that. 
Okay, I wanted to do it entirely &hellip; <a href=\"https:\/\/psyphi.net\/blog\/2012\/05\/naive-kmer-scanner\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;na\u00efve kmer scanner&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[11],"tags":[21,778],"class_list":["post-581","post","type-post","status-publish","format-standard","hentry","category-programming","tag-perl","tag-sequence"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/posts\/581","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/comments?post=581"}],"version-history":[{"count":4,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/posts\/581\/revisions"}],"predecessor-version":[{"id":587,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/posts\/581\/revisions\/587"}],"wp:attachment":[{"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/media?parent=581"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/categories?post=581"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/psyphi.net\/blog\/wp-json\/wp\/v2\/tags?post=581"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}