naïve kmer scanner

Another bit of fun, basically the opposite of yesterday’s post, here we’re detecting the number of unique kmers present in a sequence. It’s easy to do this with an iterating substr approach but I like Perl’s regex engine a lot so I wanted to do it using that. Okay, I wanted to do it entirely in one /e regex but it’s slightly trickier and a lot less clear manipulating pos inside a /e substitution function.

#!/usr/local/bin/perl
use strict;
use warnings;

my $str   = q[AAACAATAAGAAGCACCATCAGTACTATTAGGACGATGAGGCCCTCCGCTTCTGCGTCGGTTTGTGGG];
my $k     = 3;
my $match = q[\s*[ACTG]\s*]x$k;
my $seen  = {};

while($str =~ m{($match)}smxgi) {
  my $m = $1;
  $m    =~ s/\s*//smxg;

  $seen->{$m}++;

  pos $str = (pos $str) - $k + 1;
}

{
  local $, = "\n";
  print sort keys %{$seen};
}

printf "\n%d unique ${k}mers\n", scalar keys %{$seen};

$k is the size of the kmers we’re looking for. In this case 3, as we were generating yesterday.
$match attempts to take care of matches across newlines, roughly what one might find inside a FASTA. YMMV.
$seen keeps track of uniques we’ve encountered so far in $str.

The while loop iterates through matches found by the regex engine and pos, a function you don’t see too often, resets the start position for the next match, in this case to the current position minus 1 less than the length of the match (pos – k + 1).

The output looks something like this:


elwood:~/dev rmp$ ./kmers 
AAA
AAC
AAG
AAT
ACA
ACC
ACG
ACT
AGA
AGC
AGG
AGT
ATA
ATC
ATG
ATT
CAA
CAC
CAG
CAT
CCA
CCC
CCG
CCT
CGA
CGC
CGG
CGT
CTA
CTC
CTG
CTT
GAA
GAC
GAG
GAT
GCA
GCC
GCG
GCT
GGA
GGC
GGG
GGT
GTA
GTC
GTG
GTT
TAA
TAC
TAG
TAT
TCA
TCC
TCG
TCT
TGA
TGC
TGG
TGT
TTA
TTC
TTG
TTT
64 unique 3mers

If I were really keen I’d make use this in a regression test for yesterday’s toy.

One thought on “naïve kmer scanner”

Comments are closed.