Another use for Selenium IDE

A dear friend of mine recently needed to recover all the email from his mailbox. Normally this wouldn’t be a problem: there are plenty of options in any sane mail application – export or archive the mailbox, select all messages and “Send Again”/Redirect/Bounce to another address or, at the very worst, select all and forward. Most of these options are available in the usual mail applications – Pine, Squirrelmail, IMP, Outlook, Outlook Express, Windows Mail, Mail.app, Thunderbird, Eudora and I’m sure loads of others.

Unfortunately the only access provided was through Microsoft’s Outlook Web Access (2007). This, whilst being fairly pretty in Lite (non-Internet Explorer browsers) mode and prettier/heavier in MSIE, does not have any useful bulk forwarding or export functionality at all. None. Not desperately handy, to be sure.

Ok, so my first port of call was to connect Mail.app, which supports Exchange OWA access. No dice – spinning, hanging, no data. Hmm – odd. Second, I tried fetchExc, a Java command-line tool which promised everything I needed but in the end delivered pretty obtuse error messages. After an hour’s fiddling I gave up on fetchExc and fell back to Perl with Email::Folder::Exchange. This had very similar results to fetchExc but a slightly different set of errors.

After much swearing and a lot more poking, probing and requesting of tips from other friends (thanks Ze), the OWA service was found to be sitting behind Microsoft’s Internet Security and Acceleration (ISA) server. This isn’t a product I’ve used before but I can only assume it’s an expensive reverse proxy, the sort of thing I’d compare to inexpensive Apache + mod_proxy + mod_security on a good day. This ISA service happened to block all remote WebDAV (Exchange 2000/2003) and SOAP (Exchange 2007/2010) access too. Great! No remote service access whatsoever.

Brute force to the rescue. I could, of course, go in and manually forward each and every last mail, but that’s tedious: a huge amount of clicking and pasting in the same email address. Enter Selenium IDE.

Selenium is a suite of tools for remote controlling browsers, primarily for writing tests for interactive applications. I use it in my day to day work mostly for checking bits of dynamic javascript, DHTML, forms etc. are doing the right things when clicked/pressed/dragged and generally interacted with. OWA is just a (really badly written) webpage to interact with, after all.

I downloaded the excellent sideflow.js plugin, which provides loop functionality not usually required for web app testing, and after a bit of DOM inspection on the OWA pages I came up with the following plan –

  • click the subject link
  • click the “forward” button
  • enter the recipient address
  • click the send button
  • select the checkbox
  • press the “delete” button
  • repeat 500 times

The macro looked something like this:

<table cellpadding="1" cellspacing="1" border="1">
<thead>
<tr><td rowspan="1" colspan="3">owa-selenium-macro</td></tr>
</thead><tbody>
<tr>
	<td>getEval</td>
	<td>index=0</td>
	<td></td>
</tr>
<tr>
	<td>while</td>
	<td>index&lt;500</td>
	<td></td>
</tr>
<tr>
	<td>storeEval</td>
	<td>index</td>
	<td>value</td>
</tr>
<tr>
	<td>echo</td>
	<td>${value}</td>
	<td></td>
</tr>
<tr>
	<td>clickAndWait</td>
	<td>//table[1]/tbody/tr[2]/td[3]/table/tbody/tr[2]/td/div/table//tbody/tr[3]/td[6]/h1/a</td>
	<td></td>
</tr>
<tr>
	<td>clickAndWait</td>
	<td>id=lnkHdrforward</td>
	<td></td>
</tr>
<tr>
	<td>type</td>
	<td>id=txtto</td>
	<td>newaddress@gmail.com</td>
</tr>
<tr>
	<td>clickAndWait</td>
	<td>id=lnkHdrsend</td>
	<td></td>
</tr>
<tr>
	<td>click</td>
	<td>name=chkmsg</td>
	<td></td>
</tr>
<tr>
	<td>clickAndWait</td>
	<td>id=lnkHdrdelete</td>
	<td></td>
</tr>
<tr>
	<td>getEval</td>
	<td>index++</td>
	<td></td>
</tr>
<tr>
	<td>endWhile</td>
	<td></td>
	<td></td>
</tr>
</tbody></table>

So I logged in, opened each folder in turn and replayed the macro in Selenium IDE as many times as I needed to. Bingo! Super kludgy but it worked well, was entertaining to watch and ultimately did the job.

SVN Server Integration with HTTPS, Active Directory, PAM & Winbind

[Image: Subversion on a whiteboard. CC by johntrainor]
In this post I’d like to explain how to integrate SVN (Subversion) source control over WebDAV and HTTPS, using Apache and Active Directory to provide authentication and access control.

It’s generally accepted that SVN over WebDAV/HTTPS provides finer-grained security controls than SVN+SSH. The problem is that SVN+SSH is really easy to set up, requiring knowledge of svnadmin and the filesystem and very little else, whereas WebDAV+HTTPS requires knowledge of Apache and its modules for WebDAV, authentication and authorisation, which is quite a lot more to ask. Add to that authenticating to AD and you have yourself a lovely string of delicate single-point-of-failure components. Ho-hum, there’s not a huge amount you can do about that but at least the Apache components are pretty robust.

For this article I’m using CentOS but everything should be transferable to any distribution with a little tweakage.

Repository Creation

Firstly then, pick a disk or volume with plenty of space and make your repository, the same as you would for svn+ssh (we’re using /var/svn/repos here):

svnadmin create /var/svn/repos

Apache Modules

Install the prerequisite Apache modules:

yum install mod_dav_svn

This should also install mod_authz_svn which we’ll also be making use of. Both should end up in Apache’s module directory, in this case /etc/httpd/modules/

Download and install mod_authnz_external from its Google Code page. This allows Apache basic authentication to hook into an external authentication mechanism. mod_authnz_external.so should end up in Apache’s module directory but in my case it ended up in its default location of /usr/lib/httpd/modules/.

Download and install the companion pwauth utility from its Google Code page. In my case it installs to /usr/local/sbin/pwauth and needs setuid permissions (granted using chmod +s).

Apache Configuration (HTTP)

ServerName svn.example.com
ServerAdmin me@example.com

Listen		*:80
NameVirtualHost *:80

User		nobody
Group		nobody

LoadModule setenvif_module	modules/mod_setenvif.so
LoadModule mime_module		modules/mod_mime.so
LoadModule log_config_module	modules/mod_log_config.so
LoadModule dav_module		modules/mod_dav.so
LoadModule dav_svn_module	modules/mod_dav_svn.so
LoadModule auth_basic_module    modules/mod_auth_basic.so
LoadModule authz_svn_module	modules/mod_authz_svn.so
LoadModule authnz_external_module modules/mod_authnz_external.so

LogFormat	"%v %A:%p %h %l %u %{%Y-%m-%d %H:%M:%S}t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" clean
CustomLog	/var/log/httpd/access_log	clean

<virtualhost *:80>
	ServerName	svn.example.com

	AddExternalAuth         pwauth  /usr/local/sbin/pwauth
	SetExternalAuthMethod   pwauth  pipe

	<location / >
		DAV			svn
		SVNPath			/var/svn/repos
		AuthType		Basic
		AuthName		"SVN Repository"
		AuthzSVNAccessFile	/etc/httpd/conf/authz_svn.acl
		AuthBasicProvider	external
		AuthExternal		pwauth
		Satisfy			Any

		<limitexcept GET PROPFIND OPTIONS REPORT>
			Require valid-user
		</limitexcept>
	</location>
</virtualhost>

Network Time (NTP)

In order to join a Windows domain, accurate and synchronised time is crucial, so you’ll need to be running NTPd.

yum install ntp
chkconfig ntpd on
ntpdate ntp.ubuntu.com
service ntpd start

Samba Configuration

Here’s where AD comes in and in my experience this is by far the most unreliable service. Install and configure samba:

yum install samba
chkconfig winbind on

Edit your /etc/samba/smb.conf to pull information from AD.

[global]
	workgroup = EXAMPLE
	realm = EXAMPLE.COM
	security = ADS
	allow trusted domains = No
	use kerberos keytab = Yes
	log level = 3
	log file = /var/log/samba/%m
	max log size = 50
	printcap name = cups
	idmap backend = idmap_rid:EXAMPLE=600-20000
	idmap uid = 600-20000
	idmap gid = 600-20000
	template shell = /bin/bash
	winbind enum users = Yes
	winbind enum groups = Yes
	winbind use default domain = Yes
	winbind offline logon = yes

Join the machine to the domain – you’ll need an account with domain admin credentials to do this:

net ads join -U administrator

Check the join is behaving ok:

[root@svn conf]# net ads info
LDAP server: 192.168.100.10
LDAP server name: ad00.example.com
Realm: EXAMPLE.COM
Bind Path: dc=EXAMPLE,dc=COM
LDAP port: 389
Server time: Tue, 15 May 2012 22:44:34 BST
KDC server: 192.168.100.10
Server time offset: 130

(Re)start winbind to pick up the new configuration:

service winbind restart

PAM & nsswitch.conf

PAM needs to know where to pull its information from, so we tell it about the new winbind service in /etc/pam.d/system-auth.

#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth        required      pam_env.so
auth        sufficient    pam_unix.so nullok try_first_pass
auth        requisite     pam_succeed_if.so uid >= 500 quiet
auth        sufficient    pam_winbind.so try_first_pass
auth        required      pam_deny.so

account     required      pam_unix.so broken_shadow
account     sufficient    pam_localuser.so
account     sufficient    pam_succeed_if.so uid < 500 quiet
account     [default=bad success=ok user_unknown=ignore] pam_winbind.so
account     required      pam_permit.so

password    requisite     pam_cracklib.so try_first_pass retry=3
password    sufficient    pam_unix.so md5 shadow nullok try_first_pass use_authtok
password    sufficient    pam_winbind.so use_authtok
password    required      pam_deny.so

session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
session     [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session     required      /lib/security/pam_mkhomedir.so 
session     required      pam_unix.so
session     optional      pam_winbind.so

YMMV with PAM. It can take quite a lot of fiddling around to make it work perfectly. This obviously has an extremely close correlation to how flaky users find the authentication service. If you’re running on 64-bit you may find you need to install 64-bit versions of pam modules, e.g. mkhomedir which aren’t installed by default.

We also modify nsswitch.conf to tell other, non-pam aspects of the system where to pull information from:

passwd:     files winbind
shadow:     files winbind
group:      files winbind

To check the authentication information is coming back correctly you can use wbinfo but I like seeing data by using getent group or getent passwd. The output of these two commands will contain domain accounts if things are working correctly and only local system accounts otherwise.

External Authentication

We’re actually going to use system accounts for authentication. To stop people continuing to use svn+ssh (and thus bypassing the authorisation controls) we edit /etc/ssh/sshd_config and use AllowUsers or AllowGroups and specify all permitted users. Using AllowGroups will also provide AD group control of permitted logins but as the list is small it’s probably overkill. My sshd_config list looks a lot like this:

AllowUsers	root rmp contractor itadmin

To test external authentication run /usr/local/sbin/pwauth as below. “yay” should be displayed if things are working ok. Note the password here is displayed in clear-text:

[root@svn conf]# pwauth && echo 'yay' || echo 'nay'
rmp
mypassword
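
The same check can be scripted. Here is a minimal sketch in Python, assuming pwauth behaves as in the transcript above: it reads the login and the password as two lines on stdin and reports success through a zero exit status. The function name check_password and its default path are illustrative choices only, not part of pwauth.

```python
import subprocess

def check_password(user, password, pwauth="/usr/local/sbin/pwauth"):
    """Return True if the pwauth binary accepts the login (exit status 0)."""
    # pwauth in "pipe" mode expects the login and password on stdin,
    # one per line, which is what the interactive test above types in.
    result = subprocess.run(
        [pwauth],
        input=f"{user}\n{password}\n",
        text=True,
        check=False,  # a failed login is a normal outcome, not an exception
    )
    return result.returncode == 0
```

Calling check_password("rmp", "mypassword") on the SVN host should then give the scripted equivalent of “yay” or “nay”. The password still travels in clear text over the pipe, so the same caution applies.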

Access Controls

/etc/httpd/conf/authz_svn.acl (the file named by AuthzSVNAccessFile in the Apache configuration) is the only part which should require any modification over time. The access controls specify who is allowed to read and/or write to each svn project; in fact, as everything’s a URL now, you can arbitrarily restrict subfolders of projects too, but that’s a little OTT. The file can be arbitrarily extended and can take local and Active Directory usernames. I’m sure mod_authz_svn has full documentation about what you can and can’t put in here.

#
# Allow anonymous read access to everything by default.
#
[/]
* = r
rmp = rw

[/myproject]
rmp = rw
bob = rw

...

SSL

So far that’s all the basic components. The last piece in the puzzle is enabling SSL for Apache. I use the following /etc/httpd/httpd.conf:

ServerName svn.example.com
ServerAdmin me@example.com

Listen		*:80
NameVirtualHost *:80

User		nobody
Group		nobody

LoadModule setenvif_module	modules/mod_setenvif.so
LoadModule mime_module		modules/mod_mime.so
LoadModule log_config_module	modules/mod_log_config.so
LoadModule proxy_module		modules/mod_proxy.so
LoadModule proxy_http_module	modules/mod_proxy_http.so
LoadModule rewrite_module	modules/mod_rewrite.so
LoadModule dav_module		modules/mod_dav.so
LoadModule dav_svn_module	modules/mod_dav_svn.so
LoadModule auth_basic_module    modules/mod_auth_basic.so
LoadModule authz_svn_module	modules/mod_authz_svn.so
LoadModule ssl_module		modules/mod_ssl.so
LoadModule authnz_external_module modules/mod_authnz_external.so

Include conf.d/ssl.conf

LogFormat	"%v %A:%p %h %l %u %{%Y-%m-%d %H:%M:%S}t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" clean
CustomLog	/var/log/httpd/access_log	clean

<virtualhost *:80>
	ServerName		svn.example.com

	RewriteEngine	On
	RewriteRule	^/(.*)$	https://svn.example.com/$1	[R=permanent,L]
</virtualhost>

<virtualhost *:443>
	ServerName	svn.example.com

	AddExternalAuth         pwauth  /usr/local/sbin/pwauth
	SetExternalAuthMethod   pwauth  pipe

	SSLEngine on
	SSLProtocol all -SSLv2

	SSLCipherSuite		ALL:!ADH:!EXPORT:!SSLv2:RC4+RSA:+HIGH:+MEDIUM:+LOW
	SSLCertificateFile	/etc/httpd/conf/svn.crt
	SSLCertificateKeyFile	/etc/httpd/conf/svn.key

	<location />
		DAV			svn
		SVNPath			/var/svn/repos
		AuthType		Basic
		AuthName		"SVN Repository"
		AuthzSVNAccessFile	/etc/httpd/conf/authz_svn.acl
		AuthBasicProvider	external
		AuthExternal		pwauth
		Satisfy			Any

		<limitexcept GET PROPFIND OPTIONS REPORT>
			Require valid-user
		</limitexcept>
	</location>
</virtualhost>

/etc/httpd/conf.d/ssl.conf is pretty much the unmodified distribution ssl.conf and looks like this:

LoadModule ssl_module modules/mod_ssl.so

Listen 443

AddType application/x-x509-ca-cert .crt
AddType application/x-pkcs7-crl    .crl

SSLPassPhraseDialog  builtin

SSLSessionCache         shmcb:/var/cache/mod_ssl/scache(512000)
SSLSessionCacheTimeout  300

SSLMutex default

SSLRandomSeed startup file:/dev/urandom  256
SSLRandomSeed connect builtin

SSLCryptoDevice builtin

SetEnvIf User-Agent ".*MSIE.*" \
         nokeepalive ssl-unclean-shutdown \
         downgrade-1.0 force-response-1.0

You’ll need to build yourself a certificate, self-signed if necessary, but that’s a whole other post. I recommend searching the web for “openssl self signed certificate” and you should find what you need. The above httpd.conf references the key and certificate under /etc/httpd/conf/svn.key and /etc/httpd/conf/svn.crt respectively.

The mod_authnz_external+pwauth combination can be avoided if you can persuade mod_authz_ldap to play nicely. There are a few different ldap modules around on the intertubes and after a lot of trial and even more error I couldn’t make any of them work reliably if at all.

And if all this leaves you feeling pretty nauseous it’s quite natural. To remedy this, go use git instead.

Thoughts on the WDTV Live Streaming Multimedia Player

A couple of weeks ago I had some Amazon credit to use and I picked up a Western Digital TV Live. I’ve been using it on and off since then and figured I’d jot down some thoughts.

Looks

Well how does it look? It’s small for starters, smaller than a double-CD case if you can remember those, around an inch deep. Probably a little larger than the Cyclone players although I don’t have any of those to compare with. It’s also very light indeed – not having a hard disk or power supply built in means the player itself can’t have much more than a motherboard in. I imagine the heaviest component is probably a power regulator heatsink or the case itself. It doesn’t sound like it has any fans in either which means there’s no audible running noise. I’ve wall-wart power bricks which make more running noise than this unit.

Mounting is performed using a couple of recesses on the back. I put a single screw into the VESA mount on the back of the kitchen TV and hung the WDTV from that. The infrared receiver seems pretty receptive just behind the top of the TV, facing upwards and the heaviest component to worry about is the HDMI or component AV cable – not a big deal at all.

Interface

The on-screen interface is pleasant and usable once you work your way around the icons and menus. The main screens – Music/Video/Services/Settings are easy enough but the functionality of the coloured menus isn’t too clear until you’ve either played around with them enough, or read the manual (haha). Associating to Wifi is a bit of a pain if you have a long WPA key as the soft keyboard isn’t too great. I did wonder if it’s possible to attach a USB keyboard just to enter passwords etc. but I didn’t try that out.

Connecting to NFS and SMB/CIFS shared drives is relatively easy. It helps if the shares are already configured to allow guest access or have a dedicated account for media players, for example. The WDTV Live really wants read-write access for any shares you’re going to use permanently so it can generate its own indices. I like navigating folders and files rather than special device-specific libraries so I’m not particularly keen on this, but if it improves the multimedia experience so be it. I’ve enough multimedia devices in the house now, each with their own method of indexing, that remembering which index folders from device A need to be ignored by device B is becoming a bit of a nuisance. I haven’t had more than the usual set of problems with sending remote audio to the WDTV Live from a bunch of different Android devices, or using it as a Media Renderer from the DiskStation Audio Station app.

The remote control feels solid, with positive button actions and a responsive receiver. It’s laid out logically I guess, by which I mean it’s laid out in roughly the same way as most other video & multimedia remote controls I’ve used.

Firmware Updates

So normally I expect to buy some sort of gadget like this, use it for a couple of months, find a handful of bugs and never receive any firmware updates for it ever again. However I’ve been pleasantly surprised. In the two weeks I’ve had the WDTV I’ve had two firmware updates, one during the initial installation and the most recent in the last couple of days to address, amongst other things, slow frontend performance when background tasks are running (read “multimedia indexing on network shares” here). I briefly had a scan around the web to see if there was an XBMC port and there didn’t appear to be although there were some requests. I haven’t looked to see what CPU the WDTV has inside but it’s probably a low power ARM or Broadcom or similar so would take some effort to port XBMC to (from memory I seem to recall there is an ARM port in the works though). The regular firmware is downloadable and hackable however and there’s at least one unofficial version around.

Performance

Video playback has been smooth on everything I’ve tried. The videos I’ve played back have all been different formats, different container formats, different resolutions etc. and all streamed over 802.11g wifi and ethernet. I didn’t have any trouble with either type of networking so I haven’t checked to see whether the wired port is 100Mbps or 1GbE. I haven’t tried USB playback and there’s no SD card slot, which you might expect.

Audio playback is smooth although the interface took a little getting used to. I’ve been used to the XBMC and Synology DSAudio style of Queue/Play but this device always seems to queue+play which is actually what you want a lot of the time. I don’t have a digital audio receiver so I haven’t tried the SPDIF out.

Picture playback is acceptable but I found the transitions pretty jumpy, at least with 12 and 14Mpx images over wifi.

Conclusions

Overall I’m pretty happy with this device. It’s cheap, small, quiet and unobtrusive but packs a fair punch in terms of features. My biggest gripe is that it’s really slow doing its indexing. I thought the reason could have been because it was running over wifi but even after attaching it to a wired network it’s taken three days solid scanning our family snaps and home videos (a mix of still-camera video captures, miniDV transfers and HD camcorder). It doesn’t give you an idea of how far it’s progressed or how much is left to go so the only option seems to be to leave it and let it run. I did also have an initial problem where the WDTV didn’t detect it had HDMI plugged in, preferring to use the composite video out. Unscientifically, at the same time as I updated the firmware I reversed the cable so I don’t know quite what fixed it but it seems to have been fine since.

If I had to give an overall score for the WDTV Live, I’d probably say somewhere around 8/10.

 

Haplotype Consensus Clustering

Way back in the annals of history (2002) I wrote a bit of code to perform haplotype groupings for early Ensembl-linked data. Like my recent kmer scanner example, it used one of my favourite bits of Perl – the regex engine. I dug the old script out of a backup and it was, as you’d expect, completely horrible. So for fun I gave it a makeover this evening, in-between bits of Silent Witness.

This is what it looks like now. Down to 52 lines of code from 118 lines in the 2002 version. I guess the last 10 years have made me a little over twice as concise.

#!/usr/local/bin/perl -T
use strict;
use warnings;

#########
# read everything in
#
my $in = [];
while(my $line = <>) {
  chomp $line;
  if(!$line) {
    next;
  }

  #########
  # build regex pattern
  #
  $line =~ s{[X]}{.}smxg;

  #########
  # store
  #
  push @{$in}, uc $line;
}

my $consensii = {};

#########
# iterate over inputs
#
SEQ: for my $seq (sort { srt($a, $b) } @{$in}) {
  #########
  # iterate over consensus sequences so far
  #
  for my $con (sort { srt($a, $b) } keys %{$consensii}) {
    if($seq =~ /^$con$/smx ||
       $con =~ /^$seq$/smx) {
      #########
      # if input matches consensus, store & jump to next sequence
      #
      push @{$consensii->{$con}}, $seq;
      next SEQ;
    }
  }

  #########
  # if no match was found, create a new consensus container
  #
  $consensii->{$seq} = [$seq];
}

#########
# custom sort routine
# - firstly sort by sequence length
# - secondly sort by number of "."s (looseness)
#
sub srt {
  my ($x, $y) = @_;
  my $lx = length $x;
  my $ly = length $y;

  if($lx < $ly) {
    return -1;
  }
  if($lx > $ly) {
    return 1;
  }

  my $nx = $x =~ tr{.}{.};
  my $ny = $y =~ tr{.}{.};

  return $nx <=> $ny;
}

#########
# tally and print everything out
#
while(my ($k, $v) = each %{$consensii}) {
  $k =~ s/[.]/X/sxmg;
  print $k, " [", (scalar @{$v}) ,"]\n";
  for my $m (@{$v}) {
    $m =~ s/[.]/X/sxmg;
    print "  $m\n";
  }
}

The input file looks something like this:

ACTGXTGC
ACTGATGC
ACTGTTGC
ACTGCTGC
ACTGGTGC
ACTXCTGC
ACXGCTGC
ACTGXTGC
CTGCTGC
CTGGTGC
CTXCTGC
CXGCTGC
CTGXTGC

ACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTG
ACTGXTGACTGACTG
ACTGACTGACTXACTG
ACXTGACTGACTGACTG

and the output looks a little like this – consensus [number of sequences] followed by an indented list of matching sequences:

elwood:~/dev rmp$ ./haplotype-sort < haplotype-in.txt 
ACTGXTGACTGACTG [1]
  ACTGXTGACTGACTG
ACTGATGC [1]
  ACTGATGC
CTGCTGC [4]
  CTGCTGC
  CTXCTGC
  CXGCTGC
  CTGXTGC
ACTGCTGC [5]
  ACTGCTGC
  ACTGXTGC
  ACTXCTGC
  ACXGCTGC
  ACTGXTGC
ACTGACTGACTGACTGACTG [2]
  ACTGACTGACTGACTGACTG
  ACTGACTGACTGACTGACTG
ACTGTTGC [1]
  ACTGTTGC
CTGGTGC [1]
  CTGGTGC
ACTGACTGACTXACTG [1]
  ACTGACTGACTXACTG
ACXTGACTGACTGACTG [1]
  ACXTGACTGACTGACTG
ACTGGTGC [1]
  ACTGGTGC
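
For anyone who doesn’t read Perl, the grouping logic above can be sketched roughly in Python. This is an approximation under a few assumptions rather than a line-for-line port: group_haplotypes is an illustrative name, input is uppercased before the X substitution (so a lowercase x also counts as a wildcard), and consensus candidates are tried in insertion order, which may break ties differently from the sorted hash keys in the Perl.

```python
import re

def group_haplotypes(lines):
    """Group sequences under the first consensus they are compatible with."""
    # X marks an unknown base; as in the Perl, it becomes the regex wildcard ".".
    patterns = [ln.strip().upper().replace("X", ".") for ln in lines if ln.strip()]
    # Shortest first, then fewest wildcards; the sort is stable for ties.
    patterns.sort(key=lambda p: (len(p), p.count(".")))

    groups = {}  # consensus pattern -> member patterns, in insertion order
    for seq in patterns:
        for con in groups:
            # Either side may carry the wildcard, so try the match both ways.
            if re.fullmatch(con, seq) or re.fullmatch(seq, con):
                groups[con].append(seq)
                break
        else:
            # No existing consensus matched: this sequence starts a new group.
            groups[seq] = [seq]

    def restore(pattern):
        return pattern.replace(".", "X")

    return {restore(con): [restore(m) for m in members]
            for con, members in groups.items()}

sample = ["CTGCTGC", "CTGGTGC", "CTXCTGC", "CXGCTGC", "CTGXTGC"]
for con, members in group_haplotypes(sample).items():
    print(f"{con} [{len(members)}]")
    for member in members:
        print(f"  {member}")
```

Fed the five 7-base sequences from the sample input, this puts four of them under CTGCTGC and leaves CTGGTGC on its own, the same split as in the output above.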

naïve kmer scanner

Another bit of fun, basically the opposite of yesterday’s post: here we’re detecting the number of unique kmers present in a sequence. It’s easy to do this with an iterating substr approach but I like Perl’s regex engine a lot so I wanted to do it using that. Okay, I wanted to do it entirely in one /e regex but it’s slightly trickier and a lot less clear manipulating pos inside a /e substitution function.

#!/usr/local/bin/perl
use strict;
use warnings;

my $str   = q[AAACAATAAGAAGCACCATCAGTACTATTAGGACGATGAGGCCCTCCGCTTCTGCGTCGGTTTGTGGG];
my $k     = 3;
my $match = q[\s*[ACTG]\s*]x$k;
my $seen  = {};

while($str =~ m{($match)}smxgi) {
  my $m = $1;
  $m    =~ s/\s*//smxg;

  $seen->{$m}++;

  pos $str = (pos $str) - $k + 1;
}

{
  local $, = "\n";
  print sort keys %{$seen};
}

printf "\n%d unique ${k}mers\n", scalar keys %{$seen};

$k is the size of the kmers we’re looking for. In this case 3, as we were generating yesterday.
$match attempts to take care of matches across newlines, roughly what one might find inside a FASTA. YMMV.
$seen keeps track of uniques we’ve encountered so far in $str.

The while loop iterates through matches found by the regex engine and pos, a function you don’t see too often, resets the start position for the next match, in this case to the current position minus 1 less than the length of the match (pos – k + 1).

The output looks something like this:


elwood:~/dev rmp$ ./kmers 
AAA
AAC
AAG
AAT
ACA
ACC
ACG
ACT
AGA
AGC
AGG
AGT
ATA
ATC
ATG
ATT
CAA
CAC
CAG
CAT
CCA
CCC
CCG
CCT
CGA
CGC
CGG
CGT
CTA
CTC
CTG
CTT
GAA
GAC
GAG
GAT
GCA
GCC
GCG
GCT
GGA
GGC
GGG
GGT
GTA
GTC
GTG
GTT
TAA
TAC
TAG
TAT
TCA
TCC
TCG
TCT
TGA
TGC
TGG
TGT
TTA
TTC
TTG
TTT
64 unique 3mers

If I were really keen I’d make use of this in a regression test for yesterday’s toy.
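
As an aside, the overlapping scan doesn’t have to rewind pos by hand. Here is a rough sketch of the same idea in Python, using a zero-width lookahead instead of the pos trick. That makes it a different technique from the Perl above, and it skips the whitespace handling that $match attempts. unique_kmers is just an illustrative name.

```python
import re

def unique_kmers(seq, k, alphabet="ACTG"):
    """Return the set of distinct k-mers in seq, scanning overlapping windows."""
    # The lookahead consumes no characters, so the regex engine advances one
    # position at a time and every overlapping window gets captured.
    pattern = re.compile(rf"(?=([{alphabet}]{{{k}}}))", re.IGNORECASE)
    return {m.group(1).upper() for m in pattern.finditer(seq)}

kmers = unique_kmers("AAACAA", 3)
print("\n".join(sorted(kmers)))
print(f"{len(kmers)} unique 3mers")
```

For the six-base string used here the windows are AAA, AAC, ACA and CAA, so it reports 4 unique 3mers.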

naïve kmer sequence generator

This evening, for “fun”, I was tinkering with a couple of methods for generating sequences containing diverse, distinct, kmer subsequences. Here’s a small, unintelligent, brute-force function I came up with.
Its alphabet is set at the top in $bases, as is k, the required length of the distinct subsequences. It keeps going until it’s been able to hit all distinct combinations, tracked in the $seen hash. The final sequence ends up in $str.

#!/usr/local/bin/perl
use strict;
use warnings;

my $bases     = [qw(A C T G)];
my $k         = 3;
my $seen      = {};
my $str       = q[];
my $max_perms = (scalar @{$bases})**$k;
my $pos       = -1;

POS: while((scalar keys %{$seen}) < $max_perms) {
  $pos ++;
  report();

  for my $base (@{$bases}) {
    my $triple = sprintf q[%s%s],
                 (substr $str, $pos-($k-1), ($k-1)),
                 $base;
    if($pos < ($k-1) ||
       !$seen->{$triple}++) {
      $str .= $base;
      next POS;
    }
  }
  $str .= $bases->[-1];
}

sub report {
  print "len=@{[length $str]} seen @{[scalar keys %{$seen}]}/$max_perms kmers\n";
}

report();
print $str, "\n";

Executing for k=3, bases = ACTG the output looks like this:

elwood:~/dev rmp$ ./seqgen
len=0 seen 0/64 kmers
len=1 seen 0/64 kmers
len=2 seen 0/64 kmers
len=3 seen 1/64 kmers
len=4 seen 2/64 kmers
len=5 seen 3/64 kmers
len=6 seen 4/64 kmers
len=7 seen 5/64 kmers
len=8 seen 6/64 kmers
len=9 seen 7/64 kmers
len=10 seen 8/64 kmers
len=11 seen 9/64 kmers
len=12 seen 10/64 kmers
len=13 seen 10/64 kmers
len=14 seen 11/64 kmers
len=15 seen 12/64 kmers
len=16 seen 13/64 kmers
len=17 seen 14/64 kmers
len=18 seen 15/64 kmers
len=19 seen 16/64 kmers
len=20 seen 17/64 kmers
len=21 seen 18/64 kmers
len=22 seen 19/64 kmers
len=23 seen 20/64 kmers
len=24 seen 21/64 kmers
len=25 seen 22/64 kmers
len=26 seen 23/64 kmers
len=27 seen 24/64 kmers
len=28 seen 25/64 kmers
len=29 seen 26/64 kmers
len=30 seen 27/64 kmers
len=31 seen 28/64 kmers
len=32 seen 29/64 kmers
len=33 seen 30/64 kmers
len=34 seen 31/64 kmers
len=35 seen 32/64 kmers
len=36 seen 33/64 kmers
len=37 seen 34/64 kmers
len=38 seen 35/64 kmers
len=39 seen 36/64 kmers
len=40 seen 37/64 kmers
len=41 seen 37/64 kmers
len=42 seen 38/64 kmers
len=43 seen 39/64 kmers
len=44 seen 40/64 kmers
len=45 seen 41/64 kmers
len=46 seen 42/64 kmers
len=47 seen 43/64 kmers
len=48 seen 44/64 kmers
len=49 seen 45/64 kmers
len=50 seen 46/64 kmers
len=51 seen 47/64 kmers
len=52 seen 48/64 kmers
len=53 seen 49/64 kmers
len=54 seen 50/64 kmers
len=55 seen 51/64 kmers
len=56 seen 52/64 kmers
len=57 seen 53/64 kmers
len=58 seen 54/64 kmers
len=59 seen 55/64 kmers
len=60 seen 56/64 kmers
len=61 seen 57/64 kmers
len=62 seen 58/64 kmers
len=63 seen 59/64 kmers
len=64 seen 60/64 kmers
len=65 seen 61/64 kmers
len=66 seen 62/64 kmers
len=67 seen 63/64 kmers
len=68 seen 64/64 kmers
AAACAATAAGAAGCACCATCAGTACTATTAGGACGATGAGGCCCTCCGCTTCTGCGTCGGTTTGTGGG

As you can see, it manages to fit 64 distinct base triples in a string of only 68 characters. It could probably be packed a little more efficiently but I don’t think that’s too bad for a first attempt.
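
On the packing question: the shortest possible string containing every kmer over an n-letter alphabet has length n^k + k - 1, which is 66 for k=3 and four bases, so the brute-force result is only two characters over. A string of that length can be read off a de Bruijn sequence. The sketch below is in Python and uses the standard Lyndon-word construction, which is nothing like the brute-force search above. de_bruijn_string is an illustrative name.

```python
def de_bruijn_string(alphabet, k):
    """Return a shortest string containing every k-mer over alphabet once."""
    n = len(alphabet)
    a = [0] * (n * k)
    sequence = []

    def db(t, p):
        # Enumerate Lyndon words in order; those whose length divides k
        # are concatenated to form the cyclic de Bruijn sequence.
        if t > k:
            if k % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    cyclic = "".join(alphabet[i] for i in sequence)
    # The sequence is cyclic; repeat the first k-1 symbols to linearise it.
    return cyclic + cyclic[:k - 1]

s = de_bruijn_string("ACTG", 3)
print(len(s), s)
```

Each of the 64 triples appears exactly once in the 66-character result, which makes it a handy reference when checking how close the brute-force generator gets.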
