Sunday, June 06, 2010

Powershell commandlet to select a random subset from input stream

The problem:

Given a stream of incoming objects, we want to select a random subset of size "count".

This has many applications - for example, in my work as a software tester I could use this to do something like

cat alltests.txt |
Select-Tests-Meeting-Some-Criteria |
Modify-Tests |
Select-Random -Count 100 |
Run-Tests

This pipeline modifies all tests meeting some criteria and then selects a random subset of just 100 tests out of the (say) millions of modified tests in my test repository and executes those tests.

The obvious brute force solution would be to save all incoming objects into an array, shuffle the array and select the top "count" objects from the array. But this has a number of problems including:
  • Storage requirement of O(N) where "N" is the size of the input
  • All the processing starts only after all input has been read in so that there is a computational peak right at the end of input and no computation happens while reading the input
We can instead adopt a different approach that involves iteratively discarding input as it streams in and retaining only "count" objects at any given point of time. Of course, while doing so we have to ensure that all input objects have an equal chance of being selected. To ensure that we do this correctly, we will need to bias our selection to favor objects that have already survived previous eliminations rather than the new incoming object that has not yet been subjected to previous eliminations. This is easy to achieve if we maintain the invariant that the nth. input object should have a probability of survival equal to count/n for n>= count.

Here is a Powershell commandlet that encodes this logic:




function Select-Random {
param ($count) ;
begin {
if($count -le 0) {
throw [System.Exception]
}
$num_seen = 0;
$rand_generator = New-Object system.random;
$selected = @();
}
process {
if($num_seen++ -lt $count){
$selected += $_;
}
else {
#Get a random number between 0 and ($num_seen -1)
$random = $rand_generator.Next(0,$num_seen);
if($random -lt $count) { #Choose the new number and add to $selected
$selected[$random] = $_ ;
}
}
}
end {
$selected;
}
}




And here is a test to make sure that our algorithm works correctly - our input consists of the integers from 1 to 11 and we select 10 of these. In other words, the probability of selection for any number is 10/11 = 0.9091. We repeat this test 10000 times and check the frequency of occurrence of each number in the output - it should hover around 0.9091 * 10000 = 9091

for ($i = 0 ; $i -lt 10000 ; ++$i) {
$list += ((1..11) | Select-Random -count 10)
}
$list | Group-Object | Sort-Object -Property Count

PS: 44 >.\select-random.ps1

Count Name Group
----- ---- -----
9051 4 {4, 4, 4, 4...}
9056 7 {7, 7, 7, 7...}
9070 1 {1, 1, 1, 1...}
9079 2 {2, 2, 2, 2...}
9083 8 {8, 8, 8, 8...}
9088 10 {10, 10, 10, 10...}
9094 9 {9, 9, 9, 9...}
9094 6 {6, 6, 6, 6...}
9099 3 {3, 3, 3, 3...}
9133 11 {11, 11, 11, 11...}
9153 5 {5, 5, 5, 5...}

Some things I learnt from this exercise:
  • How to initialize an empty Powershell array
  • How to add an element to the end of a Powershell array
  • The Group-Object commandlet

Saturday, June 05, 2010

Importing mh mail into gmail

From 1990 until approx. 2004 I used the UNIX based mail client "mh". Around 2004 I got an invitation from my friend who worked at Google to try out gmail.

I had never used any web-based email service for a number of reasons:
  • I had assiduously collected all my personal email across a period of more than fifteen years and wanted the security of knowing that all my email would always be available to me even if the "free" email provider were to close down
  • As an mh user I was accustomed to the power of being able to do stuff such as "scan +inbox | wc -l" and no web based email would allow me to do that
  • I did not care to read email accompanied by flashing advertisements etc.
The downside of my intransigence of course was that I could not take advantage of the universal access to my email that web based email providers offered besides having to ensure secure storage all my email myself.

gmail was a refreshing change however since it was the first email provider to also allow pop-based access to email stored there. With gmail I now had the choice of accessing my email anytime I had internet access while still permitting me to retain control over my email since I could always download it onto my local machine and I opened an email account (which was then "invitation only" :-).

The reality turned out to be different however. While it was true that I could download my mail in gmail using POP that had its own problems since all my emails got downloaded (including once I had already downloaded) every time I fetched mail from my gmail server. Furthermore gmail had added a number of cool features such as keyboard shortcuts, tagging, searching etc. that I was fast getting addicted to. Also I had taken up a job that involved working in a Windows environment so that it was no longer very feasible for me to use my Linux laptop at work. Before I knew it, I had already accumulated a substantial amount of mail in my gmail account in addition to my substantial mail on my home machine. Furthermore with the evolution of email I have to regretfully acknowledge that mh (or its graphical successor exmh) no longer cuts it and I have accepted that I have to move on from my past love ...

I am now trying to import my old mail from mh into gmail and here is how I was able to successfully achieve this:
  • download the python script "gml" from http://marklyon.org/gmail/old/gml.tar using "wget http://marklyon.org/gmail/old/gml.tar"
  • Extract the file gml.py - "tar xvf gml.tar"
I then wrote a bash script to upload all my old email to gmail and here is what it looks like:


folders -noverbose -all -recurse -fast -noheader -nototal | while read folder
do
#Remove the old mbox file
/bin/rm ./msgbox
#Pack this folder into an mbox format file
packf +$folder
#Mail the contents of this file one mail at a time while preserving
#original headers and simultaneously add folder-related information
#for use in gmail
python gml.py mbox ./msgbox address+$folder@gmail.com outgoing_smtp_server_address
done


Just in case you were not aware of this, gmail allows you to create multiple aliases for the same email address e.g. (messages to foo@gmail.com, foo+bar@gmail.com, foo+baz@gmail.com are all delivered to foo@gmail.com) - see http://mail.google.com/support/bin/answer.py?hl=en&answer=12096 for more details.

As a result I can now (theoretically) transfer all my old email to my gmail account. I will now need to figure out how to automatically label them based on the old source folder name (which I am embedding into my "To:" address field. I have not yet figured out how I can do that. Watch this space ...