Sunday, June 06, 2010

Powershell commandlet to select a random subset from input stream

The problem:

Given a stream of incoming objects, we want to select a random subset of size "count".

This has many applications - for example, in my work as a software tester I could use this to do something like

cat alltests.txt |
Select-Tests-Meeting-Some-Criteria |
Modify-Tests |
Select-Random -Count 100 |
Run-Tests

This pipeline modifies all tests meeting some criteria and then selects a random subset of just 100 tests out of the (say) millions of modified tests in my test repository and executes those tests.

The obvious brute force solution would be to save all incoming objects into an array, shuffle the array and select the top "count" objects from the array. But this has a number of problems including:
  • Storage requirement of O(N) where "N" is the size of the input
  • All the processing starts only after all input has been read in so that there is a computational peak right at the end of input and no computation happens while reading the input
We can instead adopt a different approach that involves iteratively discarding input as it streams in and retaining only "count" objects at any given point of time. Of course, while doing so we have to ensure that all input objects have an equal chance of being selected. To ensure that we do this correctly, we will need to bias our selection to favor objects that have already survived previous eliminations rather than the new incoming object that has not yet been subjected to previous eliminations. This is easy to achieve if we maintain the invariant that the nth. input object should have a probability of survival equal to count/n for n>= count.

Here is a Powershell commandlet that encodes this logic:




function Select-Random {
param ($count) ;
begin {
if($count -le 0) {
throw [System.Exception]
}
$num_seen = 0;
$rand_generator = New-Object system.random;
$selected = @();
}
process {
if($num_seen++ -lt $count){
$selected += $_;
}
else {
#Get a random number between 0 and ($num_seen -1)
$random = $rand_generator.Next(0,$num_seen);
if($random -lt $count) { #Choose the new number and add to $selected
$selected[$random] = $_ ;
}
}
}
end {
$selected;
}
}




And here is a test to make sure that our algorithm works correctly - our input consists of the integers from 1 to 11 and we select 10 of these. In other words, the probability of selection for any number is 10/11 = 0.9091. We repeat this test 10000 times and check the frequency of occurrence of each number in the output - it should hover around 0.9091 * 10000 = 9091

for ($i = 0 ; $i -lt 10000 ; ++$i) {
$list += ((1..11) | Select-Random -count 10)
}
$list | Group-Object | Sort-Object -Property Count

PS: 44 >.\select-random.ps1

Count Name Group
----- ---- -----
9051 4 {4, 4, 4, 4...}
9056 7 {7, 7, 7, 7...}
9070 1 {1, 1, 1, 1...}
9079 2 {2, 2, 2, 2...}
9083 8 {8, 8, 8, 8...}
9088 10 {10, 10, 10, 10...}
9094 9 {9, 9, 9, 9...}
9094 6 {6, 6, 6, 6...}
9099 3 {3, 3, 3, 3...}
9133 11 {11, 11, 11, 11...}
9153 5 {5, 5, 5, 5...}

Some things I learnt from this exercise:
  • How to initialize an empty Powershell array
  • How to add an element to the end of a Powershell array
  • The Group-Object commandlet

No comments: