Given a stream of incoming objects, we want to select a random subset of size "count".
This has many applications - for example, in my work as a software tester I could use this to do something like
cat alltests.txt |
Select-Tests-Meeting-Some-Criteria |
Modify-Tests |
Select-Random -Count 100 |
Run-Tests
This pipeline modifies all tests meeting some criteria and then selects a random subset of just 100 tests out of the (say) millions of modified tests in my test repository and executes those tests.
The obvious brute force solution would be to save all incoming objects into an array, shuffle the array and select the top "count" objects from the array. But this has a number of problems including:
- Storage requirement of O(N) where "N" is the size of the input
- All the processing starts only after all input has been read in so that there is a computational peak right at the end of input and no computation happens while reading the input
Here is a Powershell commandlet that encodes this logic:
function Select-Random {
param ($count) ;
begin {
if($count -le 0) {
throw [System.Exception]
}
$num_seen = 0;
$rand_generator = New-Object system.random;
$selected = @();
}
process {
if($num_seen++ -lt $count){
$selected += $_;
}
else {
#Get a random number between 0 and ($num_seen -1)
$random = $rand_generator.Next(0,$num_seen);
if($random -lt $count) { #Choose the new number and add to $selected
$selected[$random] = $_ ;
}
}
}
end {
$selected;
}
}
And here is a test to make sure that our algorithm works correctly - our input consists of the integers from 1 to 11 and we select 10 of these. In other words, the probability of selection for any number is 10/11 = 0.9091. We repeat this test 10000 times and check the frequency of occurrence of each number in the output - it should hover around 0.9091 * 10000 = 9091
for ($i = 0 ; $i -lt 10000 ; ++$i) {
$list += ((1..11) | Select-Random -count 10)
}
$list | Group-Object | Sort-Object -Property Count
PS: 44 >.\select-random.ps1
Count Name Group
----- ---- -----
9051 4 {4, 4, 4, 4...}
9056 7 {7, 7, 7, 7...}
9070 1 {1, 1, 1, 1...}
9079 2 {2, 2, 2, 2...}
9083 8 {8, 8, 8, 8...}
9088 10 {10, 10, 10, 10...}
9094 9 {9, 9, 9, 9...}
9094 6 {6, 6, 6, 6...}
9099 3 {3, 3, 3, 3...}
9133 11 {11, 11, 11, 11...}
9153 5 {5, 5, 5, 5...}
Some things I learnt from this exercise:
- How to initialize an empty Powershell array
- How to add an element to the end of a Powershell array
- The Group-Object commandlet