Taking Random Samples

29 Mar

Reblogged from https://eyalsch.wordpress.com/2010/04/01/random-sample/

Sometimes we need to take a random sample from a given collection of objects. This is useful in simulations, games, and statistical applications. In this post I discuss some efficient algorithms for choosing m items out of the n available, such that the m items are chosen at random, with equal probability for every possible subset of size m. For each algorithm I present the code in Java, as well as a formal analysis of its performance and correctness, where needed. As you may have already noticed from this introduction, this post is more theoretical than usual…

A somewhat related problem is that of generating a random permutation of a given collection. While Java provides an efficient utility method for that purpose (see Collections.shuffle(..), based on the Knuth shuffle algorithm), there is no built-in utility for generating random subsets.
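
For reference, here is a minimal sketch of using that utility; the wrapper method and the decision to copy the input are my choices, not part of the API:

public static <T> List<T> randomPermutation(List<T> items){
    // Copy first so the caller's list is left untouched
    List<T> copy = new ArrayList<T>(items);
    Collections.shuffle(copy); // in-place Knuth (Fisher-Yates) shuffle, O(n)
    return copy;
}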

Before we start discussing some known algorithms, it is important to note that there are many variations of the problem, depending on the following parameters:

  1. Is the collection random access (e.g. ArrayList)?
  2. Is the collection read only?
  3. Do we allow a randomized running time (bounded only in expectation)?
  4. Is N known at invocation time, or are we dealing with a stream of items of unknown size?

Trial and Error

We will start with the simplest approach. Assuming that the collection we sample from is random access, we can repeatedly add random items from it to our result set, until the set contains m distinct items. As an optimization, when m > n/2, we can choose (n-m) items instead of m, and then return the rest:

private static final Random rnd = new Random(); // shared by all snippets in this post

public static <T> Set<T> randomSample1(List<T> items, int m){
    HashSet<T> res = new HashSet<T>();
    int n = items.size();
    if (m > n/2){ // The optimization: sample the complement instead
        Set<T> negativeSet = randomSample1(items, n-m);
        for(T item : items){
            if (!negativeSet.contains(item))
                res.add(item);
        }
    }else{ // The main loop: draw until we have m distinct items
        while(res.size() < m){
            int randPos = rnd.nextInt(n);
            res.add(items.get(randPos));
        }
    }
    return res;
}
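
A quick usage sketch (the data here is hypothetical):

// Pick 3 distinct values out of 10:
List<Integer> data = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
Set<Integer> sample = randomSample1(data, 3);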

Clearly, the number of iterations in the main loop is not bounded. However, we can compute the expected number of iterations, which makes this algorithm a Las Vegas algorithm. The iterations can be split into m sequences (numbered 0 to m-1), where sequence i consists of the attempts needed to add the (i+1)-th item to the result set. By observing that the length of sequence i has a geometric distribution with p=(n-i)/n, we can calculate the expected number of iterations as follows:

E(m,n) = \frac{n}{n} + \frac{n}{n-1} + \frac{n}{n-2} + ... + \frac{n}{n-m+1}
Let us also define E(0,n)=0.

If we examine E(m,n) as a function of m only, in the interval m=0 to m=\lfloor\frac{n}{2}\rfloor, it can be easily verified that the function is increasing and convex. Since E(0,n)=0, convexity implies that E(m,n)/m is non-decreasing in m, and therefore:
E(m,n) \leq \frac{E(\lfloor\frac{n}{2}\rfloor,n)}{\lfloor\frac{n}{2}\rfloor} \cdot m=\frac{n\cdot(\frac{1}{n} + \frac{1}{n-1} + ... + \frac{1}{n-\lfloor\frac{n}{2}\rfloor+1})}{\lfloor\frac{n}{2}\rfloor} \cdot m \leq \frac{n\cdot\lfloor\frac{n}{2}\rfloor\cdot\frac{1}{\lfloor\frac{n}{2}\rfloor}}{\lfloor\frac{n}{2}\rfloor} \cdot m \leq 3m

In other words the expected time complexity is linear in m, which is optimal. This is a little surprising, given the naive nature of the algorithm, and it results from our optimization.
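
As a quick sanity check with concrete numbers (n=10 and m=5, the largest m the main loop handles):

E(5,10) = \frac{10}{10} + \frac{10}{9} + \frac{10}{8} + \frac{10}{7} + \frac{10}{6} \approx 6.46 \leq 3\cdot 5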

Swapping

If our collection is random access and its items can be freely reordered, then we can efficiently draw random items one by one from a candidate set containing only the items not chosen so far. By swapping items, we can guarantee that the candidate set is always contiguous. As a matter of fact, this algorithm is a bounded version of the Knuth shuffle algorithm for generating random permutations, and its correctness is trivial.

public static <T> List<T> randomSample2(List<T> items, int m){
    for(int i=0;i<m;i++){
        // Pick a random position among the not-yet-chosen suffix i..n-1
        int pos = i + rnd.nextInt(items.size() - i);
        // Swap the chosen item into position i
        T tmp = items.get(pos);
        items.set(pos, items.get(i));
        items.set(i, tmp);
    }
    // The first m positions now hold the sample (note: the input list gets reordered)
    return items.subList(0, m);
}

Full Scan

Sometimes our collection is not random access, so in the worst case we have no choice but to traverse it completely. The following is an elegant solution that iterates once through the items, and selects each item with probability (#remaining to select)/(#remaining to scan):

public static <T> List<T> randomSample3(List<T> items, int m){
    ArrayList<T> res = new ArrayList<T>(m);
    int visited = 0;
    Iterator<T> it = items.iterator();
    while (m > 0){
        T item = it.next();
        // Select with probability (#remaining to select)/(#remaining to scan)
        if (rnd.nextDouble() < ((double)m)/(items.size() - visited)){
            res.add(item);
            m--;
        }
        visited++;
    }
    return res;
}

Clearly, the running time is O(n), which is optimal given the constraints.
It is left to prove that for any collection C and any subset S, the probability of generating S is {1 \over {|C| \choose |S|}}. We can do this by induction on the size of C.
For |C|=1 and |S| between 0 and 1 this is trivial. Now, let C be an ordered collection and let S be a subset of C. When the algorithm starts traversing C, it first encounters an item v. If v \in S, the algorithm must choose it and then proceed by choosing the items S-{v} from the collection C-{v}. We can use the induction assumption to calculate the probability of this:
p(C,S)={|S|\over|C|}\cdot{1\over{|C|-1\choose |S|-1}}={1\over{|C|\choose |S|}}

If on the other hand v \notin S, then the algorithm must reject v and proceed by choosing the items of S from the collection C-{v}. This happens with probability:
p(C,S)={|C|-|S|\over|C|}\cdot{1\over{|C|-1\choose |S|}}={1\over{|C|\choose |S|}}
In either case we get the required probability.

Floyd’s Algorithm

What happens if we don’t want the time complexity to depend on n, and the collection is read only?
In this case, assuming the collection is random access, we can use Floyd’s algorithm, which is both brilliant and easy to implement. It iterates with a variable i over the last m indexes of the collection, and on each iteration adds a single item from the range 0..i, chosen with a non-uniform distribution:

public static <T> Set<T> randomSample4(List<T> items, int m){
    HashSet<T> res = new HashSet<T>(m);
    int n = items.size();
    for(int i=n-m;i<n;i++){
        int pos = rnd.nextInt(i+1);
        T item = items.get(pos);
        if (res.contains(item))
            res.add(items.get(i)); // collision: items[pos] already chosen, take items[i] instead
        else
            res.add(item);
    }
    return res;
}

The time complexity is O(m) on average, but we can bound the worst-case time by O(m log(m)) if we use a balanced tree instead of a hash set (see the sketch after the proof below).
The correctness follows by proving the following loop invariant: after completing an iteration with i=j, the set res contains a uniformly distributed random subset, of size j-n+m+1, of the items in the range 0..j. We prove this by induction on i. For the initial value of i (i=n-m), we simply choose a random position in the range 0..i, so res is a uniformly random subset of size 1 of that range. Now, let i=j (>n-m), and let S be any subset of size j-n+m+1 of the range 0..j. If items[j] is in S, then the previous iteration must have completed with res=S-{items[j]}, and then either position j or any previously selected position must be chosen. Using the induction assumption, we can compute the probability of obtaining S as follows:

p_1 = {1 \over {j\choose |S|-1}}\cdot{|S|\over j+1}={1 \over {j+1 \choose |S|}}

If on the other hand items[j] is not in S, then the previous iteration may have ended with any of the |S| subsets of S of size |S|-1, after which the one missing index must be chosen directly:

p_2 = {|S| \choose |S|-1}{1 \over {j\choose |S|-1}}\cdot{1\over j+1}={1 \over {j+1 \choose |S|}}

In both cases we have a uniform probability for choosing |S| items from j+1 candidates, and that completes the proof.
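
To make the O(m log(m)) bound concrete, here is a minimal sketch of the balanced-tree variant; tracking the selected indexes in a java.util.TreeSet rather than the items themselves is my choice, not part of the original algorithm:

public static <T> List<T> randomSample4Tree(List<T> items, int m){
    TreeSet<Integer> positions = new TreeSet<Integer>(); // balanced tree: O(log m) per operation
    int n = items.size();
    for(int i=n-m;i<n;i++){
        int pos = rnd.nextInt(i+1);
        if (!positions.add(pos)) // collision: pos was already selected...
            positions.add(i);    // ...so take index i instead (never selected before)
    }
    ArrayList<T> res = new ArrayList<T>(m);
    for(int pos : positions)
        res.add(items.get(pos));
    return res;
}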

Stream of items

Sometimes we don’t know the collection size in advance; we can only iterate through it, as if it were a data stream of unknown size. The following algorithm (known as reservoir sampling) performs a single pass over the input stream. While iterating, it maintains a set of items that represents a random subset (of size m) of the items visited so far. It starts by selecting the first m items, and then it selects the k-th item with probability m/k. If an item is selected, it replaces a randomly chosen item among those previously selected:

public static <T> List<T> reservoirSample(Iterable<T> items, int m){
    ArrayList<T> res = new ArrayList<T>(m);
    int count = 0;
    for(T item : items){
        count++;
        if (count <= m)
            res.add(item); // fill the reservoir with the first m items
        else{
            // Select the k-th item with probability m/k...
            int r = rnd.nextInt(count);
            if (r < m)
                res.set(r, item); // ...replacing a uniformly chosen reservoir slot
        }
    }
    return res;
}

Each iteration takes constant time; therefore the algorithm runs in O(n) time, where n is the stream length.
The correctness can be proved by induction on the stream size. We want to prove that for every stream T and every value of m (m<=|T|), all subsets of T of size m are equally likely to be returned. The base case is a stream of size m: there is only one possible subset, and the algorithm always returns it. Now assume that we are given a stream T, and that the induction property holds for all stream lengths below |T|. Let S be a subset of T of size m. We shall inspect the last item v of the stream.
If v is not in S, then we must have chosen all items of S by the time we completed the previous iteration. We should also reject the last item:
p_1 = {1 \over {|T|-1 \choose |S|}}\cdot{|T|-|S|\over |T|}={1 \over {|T|\choose|S|}}

If on the other hand v is in S, then after the previous iteration the reservoir must already hold the other |S|-1 items of the sample, plus one extra item among the |T|-|S| possible ones; then v must be chosen and must replace the extra item:
p_2 = {|T|-|S| \over {|T|-1 \choose |S|}}\cdot{|S|\over |T|}\cdot{1\over |S|}={1 \over {|T|\choose|S|}}

In both cases we get the required probability.
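
A quick usage sketch (the data here is hypothetical; the point is that the method only needs an Iterable and never asks for the size):

// Sample 2 items from a source we can only iterate over once:
Iterable<String> stream = Arrays.asList("a", "b", "c", "d", "e");
List<String> sample = reservoirSample(stream, 2);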

A note about random number generators

The algorithms above assume that the random number generator behaves like a true random source, but this is rarely the case. The class java.util.Random uses a very popular pseudorandom number generation approach called a Linear Congruential Generator. Because every generated number is determined by the previously generated one (or, initially, by the seed), there is a bounded number of sequences that can be produced. Java’s Random class uses a state record of 48 bits, so it can never produce more than 2^{48} different sequences. This has a huge impact when generating random constructs from a large combinatorial space, such as subsets or permutations. For example, if we are interested in generating a random subset of size 20 from a set of size 60, there are {60 \choose 20} options, which exceeds 2^{48}. Therefore, the results of any algorithm that uses java.util.Random are necessarily biased, because there are many possible subsets that can never be returned. In fact, we can cover only about 7% of the subset space! For many applications this behavior is good enough. For others, which require more accuracy, we should consider a different random source than java.util.Random. After searching the web a little, I found the RngPack library, which seems to have some powerful random number generator implementations.
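
For example, since all of the snippets above draw from a single shared rnd field, one low-friction option (my suggestion, not from the original post) is java.security.SecureRandom, which extends java.util.Random and carries far more internal state than the 48-bit LCG:

// Drop-in replacement: every snippet above works unchanged (at some performance cost).
private static final Random rnd = new java.security.SecureRandom();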

Sources

http://delivery.acm.org/10.1145/320000/315746/p754-bentley.pdf?key1=315746&key2=0487028621&coll=GUIDE&dl=GUIDE&CFID=79451466&CFTOKEN=34515112
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
http://en.wikipedia.org/wiki/Linear_congruential_generator
http://comjnl.oxfordjournals.org/cgi/reprint/25/1/45.pdf
http://comjnl.oxfordjournals.org/cgi/reprint/28/4/412.pdf
http://www.jstor.org/stable/pdfplus/2347297.pdf
http://stackoverflow.com/questions/48087/select-a-random-n-elements-from-listt-in-c#
http://stackoverflow.com/questions/136474/best-way-to-pick-a-random-subset-from-a-collection
