Introduction
Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory.
Algorithm
Create an array of size k, and keep the first k items from the list in it.
For the i-th item (where k < i <= n):
Generate a random integer j from 1 to i.
If 1 <= j <= k, replace the j-th item in the array with the new item.
Otherwise, ignore the new item.
Proof of the correctness by induction:
Base case: when n = k, all items are in the array, which is correct.
Inductive step:
Assume that when n = i (where i > k), each item is kept in the array with probability of k / i.
Then when n = i + 1, the new item would be kept in the array with probability of k / (i + 1).
Meanwhile, for the first i items, each of them would be kept in the array with probability of (k / i) (1 - (k / (i + 1)) (1 / k)), i.e. k / (i + 1).
Note that:
k / i is the probability that one of the first i items was kept in the array before the current step according to the assumption.
k / (i + 1) is the probability that the new item would be kept in the array.
1 / k is the probability of choosing an item from the array.
1 - (k / (i + 1)) * (1 / k) is the probability that an item in the array would not be replaced.