Becoming Sanjuro

Fail again. Fail better.

A Note on Reservoir Sampling

Introduction

Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory.

Algorithm

  1. Create an array of size k, and keep the first k items from the list in it.

  2. For the i-th item (where k < i <= n):

    • Generate a random integer j from 1 to i.

    • If 1 <= j <= k, replace the j-th item in the array with the new item.

    • Otherwise, ignore the new item.

Proof of the correctness by induction:

  1. Base case: when n = k, all items are in the array, which is correct.

  2. Inductive step:

    • Assume that when n = i (where i > k), each item is kept in the array with probability of k / i.

    • Then when n = i + 1, the new item would be kept in the array with probability of k / (i + 1).

      Meanwhile, for the first i items, each of them would be kept in the array with probability of (k / i) (1 - (k / (i + 1)) (1 / k)), i.e. k / (i + 1).

      Note that:

      k / i is the probability that one of the first i items was kept in the array before the current step according to the assumption.

      k / (i + 1) is the probability that the new item would be kept in the array.

      1 / k is the probability of choosing an item from the array.

      1 - (k / (i + 1)) * (1 / k) is the probability that an item in the array would not be replaced.