Sampling on an aggregated dataset


The input is a dataset where every row contains a member ID and an aggregated count for an event, say clicks. Each member ID is unique. Sample data:

M1,100
M2,100
M3,50
M4,50

The goal is to sample 1% of the clicks, where the total number of clicks is the sum across all member IDs. On the sample data above, that means sampling 3 of the 300 clicks, producing something like:

M1,1
M2,1
M4,1

or some other combination where the sampled clicks sum to 1% of the total.

One basic approach is to explode every row into individual click events and then sample 1% of those events. That would be very slow and memory-hungry if there are millions of members, each with hundreds of clicks. Is there a better solution that avoids exploding the data?


1 answer

Answer by Robert Dodier (accepted)

It seems like the obvious thing to do is to sample users, with each user's probability proportional to their number of clicks, and then select a click uniformly at random from the chosen user. In your example, that means selecting users with probabilities 100/300, 100/300, 50/300, and 50/300, and then selecting a click from whichever user was chosen.

You can sample proportionally to weights (100/300, 100/300, 50/300, 50/300 here) by generating a random number p between 0 and 1 and then finding the smallest k (k = 1, 2, 3, ..., #weights) such that the sum of the weights from 1 to k is greater than or equal to p.
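The step above can be sketched in Python. This is a minimal illustration (the function name `weighted_choice` is mine, not from the answer); it works directly on raw counts by scaling the random draw instead of normalizing the weights:

```python
import random

def weighted_choice(counts):
    """Return index k with probability counts[k] / sum(counts)."""
    total = sum(counts)
    p = random.random() * total      # draw in [0, total) instead of normalizing
    cumulative = 0
    for k, c in enumerate(counts):
        cumulative += c
        if p < cumulative:           # smallest k whose running sum exceeds p
            return k
    return len(counts) - 1           # floating-point guard; practically unreachable

counts = [100, 100, 50, 50]          # the M1..M4 example from the question
print(weighted_choice(counts))       # 0, 1, 2, or 3
```

Over many draws, indices 0 and 1 (M1, M2) should come up about twice as often as 2 and 3 (M3, M4). This linear scan is O(n) per sample; the binary-search refinement below improves on it.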

An efficient way to find k is to construct a list of the partial sums of the weights (i.e. 0, w1, w1 + w2, w1 + w2 + w3, ...) and then carry out a binary search (not a linear scan) on that list. Binary search yields time per sample that grows logarithmically with the number of weights (users, in your case), while linear search grows linearly.
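As a sketch of that binary search (assuming Python; the helper names are mine), the standard-library `bisect` module does the search over precomputed partial sums of the raw counts:

```python
import bisect
import random

def make_partial_sums(counts):
    """Prefix sums over raw counts: [0, c1, c1+c2, ..., total]."""
    sums = [0]
    for c in counts:
        sums.append(sums[-1] + c)
    return sums

def sample_user(partial_sums):
    """O(log n) weighted pick: find i with S[i] <= x < S[i+1]."""
    x = random.random() * partial_sums[-1]
    # bisect_right returns the insertion point; subtract 1 to get
    # the index i of the interval [S[i], S[i+1]) that contains x
    return bisect.bisect_right(partial_sums, x) - 1
```

Building the prefix sums is a one-time O(n) cost; each subsequent sample costs only O(log n).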

EDIT: An example. Given n = 10 users with N = (100, 160, 200, 20, 500, 550, 400, 300, 120, 80) events, respectively. Total events = 2430, and weights w = (10/243, 16/243, 20/243, 2/243, 50/243, 55/243, 40/243, 30/243, 12/243, 8/243). Partial sums of weights S = (0, 10/243, 26/243, 46/243, 48/243, 98/243, 153/243, 193/243, 223/243, 235/243, 1). (NOTE: I was mistaken before; the sequence should be (0, w1, w1 + w2, w1 + w2 + w3, ..., w1 + ... + w[n - 1], 1).)
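The arithmetic above can be checked exactly with Python's `fractions` module (just a verification sketch, not part of the sampling itself):

```python
from fractions import Fraction

# The per-user event counts from the example above
counts = [100, 160, 200, 20, 500, 550, 400, 300, 120, 80]
total = sum(counts)
weights = [Fraction(c, total) for c in counts]

# Partial sums (0, w1, w1+w2, ..., 1)
partial = [Fraction(0)]
for w in weights:
    partial.append(partial[-1] + w)

print(total, partial[1], partial[-1])   # → 2430 10/243 1
```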

Given a random number x between 0 and 1, find (by binary search) the index i of the partial sum such that S[i] <= x < S[i + 1]. Then select an event uniformly at random from the N[i] events of user i.

I assume that you can carry out the binary search and the sampling from the per-user events so I won't write out that part.
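For completeness, here is one way the whole procedure might be put together (a sketch; `sample_clicks` and its interface are my own, not from the answer). Note it draws members with replacement, so repeated draws of the same member are simply counted, matching the aggregated output format in the question:

```python
import bisect
import random

def sample_clicks(member_counts, fraction=0.01):
    """Sample `fraction` of total clicks without exploding the data.

    member_counts: list of (member_id, click_count) pairs.
    Returns a dict member_id -> number of sampled clicks.
    Draws are made with replacement, as in the scheme described above.
    """
    ids = [m for m, _ in member_counts]
    counts = [c for _, c in member_counts]

    sums = [0]                            # prefix sums of raw counts
    for c in counts:
        sums.append(sums[-1] + c)
    total = sums[-1]
    n_samples = max(1, round(total * fraction))

    sampled = {}
    for _ in range(n_samples):
        x = random.random() * total
        i = bisect.bisect_right(sums, x) - 1   # binary search for the member
        sampled[ids[i]] = sampled.get(ids[i], 0) + 1
    return sampled

# Example from the question: 1% of 300 clicks = 3 sampled clicks
print(sample_clicks([("M1", 100), ("M2", 100), ("M3", 50), ("M4", 50)]))
```

One caveat: because draws are with replacement, the same underlying click could in principle be selected twice; if strictly distinct clicks are required, you would need to track which per-member click indices have already been drawn.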

EDIT2: Fixed up formula for list of partial sums. The list has n + 1 elements; searching for i such that S[i] <= x < S[i + 1] will therefore yield i = 1, 2, 3, ..., n. The final element, 1, won't ever be selected, assuming the random number is always less than 1.