How to shuffle an array of items but allow weights to influence the order


I'm trying to write a TypeScript function to shuffle an array.

By default, I want the shuffle order to be random (but subject to a seed). (I already have access to this function: function random(seed: number): number)

However, I want to also allow influencing the order via weights per item.

In other words, I want the default item weight to be 1, and if an item has a weight of 10, it should be 10 times more likely to appear sooner in the shuffled order.

Am I even thinking about this correctly? Is this a reasonable goal?

I was thinking that I'd need to use the Fisher-Yates algorithm but adapted to honor a weights array of the same length as the main array, and the main array will be shuffled such that higher weighted items are more likely to appear first.

function removeDuplicates<T>(array: T[]): T[] {
  const uniqueValues = new Set<T>();
  return array.filter((item) => {
    if (!uniqueValues.has(item)) {
      uniqueValues.add(item);
      return true;
    }

    return false;
  });
}

function duplicateItemsBasedOnWeights<T>(array: T[], weights: number[]): T[] {
  const result: T[] = [];
  for (const [index, element] of array.entries()) {
    for (let position = 0; position < weights[index]; position++) {
      result.push(element);
    }
  }

  return result;
}

export function shuffleWithWeights<T>(array: T[], weights: number[], seed: number): T[] {
  const arrayWithDuplicateValuesBasedOnWeights: T[] = duplicateItemsBasedOnWeights(array, weights);

  const shuffledArrayWithDuplicateValuesBasedOnWeights = shuffleArrayUsingFisherYates(arrayWithDuplicateValuesBasedOnWeights, seed);

  return removeDuplicates(shuffledArrayWithDuplicateValuesBasedOnWeights);
}
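
(For reference, shuffleArrayUsingFisherYates isn't shown above; here's a minimal sketch of the kind of thing I'm using, assuming random(seed) returns a uniform number in [0, 1) and that deriving a per-step seed like seed + i is acceptable.)

function shuffleArrayUsingFisherYates<T>(array: T[], seed: number): T[] {
  const result = [...array];
  for (let i = result.length - 1; i > 0; i--) {
    // Swap a random element from positions 0..i into position i. Deriving a
    // per-step seed (seed + i) is an assumption about how random() is used.
    const j = Math.floor(random(seed + i) * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}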

I've looked at empirical results by calling it a bunch of different times with these values (and a different seed each time), and the results don't seem distributed how I'd hoped, so I must have been approaching this problem incorrectly.

const items = [1, 2, 3, 4, 5];
const weights = [1, 1, 1, 200, 1_000];

In my real-world cases, I'll be shuffling 70,000 objects (which would explode to many more than that if I use my current approach of creating duplicate items based on item weight).


There are 3 answers

jcalz (accepted answer)

I will assume that the objects in your arrays will have a numeric weight property which you can use to determine the weight, and a value property to hold the data you care about. So the array is of type Array<{value: unknown, weight: number}>. I am also just going to use Math.random() to generate a uniformly chosen random number between 0 (inclusive) and 1 (exclusive). If you have objects in a different format, or a custom random number generator that takes a seed, you can adjust the answer below to accommodate that. I consider these out of scope here, especially since your random(seed) function isn't available for others to use and isn't specified enough for an answer to use it (e.g., is it uniform between 0 and 1 like Math.random()? If you call random() with the same seed twice, do you get two different answers, or does the seed need to evolve too? etc.).
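
That said, if you do want a seedable shuffle, one option is to thread a random: () => number function through the code below instead of calling Math.random() directly, and build it from a small seeded PRNG. Here's an illustrative sketch using mulberry32, a well-known tiny seeded generator (an addition for completeness, not essential to the answer):

// mulberry32: a small seeded PRNG producing uniform numbers in [0, 1).
function mulberry32(seed: number): () => number {
    let a = seed >>> 0;
    return () => {
        a = (a + 0x6D2B79F5) >>> 0;
        let t = a;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

const random = mulberry32(12345); // same seed, same sequence of values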

Also note, the implementation below does not necessarily have optimal time complexity. It is O(n²) because weightedIndexChoice() is O(n) and weightedShuffle() calls it n times. If optimal time complexity is important, there are apparently other solutions which will do it in O(n log n), which is better. The other answer below shows how to do it in Python, and presumably someone can come up with a JS/TS implementation and post it here.
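
In that spirit, here's a sketch of what that O(n log n) version could look like in TypeScript, based on the Efraimidis–Spirakis paper linked in the other answer (treat it as untested): each item draws a key u^(1/weight) with u uniform in [0, 1), and sorting by key in descending order produces the weighted shuffle.

// Efraimidis–Spirakis style weighted shuffle, O(n log n) due to the sort.
// Heavier items get keys closer to 1 on average, so they tend to sort first.
function weightedShuffleES<T>(arr: { value: T, weight: number }[]) {
    return arr
        .map(item => ({ item, key: Math.pow(Math.random(), 1 / item.weight) }))
        .sort((a, b) => b.key - a.key)
        .map(({ item }) => item);
}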


The Fisher-Yates shuffle is basically just building up a new array by randomly picking (and removing) elements from the first array, and pushing them onto the new array. There are various ways to implement that. The following does it by walking from the start to the end of the array and swapping a random element from later in the array to the current position:

function weightedShuffle(arr: { value: unknown, weight: number }[]) {
    for (let i = 0; i < arr.length; i++) {
        // Choose a weighted random index among the not-yet-placed elements
        // (positions i..end), then swap it into position i.
        const v = weightedIndexChoice(arr.slice(i));
        [arr[i + v], arr[i]] = [arr[i], arr[i + v]];
    }
}

The important part of the above for your question is weightedIndexChoice(), which needs to randomly select an index of an array, weighted by the weight. Note that since you want more heavily weighted elements to be more likely to appear at the start of the array, we need to put the first randomly selected element at the start of the array. Some implementations of Fisher-Yates work from the end of the array, and for uniformly random selections it doesn't matter. But if we did that without changing the weights, it would end up putting more heavily weighted elements at the end, which isn't what you want.

There are definitely existing Stack Overflow questions/answers covering how to implement weightedIndexChoice(). For example, How to choose a weighted random array element in Javascript?. Here's one way:

function weightedIndexChoice(arr: { value: unknown, weight: number }[]): number {
    // Draw a uniform number in [0, totalWeight), then walk the cumulative
    // sums of the weights until we pass it; that index is the choice.
    const totalWeight = arr.map(v => v.weight).reduce((x, y) => x + y);
    const val = Math.random() * totalWeight;
    for (let i = 0, cur = 0; ; i++) {
        cur += arr[i].weight;
        if (val <= cur) return i;
    }
}

Essentially you choose a random number uniformly between 0 and the total of the weights. Then you figure out which element index corresponds to that number by taking the cumulative sum of the weights of the elements until you pass the random number. As a simple example, let's imagine you have three elements: [{value: "a", weight: 1}, {value: "b", weight: 2}, {value: "c", weight: 3}]. The total weight is 6. So you pick a random number between 0 (inclusive) and 6 (exclusive). The cumulative sums of the weights are 1 for "a", 1+2=3 for "b", and 1+2+3=6 for "c". So if your random number is between 0 and 1 you choose "a", if it's between 1 and 3 you choose "b", and if it's between 3 and 6 you choose "c". You can see that the chance of choosing each element is proportional to its weight.
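
To make that concrete (a tiny illustrative snippet using weightedIndexChoice() from above; output varies from run to run):

const abc = [
    { value: "a", weight: 1 },
    { value: "b", weight: 2 },
    { value: "c", weight: 3 },
];
// Cumulative thresholds: 1 for "a", 3 for "b", 6 for "c". A uniform draw in
// [0, 6) lands in [0, 1) for "a", [1, 3) for "b", and [3, 6) for "c".
console.log(abc[weightedIndexChoice(abc)].value); // "c" roughly half the time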


I'm not sure the best way to test this, but starting with your example

const items = [1, 2, 3, 4, 5];
const weights = [1, 1, 1, 200, 1_000];

we can build an array of the form accepted above:

const arr = items.map((value, i) => ({ value, weight: weights[i] }));

run the shuffle a bunch of times and keep track of the results:

const results: number[][] = [];
const numTrials = 100_000;
for (let i = 0; i < numTrials; i++) {
    weightedShuffle(arr);
    results.push(arr.slice().map(v => v.value))
}

and then... well, the easiest thing to check is the relative weighting of the first element of the array for each result, since that should be exactly proportional to your weights:

const firstPos: Record<number, number> = {};
items.forEach(v => firstPos[v] = 0);
results.forEach(vals => firstPos[vals[0]] = (firstPos[vals[0]] ?? 0) + 1);
const totalWeight = weights.reduce((x, y) => x + y);

// this is the weighted occurrence of the first element of the shuffled array
console.log(Object.entries(firstPos).map(([k, v]) => [k, v * totalWeight / numTrials]));
// [["1", 0.93834], ["2", 0.98646], ["3", 1.02255], ["4", 199.20477], ["5", 1000.84788]] 

The actual logged results will depend on the random numbers chosen, but this is promising.

After this you could start checking the second element for each result conditional on the fact that the first element is not available, and show that the results are as expected. But frankly all we're doing is reverse engineering the Fisher-Yates shuffle and making sure the weighted index choice is in keeping with our expectations. Not sure it's worth doing.

Playground link to code

Severin Pappadeux

Weighted Random Sampling (Efraimidis & Spirakis, 2005)

Link to paper: https://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

UPDATE

I don't have JavaScript skills, but the paper above provides an OPTIMAL algorithm for such a shuffle. What is accepted as an answer is O(n²), which will show up on large data.

Efraimidis & Spirakis is O(n log n): basically sorting complexity.

People, READ PAPERS and don't invent wheels.

WRS algorithm, Python 3.10, Windows x64

import numpy as np

items = np.array([1, 2, 3, 4, 5])
freqs = np.array([1., 1., 1., 200., 1000.0])

# Normalize the weights (not strictly required for the argmax test below,
# but it makes the printed expectations directly comparable).
wghts = freqs / np.sum(freqs)
print(wghts)

rng = np.random.default_rng(1357907531)

counter = np.zeros(len(items))

N = 1000000

for k in range(0, N):
    u01 = rng.random(len(items))
    ki = np.power(u01, 1.0 / wghts)  # E&S keys: u^(1/w)
    q = np.argsort(ki)               # ascending, so q[-1] has the largest key

    counter[q[-1]] += 1              # the item with the largest key comes first

print(counter / np.sum(counter))

and it prints for normalized weights

[0.00083126 0.00083126 0.00083126 0.16625104 0.8312552]

and for sampling tests

[8.36000e-04 8.19000e-04 8.55000e-04 1.65964e-01 8.31526e-01]

Beautiful algorithm, applicable to streaming, reservoir-style sampling, m-out-of-n sampling, you name it.
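
Since the question is TypeScript, here is an illustrative sketch of the reservoir-style variant (A-Res from the paper; an untested port added for the JS/TS-minded): keep the m items with the largest keys u^(1/w) while streaming.

// A-Res sketch: sample m items from a weighted stream in one pass by
// keeping the m largest keys u^(1/weight). A min-heap would make each
// replacement O(log m); a linear scan keeps this sketch short.
function weightedReservoirSample<T>(
    stream: Iterable<{ value: T, weight: number }>,
    m: number,
): T[] {
    const reservoir: { value: T, key: number }[] = [];
    for (const { value, weight } of stream) {
        const key = Math.pow(Math.random(), 1 / weight);
        if (reservoir.length < m) {
            reservoir.push({ value, key });
        } else {
            // Find the current minimum key and replace it if beaten.
            let minIdx = 0;
            for (let i = 1; i < m; i++) {
                if (reservoir[i].key < reservoir[minIdx].key) minIdx = i;
            }
            if (key > reservoir[minIdx].key) reservoir[minIdx] = { value, key };
        }
    }
    return reservoir.sort((a, b) => b.key - a.key).map(r => r.value);
}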

UPDATE II

Here is another link, to WRS implementations in Java (several variations of it), done by Efraimidis.

https://utopia.duth.gr/~pefraimi/projects/WRS/

Quick complexity analysis

There is one pass which generates n random numbers, costing n*T_RNG: O(n).

There is another pass where we compute u01 to the power of the inverse weight, the power being done via log and exp calls, costing n*(T_log + T_exp): O(n).

If the loops are done manually, rather than through a vectorized library like numpy, these two passes could be combined.

The last step is sorting, which is O(n log n); for large n this term dominates, making the overall WRS algorithm complexity O(n log n).

Chris

I'm a little late to the party, but I wrote this a while ago (C# implementation): https://github.com/cdanek/KaimiraWeightedList

It seems as if there's a continued need for this in JavaScript and Python, but I haven't really had time to port mine over. I'll try to do that soon.

This should get you started, though (on an O(1) time and space-complexity algorithm).