Zipf's Law in Java for text generation - too slow

2.5k views Asked by At

Hey there I'm working on a textgenerator, that should generate millions of different texts. to make the content of each text realistic I used Zipf's Law It is working well, the word distribution is correct.

But the following next() function is executing very slow and since I want to generate millions of articles it has to be changed. (The while loop is the slow part)

Can someone help me with this?

I implemented it like this:

   public int next() {

    int rank;
    double frequency = 0;
    double dice;

    rank = rnd.nextInt(size);
    frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
    dice = rnd.nextDouble();


    while (!(dice < frequency) || (rank == 0)) {
        rank = rnd.nextInt(size);
        frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        dice = rnd.nextDouble();
    }

    return rank;
}

EDIT: I obtainded the code from : http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/

3

There are 3 answers

7
Marco13 On BEST ANSWER

The implementation that you copied ... has some issues. One could probably say that it is plainly wrong, because it is using random values, and when in a computation like

rank = rnd.nextInt(size);
friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;

the rank value is 0, then the frequency is Infinity, and messes up some of the statistics.

I tried to correct these erros, but have not analyzed the implementation and have not compared it to the definition of the Zipf distribution function. So if somebody copies my code, he might find out that it still "...has some issues".


The implementation of the next function is, strictly speaking, not "total correct", in the sense that it does not necessarily terminate. There's nothing preventing the loop from running forever. Depending on the parameters, it may be more or less likely that it will take a while until it terminates. And I think that's also one of the main reasons for your "performance" issue: For some values, the condition (dice < frequency) is just very unlikely to happen....


Regardless of that, the goal that you want to achieve can be formulated more generically: You have a certain distribution of probabilities. And you want a "random" function that returns random values based on this distribution.

One simple and generic way to achieve this is to map the (cumulated) probability distribution to the target values with a NavigableMap. This map can then be used to quickly look up the target value, given a random value between 0.0 and 1.0 that is supplied by a java.util.Random instance.

There may be more efficient solutions for particular cases, but again: This is very generic and simple (and still, reasonably efficient).


I implemented this here for the Zipf distribution. Again, I did not verify everything in detail, and there are some +1/-1 oddities (mentioned in the first paragraph), but it should show the idea: The FastZipfGenerator fills the map containing the probability distribution, and in the next() function, just performs a lookup:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

public class ZipfGeneratorTest
{
    public static void main(String[] args) {

        int size = 10;
        double skew = 2.0;

        ZipfGenerator z0 = new ZipfGenerator(size, skew);
        FastZipfGenerator z1 = new FastZipfGenerator(size, skew);

        long before = 0;
        long after = 0;

        int n = 5000000;

        before = System.nanoTime();
        Map<Integer, Integer> counts0 = computeCounts(z0, size, n);
        after = System.nanoTime();
        System.out.println(counts0+", duration "+(after-before)/1e6);

        before = System.nanoTime();
        Map<Integer, Integer> counts1 = computeCounts(z1, size, n);
        after = System.nanoTime();
        System.out.println(counts1+", duration "+(after-before)/1e6);
    }

    private static Map<Integer, Integer> computeCounts(
        ZipfGenerator z, int size, int n)
    {
        Map<Integer, Integer> counts = new LinkedHashMap<Integer, Integer>();
        for (int i=1; i<=size; i++)
        {
            counts.put(i, 0);
        }
        for (int i=1; i<=n; i++)
        {
            int k = z.next();
            counts.put(k, counts.get(k)+1);
        }
        return counts;
    }

    private static Map<Integer, Integer> computeCounts(
        FastZipfGenerator z, int size, int n)
    {
        Map<Integer, Integer> counts = new LinkedHashMap<Integer, Integer>();
        for (int i=1; i<=size; i++)
        {
            counts.put(i, 0);
        }
        for (int i=1; i<=n; i++)
        {
            int k = z.next();
            counts.put(k, counts.get(k)+1);
        }
        return counts;
    }

}

// Based on http://diveintodata.org/tag/zipf/
class ZipfGenerator {
    private Random rnd = new Random(0);
    private int size;
    private double skew;
    private double bottom = 0;

    public ZipfGenerator(int size, double skew) {
        this.size = size;
        this.skew = skew;

        for(int i=1;i <=size; i++) {
            this.bottom += (1/Math.pow(i, this.skew));
        }
    }

    // the next() method returns an random rank id.
    // The frequency of returned rank ids are follows Zipf distribution.
    public int next() {
        int rank;
        double friquency = 0;
        double dice;

        rank = rnd.nextInt(size)+1;
        friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        dice = rnd.nextDouble();

        while(!(dice < friquency)) {
            rank = rnd.nextInt(size)+1;
            friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
            dice = rnd.nextDouble();
        }

        return rank;
    }


    // This method returns a probability that the given rank occurs.
    public double getProbability(int rank) {
        return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
    }
}



class FastZipfGenerator
{
    private Random random = new Random(0);
    private NavigableMap<Double, Integer> map;

    FastZipfGenerator(int size, double skew)
    {
        map = computeMap(size, skew);
    }

    private static NavigableMap<Double, Integer> computeMap(
        int size, double skew)
    {
        NavigableMap<Double, Integer> map = 
            new TreeMap<Double, Integer>();

        double div = 0;
        for (int i = 1; i <= size; i++)
        {
            div += (1 / Math.pow(i, skew));
        }

        double sum = 0;
        for(int i=1; i<=size; i++)
        {
            double p = (1.0d / Math.pow(i, skew)) / div;
            sum += p;
            map.put(sum,  i-1);
        }
        return map;
    }

    public int next()
    {
        double value = random.nextDouble();
        return map.ceilingEntry(value).getValue()+1;
    }

}

It prints a random sample result (basically, a "histogram"), and some timing results. The timing results are something like

duration 6221.835052
duration 304.761282

showing that it will most likely be faster (even though this should not be considered as a "benchmark"...)

0
maaartinus On

You were asking about speed, so I'm presenting a minor optimization. First of all, get rid of the repetitive stuff to see what it's all about:

public int next() {
    while (true) {
        int rank = rnd.nextInt(size);
        if (rank == 0) return return rank;
        double frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
        double dice = rnd.nextDouble();
        if (dice < frequency) return rank;
    }
}

So far it should work exactly the same (unless I overlooked something). I moved the test of rank upwards as the following computation is useless in case it's zero. Now there's a single line which we can speed up a bit like

double frequency = Math.pow(rank, -this.skew) * inverseBottom;

Actually, this may slightly change the result due to round-off errors, but I doubt you should care. If rank was constant, you could turn pow into exp to make it faster, but it's not. For a small size, you could precompute a table of ln(rank) and use it like

double frequency = Math.exp(ln[rank] * -this.skew) * inverseBottom;

A better algorithm could surely give you more than this low-level optimizations.

0
user2959347 On

The source you obtained from https://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/ has some bugs.

Here are quick fixes. (1) In the constructor ZipfGeneator(int,double), make sure to compute up to size using equal sign.

public ZipfGenerator(int size, double skew) {
  this.size = size;
  this.skew = skew;

  for(int i=1;i <= size; i++) {
  this.bottom += (1/Math.pow(i, this.skew));
  }
 }

(2) replace

rank = rnd.nextInt(size); 

with

rank = rnd.nextInt(size)+1; 

Here is the complete source code.

import java.util.Random;

//credit: https://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/ [Online; December 2017]

public class ZipfGenerator {
 private Random rnd = new Random(System.currentTimeMillis());
 private int size;
 private double skew;
 private double bottom = 0;

 public ZipfGenerator(int size, double skew) {
  this.size = size;
  this.skew = skew;

  for(int i=1;i <= size; i++) {
  this.bottom += (1/Math.pow(i, this.skew));
  }
 }

 // the next() method returns an random rank id.
 // The frequency of returned rank ids are follows Zipf distribution.
 public int next() {
   int rank;
   double friquency = 0;
   double dice;

   rank = rnd.nextInt(size)+1;
   friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
   dice = rnd.nextDouble();

   while(!(dice < friquency)) {
     rank = rnd.nextInt(size)+1;
     friquency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
     dice = rnd.nextDouble();
   }

   return rank;
 }

 // This method returns a probability that the given rank occurs.
 public double getProbability(int rank) {
   return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
 }

 public static void main(String[] args) {
   if(args.length != 2) {
     System.out.println("usage: ./zipf size skew");
     System.exit(-1);
   }

   ZipfGenerator zipf = new ZipfGenerator(Integer.valueOf(args[0]),
   Double.valueOf(args[1]));
   for(int i= 1;i <= 10; i++) {
     System.out.println(i+" "+zipf.getProbability(i));
   }
   //use size = 10 and skew = 2 for testing below
   int hist [] = new int [12];
   for(int i=0;i<12;i++) {
       hist[i] = 0;
   }
   System.out.println("Testing the probability distribution:");
   int sum = 0;
    for(int i= 1;i <= 1000000; i++) {
        hist[zipf.next()]++; 
   }
   for(int i=0;i<12;i++)
     System.out.println(i+" "+hist[i]/1000000.0);
    }

}

Result:

Probability distribution from the formula:
1 0.6452579827864142
2 0.16131449569660355
3 0.07169533142071269
4 0.04032862392415089
5 0.02581031931145657
6 0.017923832855178172
7 0.013168530260947229
8 0.010082155981037722
9 0.007966147935634743
10 0.006452579827864143
Testing the probability distribution from sampling:
0 0.0
1 0.645813
2 0.160766
3 0.071527
4 0.040346
5 0.026039
6 0.01801
7 0.013215
8 0.009953
9 0.007863
10 0.006468
11 0.0

Note, 0 and 11 has 0 probability as expected.