Hey there I'm working on a textgenerator, that should generate millions of different texts. to make the content of each text realistic I used Zipf's Law It is working well, the word distribution is correct.
But the following next()
function is executing very slow and since I want to generate millions of articles it has to be changed. (The while loop is the slow part)
Can someone help me with this?
I implemented it like this:
public int next() {
int rank;
double frequency = 0;
double dice;
rank = rnd.nextInt(size);
frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
dice = rnd.nextDouble();
while (!(dice < frequency) || (rank == 0)) {
rank = rnd.nextInt(size);
frequency = (1.0d / Math.pow(rank, this.skew)) / this.bottom;
dice = rnd.nextDouble();
}
return rank;
}
EDIT: I obtainded the code from : http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/
The implementation that you copied ... has some issues. One could probably say that it is plainly wrong, because it is using random values, and when in a computation like
the
rank
value is0
, then the frequency isInfinity
, and messes up some of the statistics.I tried to correct these erros, but have not analyzed the implementation and have not compared it to the definition of the Zipf distribution function. So if somebody copies my code, he might find out that it still "...has some issues".
The implementation of the
next
function is, strictly speaking, not "total correct", in the sense that it does not necessarily terminate. There's nothing preventing the loop from running forever. Depending on the parameters, it may be more or less likely that it will take a while until it terminates. And I think that's also one of the main reasons for your "performance" issue: For some values, the condition(dice < frequency)
is just very unlikely to happen....Regardless of that, the goal that you want to achieve can be formulated more generically: You have a certain distribution of probabilities. And you want a "random" function that returns random values based on this distribution.
One simple and generic way to achieve this is to map the (cumulated) probability distribution to the target values with a
NavigableMap
. This map can then be used to quickly look up the target value, given a random value between 0.0 and 1.0 that is supplied by ajava.util.Random
instance.There may be more efficient solutions for particular cases, but again: This is very generic and simple (and still, reasonably efficient).
I implemented this here for the Zipf distribution. Again, I did not verify everything in detail, and there are some
+1
/-1
oddities (mentioned in the first paragraph), but it should show the idea: TheFastZipfGenerator
fills the map containing the probability distribution, and in thenext()
function, just performs a lookup:It prints a random sample result (basically, a "histogram"), and some timing results. The timing results are something like
showing that it will most likely be faster (even though this should not be considered as a "benchmark"...)