Random distribution to fit an average in Python


I have a CSV of about 400,000 "scored" rows where I've refit the score to be a linear distribution from 1 to 10, rounded to 5 decimals. (So from the top, the row[0] column is 10, 9.99997, 9.99995, etc.)

I want to create a script to pull X rows of average score Y from the list.

My expectation is something like a bell curve. It may be awkward/impossible at low or high values of X and/or Y, but if I pull 10,000 rows of average score 7, there should be a "few" at very low scores, and enough scores to smooth out a distribution.

My first thought was to load the values of row[0] into a list of numbers and build a second list by picking numbers one at a time, nudging the running average toward a goal of 7. I would then go back through the CSV and write out every row whose row[0] is in that output list. But my guesswork stepwise math is probably very inefficient, and I don't know which libraries could help me.

Input looks like:

Score     Name
10.0      foo
9.99997   bar
9.99995   stuff
9.99992   thing
9.9999    other

etc.

I want to be able to input a large variable X and a score Y and output a CSV of X rows from the input file such that their average is Y. Non-trivially, of course (otherwise, I could just get the X/2 rows on either side of the goal score from the input file!) - a wider distribution would be preferred.

Ideally, I would find a solution that allows for asymmetric distributions. For example, if I wanted 100 numbers averaging to 9.0, I would expect about twenty numbers above 9.0 to counter a 1.0. Surely that could get messy, but I would also expect a 1.0 to be 1/20th as likely.

There is 1 answer

Kyle G (best answer)

I found scipy.stats.truncnorm, which seems to fit the bill. It is parameterized in terms of a standard normal curve, so I wrote a small wrapper to convert it to an arbitrary mean and standard deviation, and it works quite well.

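from scipy.stats import truncnorm

def my_norm(start, end, mean=0, sdev=1, size=None):
    # truncnorm takes its bounds in standard-deviation units relative to the mean
    a = (start - mean)/sdev
    b = (end - mean)/sdev
    rv = truncnorm(a, b)
    # rescale the standard-normal draws back to the requested mean and sdev
    return rv.rvs(size)*sdev + mean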

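Play around with the standard deviation (sdev) a bit. One third of the distance from Y to the closest edge seems like a safe bet (i.e. if Y == 8 then sdev = (10 - 8)/3): the nearer bound then sits three standard deviations out, so the truncation cuts off very little and the sample mean stays close to Y.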
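
To get from sampled numbers back to actual CSV rows, one possibility is to treat each draw as a target score and take the closest row that has not been used yet. The sketch below is only an illustration under stated assumptions: `sample_truncated_normal` and `pick_rows` are hypothetical helper names, the rejection loop is a dependency-free stand-in for `my_norm` above (so the example runs without SciPy), and the `scores` list is synthetic data standing in for the Score column. The resulting average lands near Y, not exactly on it.

```python
import bisect
import random


def sample_truncated_normal(low, high, mean, sdev, size, rng=random):
    """Stand-in for my_norm: normal(mean, sdev) draws kept only if in [low, high]."""
    out = []
    while len(out) < size:
        x = rng.gauss(mean, sdev)
        if low <= x <= high:
            out.append(x)
    return out


def pick_rows(scores, targets):
    """Return one row index per target: the closest score not already taken."""
    if len(targets) > len(scores):
        raise ValueError("cannot pick more rows than the input contains")
    # Sort ascending while remembering each score's original row index.
    pool = sorted((s, i) for i, s in enumerate(scores))
    keys = [s for s, _ in pool]
    chosen = []
    for t in targets:
        pos = bisect.bisect_left(keys, t)
        # Step back one slot when the left neighbour is closer (or t is past the end).
        if pos == len(keys) or (pos > 0 and t - keys[pos - 1] <= keys[pos] - t):
            pos -= 1
        chosen.append(pool.pop(pos)[1])
        keys.pop(pos)
    return chosen


rng = random.Random(0)
# Synthetic stand-in for the Score column: evenly spaced from 10 down to 1.
# (Smaller than the real 400,000 rows so the demo stays quick.)
scores = [10 - 9 * k / 39_999 for k in range(40_000)]

y = 7.0                            # desired average (Y)
x = 2_000                          # number of rows to pull (X)
sdev = min(y - 1, 10 - y) / 3      # one third of the distance to the closest edge

targets = sample_truncated_normal(1, 10, y, sdev, x, rng)
rows = pick_rows(scores, targets)
avg = sum(scores[i] for i in rows) / len(rows)
print(f"picked {len(rows)} rows, average score {avg:.3f}")
```

The indices in `rows` refer to positions in the original list, so they can be used to write the matching CSV rows out. Popping from a plain list is linear in the pool size, which is tolerable at this scale; a larger job might want a different structure for the pool.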