I have a CSV of about 400,000 "scored" rows where I've refit the scores to be evenly spaced from 1 to 10, rounded to 5 decimals. (So, from the top, the row[0] column reads 10.0, 9.99997, 9.99995, etc.)
I want to create a script to pull X rows of average score Y from the list.
My expectation is something like a bell curve. It may be awkward or impossible at extreme values of X and/or Y, but if I pull 10,000 rows with an average score of 7, there should be a few rows at very low scores and enough rows overall to smooth out the distribution.
My first thought was to load the row[0] values into a list of numbers, nudge a running approximation number by number toward a goal of 7 while filling a second list, then go back through the CSV and write out each row whose row[0] is in that output list. But my guesswork stepwise math is probably very inefficient, and I don't know which libraries could help.
Input looks like:
Score Name
10.0 foo
9.99997 bar
9.99995 stuff
9.99992 thing
9.9999 other
etc.
I want to be able to input a large number X and a score Y and output a CSV of X rows from the input file such that their average is Y. Non-trivially, of course (otherwise I could just take the X/2 rows on either side of the goal score!); a wider distribution would be preferred.
Ideally, I would find a solution that allows for asymmetric distributions. For example, if I wanted 100 numbers averaging 9.0, I would expect roughly twenty numbers above 9.0 to counter a single 1.0 (the 1.0 sits 8.0 below the mean, while anything above 9.0 contributes at most 1.0 above it). Surely that could get messy, but I would also expect a 1.0 to be about 1/20th as likely to be drawn.
Found `scipy.stats.truncnorm`, which seems like it would fit the bill. I wrote a small wrapper to convert it from a standard normal curve, and it works quite well. Play around with the standard deviation (`sdev`) a bit; one third of the distance from the closest edge seems like a safe bet (i.e., if Y == 8 then `sdev = (10 - 8) / 3`).
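Below is a minimal sketch of that wrapper plus the CSV step, assuming a comma-delimited input with a header row and the score in the first column (swap the delimiter if your file is space- or tab-separated like the sample above). The names `sample_scores`, `pull_rows`, and the file paths are made up for illustration; the one real conversion is that `truncnorm` takes its truncation points in standard-normal units, so they have to be rescaled by the mean and standard deviation.

```python
import csv

import numpy as np
from scipy.stats import truncnorm

LOW, HIGH = 1.0, 10.0  # range of the refit scores


def sample_scores(n, mean, low=LOW, high=HIGH):
    """Draw n values from a normal curve centred on `mean`,
    truncated to [low, high]."""
    # One third of the distance to the nearest edge (the rule of
    # thumb above); degenerate if `mean` sits exactly on an edge.
    sdev = min(mean - low, high - mean) / 3.0
    # truncnorm takes its clip points in standard-normal units,
    # so convert them with the usual (x - loc) / scale rescaling.
    a = (low - mean) / sdev
    b = (high - mean) / sdev
    return truncnorm.rvs(a, b, loc=mean, scale=sdev, size=n)


def pull_rows(in_path, out_path, n, mean):
    """Write ~n rows from in_path whose scores track the sample."""
    with open(in_path, newline="") as f:
        reader = csv.reader(f)     # adjust delimiter= if not comma-separated
        header = next(reader)
        rows = list(reader)

    scores = np.array([float(r[0]) for r in rows])
    order = np.argsort(scores)     # ascending, as searchsorted requires

    targets = sample_scores(n, mean)
    # Nearest-row lookup: with ~400k rows spaced ~2e-5 apart, taking
    # the insertion point rather than the true nearest neighbour
    # shifts each picked score by a negligible amount.
    idx = np.searchsorted(scores[order], targets)
    idx = np.clip(idx, 0, len(rows) - 1)
    picked = np.unique(order[idx])  # drop the occasional duplicate hit

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows[i] for i in picked)
```

For example, `pull_rows("scores.csv", "sample.csv", 10_000, 7.0)` would pull roughly 10,000 rows centred on 7. Two caveats: the dedupe step can return slightly fewer than X rows, and because truncation clips one tail harder than the other, the realized average drifts slightly off Y. With the three-sigma `sdev` the drift is tiny, and checking the mean of the sampled targets before writing anything out is an easy sanity check.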