Suppose I had a pandas series of dollar values and wanted to discretize into 9 groups using qcut
. The # of observations is not divisible by 9. SQL Server's ntile
function has a standard approach for this case: it makes the first n out of 9 groups 1 observation larger than the remaining (9-n) groups.
I noticed in pandas that the assignment of which groups had x observations vs x + 1 observations seemed random. I tried to decipher the code in algos to figure out how the quantile function deals with this issue but could not figure it out.
I have three related questions:
- Any pandas developers out there than can explain
qcut
's behavior? Is it random which groups get the larger number of observations? - Is there a way to force
qcut
to behave similarly toNTILE
(i.e., first groups get x + 1 observations)? - If the answer to #2 is no, any ideas on a function that would behave like
NTILE
? (If this is a complicated endeavor, just an outline to your approach would be helpful.)
Here is an example of SQL Server's NTILE
output.
Bin |# Observations
1 26
2 26
3 26
4 26
5 26
6 26
7 26
8 25
9 25
Here is pandas:
Bin |# Observations
1 26
2 26
3 26
4 25 (Why is this 25 vs others?)
5 26
6 26
7 25 (Why is this 25 vs others?)
8 26
9 26
The
qcut
behaves like this because it's more accurate. Here is an example:for the ith level, it starts at quantile (i-1)*10%:
the output is:
If you want NTILE, you can calculate the quantiles yourself:
the output is: