I have an array of 20k real numbers, and I use pd.DataFrame(scores).plot.kde(figsize=(24,8))
to get the below kernel density estimation. How can I purely programmatically select the indexes of the sparse regions, or conversely the dense region?
My current approach is of the form np.where(scores > np.percentile(scores, 99))[0], but I am wary of hard-coding 99, as it may not work well in production. A potential solution, which I'm not sure how to approach, is selecting the indices where the density is below 20,000.
Which regions count as "sparse" and which as "dense" can be very subjective, and it depends heavily on what the data means. One idea is to decide on cut-off percentiles. The example below uses the 0.1 % and 99.9 % percentiles as the lower and upper cut-offs.
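A minimal sketch of the percentile cut-off idea, assuming scores is a 1-D NumPy array and that the sparse regions sit in the two tails of the distribution. The synthetic data and the names low_cut, high_cut, sparse_idx and dense_idx are illustrative only; the cut-offs 0.1 and 99.9 are a choice, not a recommendation.

```python
import numpy as np

# Synthetic stand-in for the 20k scores (assumption: the real data is a 1-D float array).
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.0, scale=1.0, size=20_000)

# Values at the 0.1th and 99.9th percentiles serve as the cut-offs.
low_cut, high_cut = np.percentile(scores, [0.1, 99.9])

# Indices of points in the sparse tails (outside the cut-offs).
sparse_idx = np.where((scores < low_cut) | (scores > high_cut))[0]

# Indices of points in the dense middle (between the cut-offs, inclusive).
dense_idx = np.where((scores >= low_cut) & (scores <= high_cut))[0]

print(len(sparse_idx), len(dense_idx))
```

Because both cut-offs are derived from the data, they move with the distribution, but the percentages themselves still have to be chosen to suit the application.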