Get sparse region of KDE


I have an array of 20k real numbers, and I use pd.DataFrame(scores).plot.kde(figsize=(24,8)) to get the below kernel density estimation. How can I purely programmatically select the indexes of the sparse regions, or conversely the dense region?

My current approach is of the form np.where(scores > np.percentile(scores, 99))[0], but I am wary of hard-coding 99, as it may not work well in production. A potential solution, which I'm not sure how to approach, is to select the indices where the density is below 20,000.

image

JohanC On BEST ANSWER

Which region to consider "sparse" and which "dense" can be very subjective. It also depends heavily on what the data represent. One idea is to decide on some cut-off percentiles. The example below uses the 0.1th and 99.9th percentiles, so the lowest 0.1 % and the highest 0.1 % of the values are treated as the sparse tails.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': np.random.randn(2000, 10).cumsum(axis=0).ravel()})
low, high = df['score'].quantile([.001, .999])  # cut-off values for the sparse tails
ax = df.plot.kde(figsize=(24, 8))
ax.axvline(low, color='crimson', ls=':')  # everything left of this line is "sparse"
ax.axvline(high, color='crimson', ls=':')  # everything right of this line is "sparse"
ax.set_ylim(ymin=0) # avoid the kde "floating in the air"
plt.show()

example plot
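
To get actual indices rather than just lines on the plot, something along the following lines might work. It is only a sketch under a few assumptions: the helper names percentile_tail_indices and low_density_indices and the rel_threshold parameter are made up for illustration, and the density variant uses scipy.stats.gaussian_kde (the estimator that pandas' plot.kde is documented to rely on) with a cut-off expressed as a fraction of the peak density. That fraction is still a choice you have to make, just like the percentile.

```python
import numpy as np
from scipy.stats import gaussian_kde


def percentile_tail_indices(scores, lower_pct=0.1, upper_pct=99.9):
    """Indices of values outside the [lower_pct, upper_pct] percentile range."""
    scores = np.asarray(scores, dtype=float)
    low, high = np.percentile(scores, [lower_pct, upper_pct])
    return np.where((scores < low) | (scores > high))[0]


def low_density_indices(scores, rel_threshold=0.05, grid_size=512):
    """Indices of values whose estimated density is below
    rel_threshold times the peak density (hypothetical criterion)."""
    scores = np.asarray(scores, dtype=float)
    kde = gaussian_kde(scores)  # Scott's rule bandwidth by default
    # Evaluate the KDE on a grid and interpolate, which is cheaper than
    # evaluating it at every one of the data points.
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    grid_density = kde(grid)
    density = np.interp(scores, grid, grid_density)
    return np.where(density < rel_threshold * grid_density.max())[0]


# Same kind of synthetic data as in the example above (seeded for repeatability)
rng = np.random.default_rng(0)
scores = rng.standard_normal((2000, 10)).cumsum(axis=0).ravel()

sparse_by_percentile = percentile_tail_indices(scores)
sparse_by_density = low_density_indices(scores)
# The dense region is simply everything that was not flagged as sparse
dense_by_density = np.setdiff1d(np.arange(scores.size), sparse_by_density)
```

The percentile version always flags a fixed share of the points, whether or not the tails are really thin. The density version flags only points where the estimated curve is low, so it can also pick up a sparse gap between two dense clusters, at the cost of depending on the KDE bandwidth.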