I am looking at the source code of the InstanceHardnessThreshold
transformer from imbalanced-learn
, here: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/12b2e0d/imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py#L167
I am wondering how exactly the threshold is calculated, and what the rationale behind it is.
After discussing with the maintainers of the imbalanced-learn package, this is what I learned:
The threshold is determined as a percentile of the estimator's predicted probabilities for the majority class:

threshold = np.percentile(probabilities[y == target_class], (1.0 - n_samples / target_stats[target_class]) * 100.0)

where n_samples is the number of samples desired in the final dataset from the majority class, and target_stats[target_class] is the total number of majority-class samples present in the original dataset.
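As a minimal sketch, the percentile-based threshold described above can be reproduced with NumPy alone. The numbers here are illustrative, not library defaults: I assume 100 majority-class samples, a target of 30 after under-sampling, and random probabilities standing in for what a cross-validated estimator would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 100 majority-class samples, keep 30 after resampling.
total_majority = 100   # plays the role of target_stats[target_class]
n_samples = 30         # desired majority-class count in the final dataset

# Stand-in for the estimator's predicted probability of the majority class,
# evaluated on the majority-class samples themselves.
probabilities = rng.uniform(size=total_majority)

# Cut at the (1 - n_samples / total) percentile, so that the n_samples
# observations with the highest class probability lie above the threshold.
threshold = np.percentile(
    probabilities, (1.0 - n_samples / total_majority) * 100.0
)

kept = probabilities >= threshold
print(kept.sum())  # 30: the most "certain" majority samples survive
```

With distinct probabilities this keeps exactly n_samples observations; ties can push the count higher, as discussed below.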
We need to find a probability threshold such that the number of samples above that threshold agrees with the number of samples requested in sampling_strategy. By default, this will be the number of samples in the minority class, unless the user declares otherwise. Instance hardness is the probability of an observation being misclassified; in other words, it is 1 - the probability of its true class.
The idea is that the probabilities given by the estimator reflect the certainty that a sample belongs to its class. Therefore, a percentile of 0.0 means that we select all samples, while a percentile of 1.0 means that we select a single sample (the one with the maximum probability). The threshold thus corresponds to selecting the N samples most confidently assigned to class C according to the estimator, where N is defined by the sampling_strategy parameter (e.g., the expected balancing ratio).
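The two boundary cases above are easy to check directly with a toy probability vector (the values below are made up for illustration): the 0th percentile is the minimum, so every sample clears it, while the 100th percentile is the maximum, so only the single most certain sample survives.

```python
import numpy as np

# Toy predicted probabilities for five samples of one class.
probs = np.array([0.1, 0.4, 0.5, 0.8, 0.9])

# Percentile 0: the threshold is the minimum, so all samples are kept.
kept_all = probs >= np.percentile(probs, 0.0)
print(kept_all.sum())  # 5

# Percentile 100: the threshold is the maximum, so only the single
# most certain sample (probability 0.9) is kept.
kept_one = probs >= np.percentile(probs, 100.0)
print(kept_one.sum())  # 1
```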
This method may return more observations than those requested by the user. This is mentioned in the documentation.