I have an imbalanced multi-class banking text dataset with around 76 classes. The classes are badly distributed: one class, which is a combination of 240 other categories, represents about 50% of the data, and the remaining 75 classes are highly imbalanced. I have tried balancing techniques such as model class weights, random undersampling, and random oversampling, but none of them got me a weighted accuracy above 60%. For reference, the class label proportions are:
- 0.48 (combined class)
- 0.08
- 0.06
- 0.004
and so on, in a similar pattern.
I am going to try a data augmentation approach. Please suggest how to apply it according to how much data each class has, and how to populate the data frame quickly. The code for augmentation is as follows:
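import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
# Character-level augmenter that simulates keyboard typos (not used below)
char_aug = nac.KeyboardAug()
# Word-level augmenter that replaces words with WordNet synonyms
aug = naw.SynonymAug(aug_src='wordnet', lang='eng')
augmented_text = aug.augment(df['text'][0], n=1)  # n = number of augmented texts to generate
print(augmented_text)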
I want to create a new data frame in which augmented text is added for each class as many times as its proportion requires, so that the dataset becomes balanced.
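To make the goal concrete, here is a minimal sketch of one way such a data frame could be built. It is an illustration under assumptions, not a tested solution for this dataset: the function name `balance_by_augmentation`, the column names `text` and `label`, the `target_count` parameter, and the toy augmenter are all hypothetical. The augmenter is passed in as a plain callable so the sketch runs without nlpaug installed.

```python
import random

import pandas as pd


def balance_by_augmentation(df, augment_fn, text_col="text", label_col="label",
                            target_count=None, seed=0):
    """Return a new data frame in which every class with fewer than
    `target_count` rows is topped up with augmented copies of its own texts.

    `augment_fn` is any callable mapping one string to one augmented string
    (hypothetical interface, chosen for this sketch).
    """
    rng = random.Random(seed)
    counts = df[label_col].value_counts()
    if target_count is None:
        # Assumption: aim for the median class size rather than the largest
        # class, so the dominant class does not dictate the target.
        target_count = int(counts.median())

    new_rows = []
    for label, count in counts.items():
        deficit = int(target_count) - int(count)
        if deficit <= 0:
            continue  # class already has enough rows
        texts = df.loc[df[label_col] == label, text_col].tolist()
        for _ in range(deficit):
            source = rng.choice(texts)  # pick a random original to augment
            new_rows.append({text_col: augment_fn(source), label_col: label})

    original = df[[text_col, label_col]]
    if not new_rows:
        return original.reset_index(drop=True)
    augmented = pd.DataFrame(new_rows, columns=[text_col, label_col])
    return pd.concat([original, augmented], ignore_index=True)


# Toy data and a stand-in augmenter, just to show the call pattern.
demo = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": ["x", "x", "x", "x", "y", "z"],
})


def toy_augment(text):
    return text + " (aug)"


balanced = balance_by_augmentation(demo, toy_augment, target_count=3)
print(balanced["label"].value_counts())
```

With nlpaug, `augment_fn` could be something like `lambda t: aug.augment(t, n=1)[0]`, assuming a version in which `augment` returns a list; older versions may return a plain string when `n=1`, so check the return type first. Augmentation of this kind is usually applied to the training split only, so that augmented copies of a text do not leak into validation or test data.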
Please feel free to suggest any other approach that fits this problem.