How to predict the uncertainty with predict_marginals of scikit-learn


My aim is to identify named entities using active learning. To do this, I would like to obtain some sort of uncertainty score for each document in which I want to identify entities.

I have a text file containing several documents in the following format: one sentence, one blank line, several sentences (= document 1), one sentence, one blank line, several sentences (= document 2), and so on. I'd like to calculate an average uncertainty score for each document to see which document I should choose first for my training (= active learning).

To do this I use "predict_marginals" on my CRF model, which returns several probabilities per token, one for each class. I'd like to assign an average score to each of my documents in order to see which documents have the lowest probabilities and are therefore the most uncertain, but I don't really know how to do this since I always have several probabilities per token. Do you have any advice or ideas? At the moment I compute 1 - the maximum probability for each token (least-confidence uncertainty formula) and then average these values over each sentence. But 1) I'm not sure I should do it that way, and 2) it works at sentence level and not at document level ...

def classifier_uncertainty(self, x):
    """
    Returns an array of uncertainties, one per sentence: 1 - max(proba) for each
    token (least-confidence formula), then the mean over the tokens of the sentence.
    """
    predictions = self.predict_proba(x)  # a list of sentences where each token is represented by several probabilities (one per label)
    resultat = 1 - predictions.max(axis=-1)
    uncertainties = np.mean(resultat, axis=1)
    return uncertainties
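
For reference, here is roughly what I get when I call predict_marginals directly (assuming the model is sklearn-crfsuite's CRF; X_doc is a hypothetical list of per-token feature dicts for one document): each token comes back as a dict of label probabilities, and I take 1 - max of those values.

    marginals = crf.predict_marginals([X_doc])   # one list of token dicts per input sequence
    first_token = marginals[0][0]                # e.g. {"O": 0.91, "B-PER": 0.05, "I-PER": 0.04}
    token_uncertainty = 1 - max(first_token.values())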

1 Answer

naved196

Try this

import numpy as np

def calc_avg_uncert(marginal_probabilities):
    """Average per-token entropy for each document."""
    avg_uncert = []
    for doc_marginals in marginal_probabilities:
        # predict_marginals gives a dict {label: probability} for every token,
        # so turn each dict into an array before computing the entropy
        probs = np.array([list(token.values()) for token in doc_marginals])
        entropy = -np.sum(probs * np.log(probs), axis=1)
        avg_uncert.append(np.mean(entropy))
    return avg_uncert

with open("text_file.txt", "r") as f:
    txt = f.read()

documents = txt.split("\n\n")

# Preprocess documents here (tokenisation + feature extraction for the CRF)
marginal_probabilities = crf_model.predict_marginals(processed_documents)

avg_uncert = calc_avg_uncert(marginal_probabilities)

document_with_highest_uncertainty = documents[np.argmax(avg_uncert)]
print("Document with highest uncertainty:")
print(document_with_highest_uncertainty)

This program identifies the document with the highest average uncertainty (here measured as the mean per-token entropy), which you can then pick first for your active learning.
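
If you prefer to keep your original 1 - max(proba) (least-confidence) score instead of entropy, the same per-document loop works. A minimal sketch, assuming the same marginal_probabilities structure as above:

    def calc_avg_least_confidence(marginal_probabilities):
        # 1 - max(probability) for each token, averaged over the tokens of a document
        scores = []
        for doc_marginals in marginal_probabilities:
            max_probs = np.array([max(token.values()) for token in doc_marginals])
            scores.append(np.mean(1.0 - max_probs))
        return scores

Entropy uses the full label distribution, while least confidence only looks at the most probable label; both are common choices for the selection step in active learning.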