How to get labeled values inside a text with Machine Learning?

My task is to extract important information from energy bills (dates, costs, ...) in various formats (pdf, csv, xlsx, ...).

I'm wondering what the best approach to this task is.

So far, my team has relied on regex matching: a Talend job converts the invoices to text (OCR), and the important values are extracted with one regex per piece of information, each with one or more capture groups. These regexes are organized in files that become difficult to maintain as invoices from new suppliers are imported, since suppliers don't always publish their invoices in the same layout or format. To automate this extraction, I explored machine learning.
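For illustration, a single rule of this kind looks roughly like the sketch below. The pattern and the function name are made up for this example (the actual regex files are not shown in this post); the sample line is taken from the dataset preview further down.

```python
import re

# Hypothetical pattern for one line type; one capture group for the amount.
TOTAL_TTC = re.compile(r"^TOTAL_TTC_POUR_LE_SITE\s+(\d+,\d{2})$")


def extract_total_ttc(line):
    """Return the captured amount as a float, or None if the line does not match."""
    match = TOTAL_TTC.match(line)
    if match is None:
        return None
    # French invoices use a decimal comma, so convert it before parsing.
    return float(match.group(1).replace(",", "."))


print(extract_total_ttc("TOTAL_TTC_POUR_LE_SITE 27,89"))
```

Every new supplier layout needs another pattern like this one, which is where the maintenance cost comes from.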

After trying several algorithms, I built an ensemble of multilabel classifier chains (ClassifierChain) with RandomForest base estimators to classify each invoice line into several categories, depending on the type of value to retrieve from that line.

Here is a preview of the dataset produced by the Talend job. For each line of text there are several categories, each with the position of the value to extract within the line, as given by the corresponding regex.

Below is also part of the code I used to train the model, without the positions.

My data:

    text                                                cpfit_global
    TAXE_COMMUNALE_SUR_LA_CONSO_FINALE_D ELECTRICI...   NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]µNO_INS...
    % ABONNEMENT_DU 12/05/2017 AU 31/07/2017 8,85 ...   NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]µPRIX_P...
    ELECTRICITE_PERIODE_UNIQUE_DU 12/05/2017 AU 14...   NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]µCONSO_...
    TAXE_DEPARTEMENTALE_SUR_LA_CONSO_FINALE_D ELEC...   COUT_TCFE_TDCFE_TOTAL[$5]
    TARIF_BLEU_POUR_CLIENTS_NON_RESIDENTIELS_OPTIO...   PUISS_REDUITE_TOTAL[$1]
    % CONTRIBUTION_TARIFAIRE_D ACHEMINEMENT 8,30 2...   COUT_CTA_ASSIETTE_TOTAL[$1]µCOEFF_CTA_TOTAL[$2...
    CONTRIBUTION_AU_SERVICE_PUBLIC_DE_L ELECTRICIT...   NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]µCONSO_...
    TOTAL_TVA_POUR_LE_SITE 1,53                         COUT_TVA_TOTAL[$1]
    TVA_A 5,50% 25,76 1,41                              TAXE_ASSIETTE_TVA_TR_TOTAL[$1]µCOUT_TVA_TR_TOT...
    TOTAL_TTC_POUR_LE_SITE 27,89                        COUT_TTC_TOTAL[$1]
    PUISSANCE_SOUSCRITE_ACTUELLE KW_OU_KVA : 6,0        PUISS_REDUITE_TOTAL[$1]
    BASE 2327 LE 14/03/2017 2423 LE 15/05/2017          INDEX_DEBUT_BASE[$1]µDATE_INDEX_DEBUT_BASE[$2]...
    TOTAL_HTVA_POUR_LE_SITE 26,36                       COUT_HORS_TVA_TOTAL[$1]
    BASE 7739 LE 13/03/2017 8125 LE 15/05/2017          INDEX_DEBUT_BASE[$1]µDATE_INDEX_DEBUT_BASE[$2]...
    TVA_A 20,00% 2,12 0,42                              TAXE_ASSIETTE_TVA_TN_TOTAL[$1]µCOUT_TVA_TN_TOT...
    TOTAL_TVA_POUR_LE_SITE 2,96                         COUT_TVA_TOTAL[$1]
    TAXE_DEPARTEMENTALE_SUR_LA_CONSO_FINALE_D ELEC...   COUT_TCFE_TDCFE_TOTAL[$5]
    TOTAL_CONSOMMATIONS_FACTUREES 18 KWH                CONSO_ACTIF_TOTAL_TOTAL[$1]
    % ABONNEMENT_DU 12/05/2017 AU 31/07/2017 15,28...   NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]µPRIX_P...
    % CONTRIBUTION_TARIFAIRE_D ACHEMINEMENT 20,90 ...   COUT_CTA_ASSIETTE_TOTAL[$1]µCOEFF_CTA_TOTAL[$2...
    CONTRIBUTION_AU_SERVICE_PUBLIC_DE_L ELECTRICIT...   NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]µCONSO_...
    PUISSANCE_SOUSCRITE_ACTUELLE KW_OU_KVA : 15,0       PUISS_REDUITE_TOTAL[$1]
    TOTAL_HTVA_POUR_LE_SITE 48,38                       COUT_HORS_TVA_TOTAL[$1]
    TARIF_BLEU_POUR_CLIENTS_NON_RESIDENTIELS_OPTIO...   PUISS_REDUITE_TOTAL[$1]
    TVA_A 5,50% 46,26 2,54                              TAXE_ASSIETTE_TVA_TR_TOTAL[$1]µCOUT_TVA_TR_TOT...
    ELECTRICITE_PERIODE_UNIQUE_DU 12/05/2017 AU 14...   NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]µCONSO_...
    TOTAL_TTC_POUR_LE_SITE 51,34                        COUT_TTC_TOTAL[$1]
    TAXE_COMMUNALE_SUR_LA_CONSO_FINALE_D ELECTRICI...   NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]µNO_INS...

Each position is written as a $ followed by a number n, which refers to the n-th value or date in the line.
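To make the target format concrete, here is a minimal sketch that splits a `cpfit_global` string into (category, position) pairs. The helper name `parse_targets` is hypothetical, and the sketch assumes every complete item has the form `LABEL[$n]`; items cut off by the `...` in the preview above are simply skipped.

```python
import re

# One target item: a category name followed by its position in brackets.
TARGET = re.compile(r"(\w+)\[\$(\d+)\]")


def parse_targets(cpfit_global):
    """Split a 'µ'-separated target string into (category, position) pairs."""
    pairs = []
    for item in cpfit_global.split("µ"):
        match = TARGET.fullmatch(item)
        if match:  # ignore malformed or truncated items
            pairs.append((match.group(1), int(match.group(2))))
    return pairs


print(parse_targets("NO_INSERT_TOTAL[$1]µNO_INSERT_TOTAL[$2]"))
print(parse_targets("COUT_TCFE_TDCFE_TOTAL[$5]"))
```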

My code:

    import numpy as np
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.multioutput import ClassifierChain
    from sklearn.pipeline import Pipeline
    import clean_targets, clean_text  # my own helpers (not shown here)

    stemmer = SnowballStemmer("french", ignore_stopwords=True)

    class StemmedCountVectorizer(CountVectorizer):
        """CountVectorizer that stems each token with the French Snowball stemmer."""
        def build_analyzer(self):
            analyzer = super().build_analyzer()
            return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

    frame = pd.read_csv(file_)  # file_ is defined elsewhere
    # Remove the positions from the targets (positions are between brackets)
    frame["cpfit_cleaned"] = clean_targets(frame["cpfit_global"])
    frame["clean_text"] = clean_text(frame["text"])

    X = frame["clean_text"].values
    # One binary column per category; categories are separated by 'µ'
    Y = frame["cpfit_cleaned"].str.get_dummies("µ")

    stemmed_count_vect = StemmedCountVectorizer(stop_words=stopwords.words("french"))
    # Ten chains, each with a different random label order
    model = [
        Pipeline([
            ("vect", stemmed_count_vect),
            ("tfidf", TfidfTransformer(use_idf=True)),
            ("clf-svm", ClassifierChain(
                RandomForestClassifier(n_estimators=10, random_state=42),
                order="random", random_state=i)),
        ])
        for i in range(10)
    ]

    for i, chain in enumerate(model, start=1):
        print("fitting chain {}...".format(i))
        chain.set_params(vect__ngram_range=(2, 2)).fit(X, Y)

And to predict:

    # Average the 0/1 predictions of the chains; keep a label if at least half of them predict it
    y_pred_chains = np.array([chain.predict(X_test) for chain in model])
    y_pred_ensemble = y_pred_chains.mean(axis=0) >= .5

What I still don't know is how to retrieve the position of the values to extract, which depends on the numeric values and dates inside each line.
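To show what "position" means in practice, here is a minimal sketch that lists the dates and numbers of a line in order and returns the n-th one. It assumes dates look like dd/mm/yyyy, numbers use a decimal comma, and every such token counts as one position; whether that counting rule matches the one used by the regexes is an assumption, and `nth_value` is a hypothetical helper.

```python
import re

# A date (dd/mm/yyyy) or a number with an optional decimal comma.
# The date alternative comes first so that a date is not split into numbers.
VALUE = re.compile(r"\d{2}/\d{2}/\d{4}|\d+(?:,\d+)?")


def nth_value(line, n):
    """Return the n-th (1-based) date or number in the line, or None if absent."""
    values = VALUE.findall(line)
    return values[n - 1] if 1 <= n <= len(values) else None


line = "BASE 2327 LE 14/03/2017 2423 LE 15/05/2017"
print(nth_value(line, 1))  # candidate for INDEX_DEBUT_BASE[$1]
print(nth_value(line, 2))  # candidate for DATE_INDEX_DEBUT_BASE[$2]
```

This only retrieves a value once its position is known; predicting which position belongs to which category is the part still missing.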

I'm also considering extracting the values directly from the PDFs, but none of the tools I tried (Camelot and tabula) parses the tables in these invoices properly. Maybe a deep learning model could take the PDFs, retrieve the information, and extract and label it automatically? Or a specific NER method that can extract entities like dates, prices, and costs and label them? (I suspect that is not possible, though.)

What's your opinion on this problem?
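On the NER idea: the labels already produced by the regexes could, in principle, be converted into token-level tags, which is the kind of input sequence taggers are usually trained on. The sketch below is only an illustration under assumptions: whitespace tokenization, every date or number token counting as one position, and a hypothetical helper `tag_tokens` that receives a position-to-category mapping.

```python
import re

# A whole token that is a date (dd/mm/yyyy) or a number, optionally with '%'.
VALUE_TOKEN = re.compile(r"\d{2}/\d{2}/\d{4}|\d+(?:,\d+)?%?")


def tag_tokens(line, targets):
    """Tag each whitespace token with its category, or 'O' (outside) otherwise.

    targets maps a 1-based value position to a category name.
    """
    tagged = []
    position = 0
    for token in line.split():
        if VALUE_TOKEN.fullmatch(token):
            position += 1
            tagged.append((token, targets.get(position, "O")))
        else:
            tagged.append((token, "O"))
    return tagged


print(tag_tokens("TOTAL_TTC_POUR_LE_SITE 27,89", {1: "COUT_TTC_TOTAL"}))
```

Pairs like these could serve as training data for a sequence tagger, which would then predict the category of each value token directly instead of a line-level label plus a position.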
