Metric learning for information retrieval in semi-structured text?


I am interested in parsing semi-structured text. Suppose I have text containing labeled items such as year_field, year_value, identity_field, identity_value, ..., address_field, address_value, and so on.

These fields and their associated values can appear anywhere in the text, but they are usually close to each other. More generally, the text is organized in a (very) rough matrix, but quite often the value comes just after its field, possibly with some irrelevant information in between.

There can be up to several dozen different formats, and none of them is rigid: spacing cannot be relied on, and some information may be added or removed.

I am looking into machine learning techniques to extract all the (field, value) pairs of interest.

I think metric learning and/or conditional random fields (CRFs) could be of great help, but I have no practical experience with them.

Has anyone already encountered a similar problem?

Any suggestions or literature on this topic?


There is 1 answer

AvidLearner On

Your task, if I understand correctly, is to extract all pre-defined entities from a text. What you describe is exactly named entity recognition (NER).
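To make the NER framing concrete, here is a minimal sketch of how per-token tags can be turned back into labeled spans. It assumes the common BIO tagging convention (B- begins an entity, I- continues it, O is outside any entity), which the question does not mention; the label names are taken from the question, while the sample tokens, the tags, and the helper `decode_bio` are purely illustrative.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans."""
    spans = []
    current_label, current_tokens = None, []

    def flush():
        # Close the span being built, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # A B- tag (or a stray I- tag with a different label) opens a new span.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O": the token belongs to no entity.
            flush()
            current_label, current_tokens = None, []
    flush()
    return spans


# Hypothetical tagged line; in practice a trained tagger would produce the tags.
sample_tokens = ["Year", ":", "2014", "ref", "Name", "John", "Smith"]
sample_tags = [
    "B-year_field", "O", "B-year_value", "O",
    "B-identity_field", "B-identity_value", "I-identity_value",
]
sample_spans = decode_bio(sample_tokens, sample_tags)
```

Pairing each `*_field` span with the nearest following `*_value` span would then give the (field, value) pairs from the question; that pairing rule is one possible heuristic, not something a tagger provides by itself.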

Stanford provides the Stanford Named Entity Recognizer, which you can download and use. It is written in Java, and wrappers exist for Python and other languages.

Regarding the models you are considering (CRFs, for example), the hard part is getting the training data: sentences with the entities already labeled. This is why you should consider using a pre-trained model, or using someone else's data to train your own (again, the model will recognize only the entity types it saw during training).
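As a rough illustration of what "sentences with the entities already labeled" means for a CRF, the sketch below builds one labeled sequence and a per-token feature dictionary for it. The example line, the feature names, and the helper `token_features` are assumptions made for illustration; per-token feature dictionaries of this kind are the input format that CRF toolkits such as python-crfsuite typically expect, but check the toolkit's documentation before relying on that.

```python
def token_features(tokens, i):
    """Describe token i with simple surface features and its neighbors."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "length": len(tok),
        # Neighboring tokens let the model learn "a value follows its field".
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# One hand-labeled training sequence (hypothetical data, BIO-style labels).
labeled_line = [
    ("Year", "B-year_field"),
    (":", "O"),
    ("2014", "B-year_value"),
]

line_tokens = [tok for tok, _ in labeled_line]
line_labels = [lab for _, lab in labeled_line]
line_features = [token_features(line_tokens, i) for i in range(len(line_tokens))]
```

A training set is then a list of such (features, labels) sequence pairs, one per line or document. With several dozen loosely defined formats, at least a few labeled examples per format would probably be needed.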

A good choice for an already-trained model in Python is NLTK's information extraction module.

Hope this sums it up.