Customised tokens annotation in R


Currently I'm working on an NLP project. It's totally new to me, which is why I'm struggling to implement NLP techniques in R. Generally speaking, I need to extract machine entities from descriptions. I have a dictionary of machines with two columns: Manufacturer and Model.

To train the extraction model, I need an annotated corpus. That's where I'm stuck: how do I annotate machines in text? Here is an example of the text:

The Skyjack 3219E electric scissor lift is a self-propelled device powered by 4 x 6 V batteries. The machine is easy to charge, just plug it into the mains. This unit can be used in construction, manufacturing and maintenance operations as a working installation on any flat paved surface. You can use it both indoors and outdoors. Thanks to its non-marking tyres, the machine does not leave any visible tracks on floors. The machine can be driven at full height and is very easy to operate. The S3219E has a 250 kg platform payload capacity. It can handle two people when operating indoors and one outdoors. Discover our trainings via Heli Safety Academy.

Skyjack 3219E is the machine that has to be identified and tagged. I want results similar to POS tagging, but with manufacturer and model instead of nouns and verbs. All other words can be tagged as irrelevant.

Manual annotation is very expensive and not an option, as the descriptions are usually long and messy.

Is there a way to adapt a POS tagger to use a customised dictionary for tagging? Any help is appreciated!

parsethis On BEST ANSWER

Edit: (At the end of writing this I realized you plan on using R. All my algorithmic suggestions are based on Python implementations, but I hope you can still get some ideas from the answer.)

In general this is considered an NER (named entity recognition) problem. I am doing work on a similar problem at my job.

Is there any general structure to the text?

For example, does the entity name generally occur in the first sentence? This may be a way to simplify a heuristic search or a search based on a dictionary (of known products, for instance).
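
As a rough illustration of that kind of heuristic, here is a minimal Python sketch that only looks at the first sentence and checks it against a dictionary. The MACHINES rows and the function names are made up for the example; the real table would be your Manufacturer/Model dictionary. It uses plain substring matching, which is an assumption that may be too loose for short model codes.

```python
import re

# Hypothetical dictionary rows (manufacturer, model); replace with your own table.
MACHINES = [("Skyjack", "3219E"), ("Genie", "GS-1930")]

def first_sentence(text):
    """Return the text up to and including the first sentence-ending mark."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]

def find_machines(text):
    """Return the (manufacturer, model) pairs named in the first sentence."""
    sentence = first_sentence(text).lower()
    hits = []
    for manufacturer, model in MACHINES:
        # Case-insensitive substring check; both parts must be present.
        if manufacturer.lower() in sentence and model.lower() in sentence:
            hits.append((manufacturer, model))
    return hits
```

If most descriptions name the machine up front, something this simple may already give you a usable baseline before any model training.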

Is annotation that prohibitive?

A week's worth of tagging could be all you need, given that you essentially have only one label that you care about. I was working on discovering brand names in unstructured sentences; we did quite well with a week's worth of annotation and training a CRF (Conditional Random Fields) model. See pycrfsuite, a good Python wrapper of a fast C++ implementation of CRF.
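
A CRF does not see raw words; it sees per-token features. The sketch below shows one plausible way to build them in plain Python. The specific features and the function names are my own assumptions for illustration, not the feature set we used. As far as I know, pycrfsuite accepts one such dict per token when you append a training sequence.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        # Model codes such as 3219E usually contain digits.
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:],
    }
    # Neighbouring words give the model some context.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

def sentence_features(tokens):
    """Return one feature dict per token of a tokenised sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Each sentence then becomes a list of feature dicts paired with a list of tags of the same length.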

[EDIT]

For annotation I used a variant of the BIO tagging scheme.

This is what a typical sentence such as "We would love a victoria's secret in our neighborhood" would look like when tagged:

We O
would O
love O
a O
victoria B-ENT
's I-ENT
secret I-ENT

O represents words that are Outside of the entities I cared about (brands). B represents the Beginning of an entity phrase and I represents the Inside of an entity phrase.

In your case you seem to want to separate the manufacturer and the model item. So you can use tags like B-MAN, I-MAN, B-MOD, I-MOD. Here is an example of annotating:

The O 
Skyjack B-MAN
3219E B-MOD
electric O
scissor O
lift O
etc.

Of course, a manufacturer or a model can have multiple words in its name, so use the I-MAN and I-MOD tags to capture that (see the example from my work above).

See this link (an IPython notebook) for a full example of how tagged sequences look for me. I based my work on this.
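
Since manual annotation is the sticking point, one option is to bootstrap the tags from the dictionary itself. Below is a minimal sketch, assuming exact (case-insensitive) token matches; the entries in MANUFACTURERS and MODELS are placeholders, and bio_tag is a hypothetical helper name. Variants such as "S3219E" would be missed by exact matching and would need fuzzy matching or manual correction.

```python
# Hypothetical dictionary entries, stored as token tuples so that a name
# can span several words. Replace with your own Manufacturer/Model columns.
MANUFACTURERS = [("Skyjack",), ("Atlas", "Copco")]
MODELS = [("3219E",), ("XAS", "88")]

def bio_tag(tokens, manufacturers=MANUFACTURERS, models=MODELS):
    """Tag tokens with B-MAN/I-MAN/B-MOD/I-MOD, and O for everything else."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for label, entries in (("MAN", manufacturers), ("MOD", models)):
        # Longer names first, so a multi-word name wins over its own prefix.
        for entry in sorted(entries, key=len, reverse=True):
            entry_lower = [w.lower() for w in entry]
            n = len(entry_lower)
            for start in range(len(tokens) - n + 1):
                untagged = all(t == "O" for t in tags[start:start + n])
                if untagged and lowered[start:start + n] == entry_lower:
                    tags[start] = "B-" + label
                    for j in range(start + 1, start + n):
                        tags[j] = "I-" + label
    return list(zip(tokens, tags))
```

The output is noisy compared with hand labels, but it can serve as a first pass that a person then only has to correct rather than create.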

Build a big dictionary

We scraped the internet, used our own data, and got databases from partners. From these we built a huge dictionary that we used as features in our CRF and for general searches. See ahocorasick for a fast trie-based keyword search in Python.
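
To give an idea of what a trie-based lookup does, here is a small pure-Python sketch over word tokens. It is a simplified stand-in, not the Aho-Corasick algorithm and not the ahocorasick library's API; build_trie, find_phrases and the example phrases are all hypothetical.

```python
def build_trie(phrases):
    """Build a nested-dict trie keyed by lowercased word tokens."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[None] = phrase  # None marks the end of a complete phrase
    return root

def find_phrases(tokens, trie):
    """Return (start, end, phrase) for the longest match at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node, best, j = trie, None, i
        # Walk down the trie for as long as the next token continues a phrase.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if None in node:
                best = (i, j, node[None])
        if best:
            matches.append(best)
            i = best[1]  # continue after the matched phrase
        else:
            i += 1
    return matches
```

The matched spans can be turned into a per-token "in dictionary" feature for the CRF, or used directly for searching.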

Hope some of this helps!