Here is the code snippet:
In [390]: t
Out[390]: ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
In [391]: ner_tagger.tag(t)
Out[391]:
[('my', 'O'),
('phone', 'O'),
('number', 'O'),
('is', 'O'),
('1111\xa01111\xa01111', 'NUMBER')]
What I expect is:
Out[391]:
[('my', 'O'),
('phone', 'O'),
('number', 'O'),
('is', 'O'),
('1111', 'NUMBER'),
('1111', 'NUMBER'),
('1111', 'NUMBER')]
As you can see, the artificial phone number is joined by \xa0, which is a non-breaking space. Can I configure CoreNLP to keep these tokens separate without changing its other default rules?
The ner_tagger is defined as:
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
TL;DR
NLTK joins the list of tokens into a single string before passing it to the CoreNLP server. CoreNLP then retokenizes the input and concatenates the number-like tokens with \xa0 (non-breaking space).
In Long
Let's walk through the code. If we look at the tag() function from CoreNLPParser, we see that it calls the tag_sents() function, which converts the input list of strings into a single string before calling raw_tag_sents(). This allows CoreNLPParser to re-tokenize the input; see https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348
raw_tag_sents() then passes the input to the server using api_call().
So the question is how to resolve the problem and get the tokens back as they were passed in.
If we look at the options for the tokenizer in CoreNLP, we see the tokenize.whitespace option.
If we change the code to allow additional properties before calling api_call(), we can force CoreNLP to keep the tokens as they were passed in, splitting only on the whitespace used to join them. After that change, the tokens should come back as they were passed in, matching the expected output in the question.
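One way to try this without editing NLTK itself is a small wrapper that builds the properties and calls api_call() directly. This is a sketch under assumptions, not the change from the answer: tag_pretokenized is a hypothetical helper, the annotators and ssplit.isOneSentence values are assumed to mirror what raw_tag_sents() sends, and the response is assumed to follow CoreNLP's JSON layout (sentences, then tokens with word and ner keys).

```python
def tag_pretokenized(parser, tokens, tagtype="ner"):
    """Tag tokens while asking CoreNLP to split on whitespace only.

    `parser` is expected to be a CoreNLPParser-like object exposing
    api_call(data, properties=...). Hypothetical helper, not part of NLTK.
    """
    properties = {
        # Assumed to match the defaults NLTK uses for tagging.
        "annotators": "tokenize,ssplit," + tagtype,
        "ssplit.isOneSentence": "true",
        # The CoreNLP tokenizer option discussed above.
        "tokenize.whitespace": "true",
    }
    response = parser.api_call(" ".join(tokens), properties=properties)
    # Flatten the (single) sentence into (word, tag) pairs.
    return [
        (token["word"], token[tagtype])
        for sentence in response["sentences"]
        for token in sentence["tokens"]
    ]
```

With a server running, this would be called as tag_pretokenized(ner_tagger, t). Note that a token which itself contains whitespace would still be split, since whitespace is the only boundary the server sees.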