How to get time and date or specific product name using NLTK?

doc = '''Andrew Yan-Tak Ng is a Chinese American computer scientist.He is the former chief scientist at Baidu, where he led the company's
Artificial Intelligence Group. He is an adjunct professor (formerly associate professor) at Stanford University. Ng is also the co-founder
and chairman at Coursera, an online education platform. Andrew was born in the UK on 27th Sep 2.30pm 1976. His parents were both from Hong Kong.'''

import nltk

# tokenize doc
tokenized_doc = nltk.word_tokenize(doc)

# tag tokens and use NLTK's named entity chunker
tagged_sentences = nltk.pos_tag(tokenized_doc)
ne_chunked_sents = nltk.ne_chunk(tagged_sentences)

When I process the result and extract the chunks, I only get [('Andrew', 'PERSON'), ('Chinese', 'GPE'), ('American', 'GPE'), ('Baidu', 'ORGANIZATION'), ("company's Artificial Intelligence Group", 'ORGANIZATION'), ('Stanford University', 'ORGANIZATION'), ('Coursera', 'ORGANIZATION'), ('Andrew', 'PERSON'), ('UK', 'ORGANIZATION'), ('Hong Kong', 'GPE')]
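The extraction step might look something like this (a minimal sketch; extract_entities is just an illustrative helper name):

# walk the chunked tree: named entities come back as nltk.Tree nodes
# labelled with the entity type, plain tokens as (word, POS) tuples
def extract_entities(ne_tree):
    entities = []
    for subtree in ne_tree:
        if hasattr(subtree, 'label'):  # it's a named-entity chunk
            entity = ' '.join(token for token, pos in subtree.leaves())
            entities.append((entity, subtree.label()))
    return entities

print(extract_entities(ne_chunked_sents))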

I need to get the time and date too. Any suggestions? Thank you.


1 Answer

Lafa (best answer)

You need a more sophisticated tagger, such as Stanford's Named Entity Recognizer. Once you have it installed and configured, you can run it like this:

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# paths to the 7-class English model (which includes DATE) and the Stanford NER jar
stanfordClassifier = '/path/to/classifier/classifiers/english.muc.7class.distsim.crf.ser.gz'
stanfordNerPath = '/path/to/jar/stanford-ner/stanford-ner.jar'

st = StanfordNERTagger(stanfordClassifier, stanfordNerPath, encoding='utf8')

doc = '''Andrew Yan-Tak Ng is a Chinese American computer scientist.He is the former chief scientist at Baidu, where he led the company's Artificial Intelligence Group. He is an adjunct professor (formerly associate professor) at Stanford University. Ng is also the co-founder and chairman at Coursera, an online education platform. Andrew was born in the UK on 27th Sep 2.30pm 1976. His parents were both from Hong Kong.'''

# tag the tokenized document with the Stanford NER model
result = st.tag(word_tokenize(doc))

# keep only the DATE and ORGANIZATION tokens
date_word_tags = [wt for wt in result if wt[1] in ('DATE', 'ORGANIZATION')]

print(date_word_tags)

Where the output would be:

[(u'Artificial', u'ORGANIZATION'), (u'Intelligence', u'ORGANIZATION'), (u'Group', u'ORGANIZATION'), (u'Stanford', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'Coursera', u'ORGANIZATION'), (u'27th', u'DATE'), (u'Sep', u'DATE'), (u'2.30pm', u'DATE'), (u'1976', u'DATE')]
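
If you need the date as a single phrase rather than individual tokens, one option (a minimal sketch, not something the tagger does for you) is to merge consecutive tokens that carry the same tag:

from itertools import groupby

# merge runs of consecutive tokens that share the same NER tag
def group_entities(tagged_tokens, wanted=('DATE',)):
    phrases = []
    for tag, run in groupby(tagged_tokens, key=lambda wt: wt[1]):
        if tag in wanted:
            phrases.append(' '.join(word for word, _ in run))
    return phrases

print(group_entities(result))  # e.g. ['27th Sep 2.30pm 1976']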

You will probably run into some issues when trying to install and set up everything, but I think it's worth the hassle.

Let me know if it helps.