I'm trying to do Named Entity Recognition on search engine queries with Python.
The big thing about search engine queries are that they are usually incomplete or all lowercase.
For this task, I've been recommended Spacy, NLTK, Stanford NLP, Flair, Transformers by Hugging Face as some approaches to this problem.
I was wondering if anybody in the SO community knew the best approach to dealing with NER for search engine queries, because so far I've ran into problems.
For example, with Spacy:
import spacy
# Load the pre-trained model
nlp = spacy.load("en_core_web_sm")
# Process a text
text = "google and apple are looking at buying u.k. startup for $1 billion"
text = "who is barack obama"
doc = nlp(text)
# Extract entities
for ent in doc.ents:
print(ent.text, ent.label_)
For the first query I got:
google ORG
u.k. GPE
$1 billion MONEY
This is a great answer. However, for the search query "who is barack obama", in lower case, it returned no entities.
I'm sure I'm not the first person to do NER on search engine queries in Python, so I'm hoping to find someone who can point me in the right direction.
Problem
Most of the NER models focus on Cased tokens as the main feature.
Solution
I would try GPT models, as they have been trained on masking and context tasks, so they should be able to recognise entities based on the context.
I run a quick expeirment with chatgpt.
Prompt:
It responded well in your use case (try it in the chatgpt app!)
Code
The following code and dependencies should do the trick on a first appproachwith OpenAI models
(It has been difficult to find the current combination of versions, openAI recently migrated to new API so tutorials now are in the wild...)
The output: