I am classifying input sentences into different categories, such as time, distance, speed, location, etc.
I trained the classifier using MultinomialNB. The classifier mainly uses term frequency (tf) as its feature. I also tried taking sentence structure into account by using 1- to 4-grams.
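For reference, a minimal sketch of the kind of setup described here, assuming scikit-learn's CountVectorizer for the tf features and the alpha value quoted with the results. The training sentences, the labels, and the helper name ranked_classes are hypothetical and only stand in for the real data of about 40 classes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; the real dataset is much larger.
train_sentences = [
    "what is the cost of a ticket",
    "how much does the phone cost",
    "how fast can a car go",
    "how fast does the train run",
    "how long does the trip take",
    "how early does the shop open",
]
train_labels = ["money", "money", "speed", "speed", "time", "time"]

# Term counts over 1- to 4-grams, fed into multinomial Naive Bayes.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 4)),
    MultinomialNB(alpha=0.001),
)
model.fit(train_sentences, train_labels)


def ranked_classes(sentence):
    """Return (label, probability) pairs, most probable class first."""
    probs = model.predict_proba([sentence])[0]
    pairs = zip(model.classes_, probs)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Calling ranked_classes("How fast can I go to mumbai") returns the classes ordered by probability, which is the shape of the ranked output listed with each query.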
Using MultinomialNB with alpha = 0.001, these are the results for a few queries:
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # for the above two queries the result should also be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is a meter
{"1": {"period": "90.74%"}, "2": {"dist": "9.26%"}} # a better result would be distance
Using MultinomialNB with n-grams (1-4):
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # for the above two queries the result should also be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is an hour
{"1": {"dist": "99.61%"}} #result should be time
So the result depends purely on word occurrence. Is there any way to add word sense disambiguation here (or any other means of bringing in some kind of understanding)?
I already checked "Word sense disambiguation in NLTK Python", but the issue here is identifying the main word in the sentence, which differs from sentence to sentence.
I already tried POS tagging (it gives tags such as NN and JJ, which the sentence category does not depend on) and NER (it is highly dependent on capitalization, and sometimes it also fails to disambiguate words like "early" and "cost" in the sentences above). Neither of them helps.
**"How long" is sometimes considered time and sometimes distance, so the classifier should be able to work out which one it is from the nearby words in the sentence. Similarly, "how fast", "how come", and "how early" ([how + word]) should be understandable.**
I am using NLTK, scikit-learn, and Python.
Update:
- 40 classes (each with sentences belonging to that class)
- Total data: 300 KB
Accuracy depends on the query. Sometimes it is very good (>90%); sometimes an irrelevant class is returned. It depends on how well the query matches the dataset.
Based on my understanding of the nature of your problem so far, I would suggest using an unsupervised, rule-based classification method, meaning that you classify with a set of rules rather than a trained model. By rules I mean if ... then ... else conditions. This is how some expert systems work. To add an understanding of similar concepts and synonyms, I suggest you create an ontology. Ontologies are a sub-concept of the Semantic Web. Problems such as yours are usually addressed with Semantic Web techniques, whether RDF schemas or ontologies. You can learn more about semantic web here and about ontology here.
My suggestion is not to go too deep into these fields. Just learn the general high-level idea, and then write your own ontology in a text file (avoid using any tools to build the ontology, because they take too much effort and your problem is easy enough not to need that effort). If you search the web you will find some existing ontologies, but in your case it's better to write a small ontology of your own and use it to build the set of rules, and you are good to go.
One note about your solution (using NB) on this kind of data: you can easily run into an overfitting problem, which would result in low accuracy for some queries and high accuracy for others. I think it's better to avoid supervised learning for this problem. Let me know if you have further questions.
Edit 1: In this edit I would like to elaborate on the above answer. Let's say you want to build an unsupervised classifier. The data you currently have can be split into about 40 different classes. Because the sentences in your dataset are already fairly restricted and simple, you can do this by classifying those sentences with a set of rules. Let me show you what I mean by this. Let's say a random sentence from your dataset is kept in a variable named sentence.
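A set of if ... then ... else rules of this kind could look roughly like the sketch below. The patterns, the class labels, and the function name classify are all assumptions made for illustration; real rules would have to be written against the actual 40 classes.

```python
import re

# Hypothetical rules: each pairs a regular expression with a class label.
# Order matters: the first matching rule wins, so the more specific
# "how long ... <distance word>" rule comes before the generic "how long".
RULES = [
    (re.compile(r"\bhow (fast|quickly)\b"), "speed"),
    (re.compile(r"\bhow (early|late|soon)\b"), "time"),
    (re.compile(r"\bhow come\b"), "manner"),
    (re.compile(r"\bhow long\b.*\b(meter|metre|mile|kilometer|road|river)s?\b"), "dist"),
    (re.compile(r"\bhow long\b"), "time"),
    (re.compile(r"\b(cost|price|value)\b"), "money"),
]


def classify(sentence, default="other"):
    """Return the label of the first rule that matches, else the default."""
    text = sentence.lower()
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return default
```

With these example rules, "How long is a meter" falls into dist because of the nearby word "meter", while "How long is an hour" falls through to the generic "how long" rule and is labeled time.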
So you get the idea of what I meant. If you build a simple classifier in this way and make it as precise as possible, you can easily reach an overall accuracy of almost 100%. Now, if you want to automate some more complicated decision making, you need a form of knowledge base, which I'd refer to as an ontology. Suppose that in a text file you have something like the following (I am writing it in plain English just to make it simple to understand; you can write it in a concise, coded manner, and it's just a general example to show you what I mean):
Then, if you want to measure value, you already have a formula for it. You should decide whether you need such a formula based on your data. Or, if you want to keep a list of synonyms, you can store them in a text file and replace them interchangeably. The overall implementation of the classifier for 40 classes in the way I mentioned requires a few days, and since the method is quite deterministic, you can expect to achieve a very high accuracy of up to 100%.
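The synonym list idea can be sketched as follows. The file format (one "concept: word, word, ..." entry per line), the concept names, and the helper names load_synonyms and normalize are hypothetical; the text is inlined as a string here only to keep the example self-contained, and in practice it would be read from a text file.

```python
# Hypothetical synonym list, one "concept: word, word, ..." entry per line.
SYNONYMS_TEXT = """
money: cost, price, value, worth
speed: fast, quick, quickly
time: early, late, soon
"""


def load_synonyms(text):
    """Build a word -> concept mapping from the synonym list text."""
    mapping = {}
    for line in text.strip().splitlines():
        concept, words = line.split(":", 1)
        for word in words.split(","):
            mapping[word.strip().lower()] = concept.strip()
    return mapping


def normalize(sentence, mapping):
    """Replace every known synonym in the sentence with its concept name."""
    tokens = sentence.lower().split()
    return " ".join(mapping.get(token, token) for token in tokens)


synonyms = load_synonyms(SYNONYMS_TEXT)
```

After normalization, "what is the value of Watch", "what is the price of Watch", and "what is the cost of Watch" all become the same string, so a single rule on the concept word covers all three queries.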