Word sense disambiguation in classification


I am classifying input sentences into different categories, like time, distance, speed, location, etc.

I trained a classifier using MultinomialNB.

The classifier mainly uses term frequency (tf) as its feature; I also tried incorporating sentence structure (using 1-4 grams).

Using MultinomialNB with alpha = 0.001, these are the results for a few queries:

what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}}  # for the above two queries the result should also be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}}  #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is a meter
{"1": {"period": "90.74%"}, "2": {"dist": "9.26%"}}  #better result should be distance
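For reference, the setup described above might look roughly like this in scikit-learn; the tiny training set and labels below are placeholders, not the real 40-class data:

```python
# Sketch of the described setup: tf (count) features fed to MultinomialNB
# with alpha=0.001. The training data here is a made-up placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = [
    "what is the cost of a watch",
    "how far is mumbai",
    "how fast can I go to mumbai",
]
train_labels = ["money", "dist", "speed"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)),
                    MultinomialNB(alpha=0.001))
clf.fit(train_sentences, train_labels)

# Rank classes by probability, as in the result dumps above
probs = clf.predict_proba(["what is the price of a watch"])[0]
ranking = sorted(zip(clf.classes_, probs), key=lambda p: -p[1])
print(ranking)
```

Because "price" never occurs in this toy training set, the prediction leans entirely on the surrounding words, which illustrates the word-occurrence problem discussed below.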

Using MultinomialNB with n-grams (1-4):

what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}}   # for the above two queries the result should also be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}}  #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is an hour
{"1": {"dist": "99.61%"}}   #result should be time

So the result depends purely on word occurrence. Is there any way to add word sense disambiguation here (or any other means by which some kind of understanding could be brought in)?

I already checked Word sense disambiguation in NLTK Python

but the issue here is identifying the main word in the sentence, which differs from sentence to sentence.

I have already tried POS tagging (it gives NN, JJ, etc., which the sentence's meaning does not rely on) and NER (highly dependent on capitalization, and sometimes NER also fails to disambiguate words like "early" or "cost" in the sentences above); neither helps.

**"How long" is sometimes considered time and sometimes distance. So, based on the nearby words in the sentence, the classifier should be able to understand which one it is. Similarly, "how fast", "how come", "how early" [how + word] should be understandable.**

I am using NLTK, scikit-learn, and Python.

Update :

  • 40 classes (each with sentences belonging to that class)
  • Total data: 300 KB

Accuracy depends on the query: sometimes it is very good (>90%), and sometimes an irrelevant class comes out on top. It depends on how well the query matches the dataset.


There are 3 answers

user823743 answered:

Based on my understanding of the nature of your problem so far, I would suggest using an unsupervised classification method, meaning that you use a set of rules for classification. By rules I mean if ... then ... else conditions; this is how some expert systems work. But to add understanding of similar concepts and synonyms, I suggest you create an ontology. Ontologies are a sub-concept of the Semantic Web, and problems such as yours are usually addressed by it, be it via RDF schemas or ontologies. You can learn more about the Semantic Web here and about ontologies here. My suggestion is not to go too deep into these fields, but just to learn the general high-level idea, and then write your own ontology in a text file (avoid using tools to build an ontology, because they take too much effort, and your problem is easy enough not to need that effort). When you search the web you will find some already existing ontologies, but in your case it's better to write a small ontology of your own, use it to build the set of rules, and you are good to go.

One note about your solution (using NB) on this kind of data: you can easily run into an overfitting problem, which would result in low accuracy for some queries and high accuracy for others. I think it's better to avoid supervised learning for this problem. Let me know if you have further questions.

Edit 1: In this edit I would like to elaborate on the above answer. Let's say you want to build an unsupervised classifier. The data you currently have can be split into about 40 different classes. Because the sentences in your dataset are already fairly restricted and simple, you can classify them with a set of rules. Let me show you what I mean. Say a random sentence from your dataset is kept in the variable sentence:

if "long" in sentence:
    if "meter" in sentence:
        print("it is distance")
    # elif ...  (more rules for "long" go here)
    else:
        print("it is period")
elif "fast" in sentence:
    print("it is speed or time")
elif "early" in sentence:
    print("it is time")

So you get the idea of what I meant. If you build a simple classifier in this way, and make it as precise as possible, you can easily reach overall accuracies of almost 100%. Now, if you want to automate some complicated decision-making, you need a form of knowledge base, which I'd refer to as an ontology. Say that in a text file you have something like the following (I am writing it in plain English just to keep it simple to understand; you can write it in a concise coded manner, and it's just a general example to show you what I mean):

"Value" depends 60% on "cost (measured with money)", 20% on "durability (measured in time)", 20% on "ease of use (measured in quality)"

Then, if you want to measure value, you already have a formula for it. You should decide whether you need such a formula based on your data. Or, if you wanted to keep a synonyms list, you could have it as a text file and substitute synonyms before matching. The overall implementation of the classifier for 40 classes in the way I mentioned requires a few days, and since the method used is quite deterministic, you can achieve a very high accuracy of up to 100%.
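The synonyms-list idea could be sketched like this; the mapping below is a made-up placeholder, not a real synonym inventory:

```python
# Sketch: normalize known synonyms to one canonical token before applying
# the classification rules. The mapping here is an illustrative placeholder.
SYNONYMS = {
    "price": "cost",
    "value": "cost",
    "worth": "cost",
}

def normalize(sentence):
    """Replace each known synonym with its canonical form."""
    return " ".join(SYNONYMS.get(w, w) for w in sentence.lower().split())

print(normalize("what is the price of Watch"))
```

After normalization, a single rule matching "cost" covers all three surface forms.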

tripleee answered:

Attempting to deduce semantics purely by looking at individual words out of context is not going to take you very far. In your "watch" examples, the only term which actually indicates that you have "money" semantics is the one you hope to disambiguate. What other information is there in the sentence to help you reach that conclusion, as a human reader? How would you model that knowledge? (A traditional answer would reason about your perception of watches as valuable objects, or something like that.)

Having said that, you might want to look at Wordnet synsets as a possibly useful abstraction. At least then you could say that "cost", "price", and "value" are related somehow, but I suppose the word-level statistics you have already calculated show that they are not fully synonymous, and the variation you see basically accounts for that fact (though your input size sounds kind of small for adequately covering variances of usage patterns for individual word forms).

Another hint could be provided by part of speech annotation. If you know that "value" is used as a noun, that (to my mind, at least) narrows the meaning to "money talk", whereas the verb reading is much less specifically money-oriented ("we value your input", etc). In your other examples, it is harder to see whether it would help at all. Perhaps you could perform a quick experiment with POS-annotated input and see whether it makes a useful difference. (But then POS is not always possible to deduce correctly, for much the same reasons you are having problems now.)

The sentences you show as examples are all rather simple. It would not be very hard to write a restricted parser for a small subset of English where you could actually start to try to make some sense of the input grammatically, if you know that your input will generally be constrained to simple questions with no modal auxiliaries etc.

(Incidentally, I'm not sure "how come can I go to Mumbai" is "manner", if it is grammatical at all. Strictly speaking, you should have subordinate clause word order here. I would understand it to mean roughly "Why is it that I can go to Mumbai?")

alexis answered:

Your result "depends purely on word occurrence" because that is the kind of features your code produces. If you feel that this approach is not sufficient for your problem, you need to decide what other information you need to extract. Express it as features, i.e. as key-value pairs, add them to your dictionary, and pass them to the classifier exactly as you do now. To avoid overtraining you should probably limit the number of ngrams you do include in the dictionary; e.g., keep only the frequent ones, or the ones containing certain keywords you consider relevant, or whatever.

I'm not quite sure what classification you mean by "distance, speed, location, etc.", but you've mentioned most of the tools I'd think to use for something like this. If they didn't work to your satisfaction, think about more specific ways to detect properties that might be relevant; then express them as features so they can contribute to classification along with the bag-of-words features you already have. (But note that many experts in the field get acceptable results using just the bag-of-words approach.)