semantic matching strings - using word2vec or s-match?


I have this problem of matching two strings for 'more general', 'less general', 'same meaning', 'opposite meaning' etc.

The strings can be from any domain. Assume that the strings can be from people's emails.

To give an example,

String 1 = "movies"
String 2 = "Inception"

Here I should know that "Inception" is less general than "movies" (a sort of is-a relationship).

String 1 = "Inception"
String 2 = "Christopher Nolan"

Here I should know that Inception is less general than Christopher Nolan

String 1 = "service tax"
String 2 = "service tax 2015"

At a glance, it appears to me that S-Match will do the job. But I am not sure whether S-Match can be made to work on knowledge bases other than WordNet or GeoWordNet (the ones mentioned on its page).

If I use word2vec or dl4j, I guess it can give me similarity scores. But can it also tell me whether one string is more general or less general than the other?

I do see that word2vec can be trained on a custom training set or on a large corpus such as Wikipedia.

Can someone shed some light on the way forward?

Mehdi On BEST ANSWER

Current machine learning methods for modelling words, such as word2vec and dl4j, are based on the distributional hypothesis. They train models of words and phrases based on the contexts in which they occur. There are no ontological aspects in these word models. At best, a well-trained model built with these tools can say whether two words can appear in similar contexts. That is how their similarity measure works.
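To illustrate why a similarity score alone cannot express direction, here is a minimal sketch. It assumes the similarity measure is cosine similarity, which is the usual choice for word vectors, and it uses made-up three-dimensional vectors rather than output from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "movies": [0.9, 0.1, 0.3],
    "Inception": [0.8, 0.2, 0.4],
}

forward = cosine_similarity(toy_vectors["movies"], toy_vectors["Inception"])
backward = cosine_similarity(toy_vectors["Inception"], toy_vectors["movies"])

# The score is the same in both directions, so by itself it cannot say
# which of the two strings is the more general one.
print(math.isclose(forward, backward))
```

Whatever the vectors are, swapping the arguments gives the same score, so a "more general" versus "less general" decision has to come from somewhere else.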

The Mikolov papers (a, b and c), which suggest that these models can learn "linguistic regularities", do not include any ontological test analysis; they only suggest that these models are capable of predicting "similarity between members of the word pairs". This kind of prediction doesn't help with your task. These models are even incapable of distinguishing similarity from relatedness (e.g. read this page on the SimLex test set).

I would say that you need an ontological database to solve your problem. More specifically, for String 1 and String 2 in your examples:

String 1 = "a"
String 2 = "b"

You are trying to check entailment relations in sentences:

(1) "c is b"

(2) "c is a"

(3) "c is related to a".

Where:

(1) entails (2)

or

(1) entails (3)
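The entailment view above can be sketched with a tiny hand-built knowledge base. Everything in it is an assumption for illustration: the `IS_A` and `RELATED_TO` tables, their entries, and the `compare` function are hypothetical stand-ins for whatever ontological database you end up using.

```python
# Hypothetical toy knowledge base. A real system would query an
# ontology or knowledge graph instead of these hand-written tables.
IS_A = {
    "Inception": {"movies"},
    "movies": {"creative works"},
}
RELATED_TO = {
    "Inception": {"Christopher Nolan"},
}


def ancestors(term):
    """Return every term reachable from `term` by following is-a edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in IS_A.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare(a, b):
    """Describe how string `a` stands relative to string `b`."""
    if a == b:
        return "same"
    if b in ancestors(a):   # "c is a" entails "c is b"
        return "less general"
    if a in ancestors(b):   # "c is b" entails "c is a"
        return "more general"
    if b in RELATED_TO.get(a, ()) or a in RELATED_TO.get(b, ()):
        return "related"    # "c is b" entails "c is related to a"
    return "unknown"


print(compare("movies", "Inception"))
print(compare("Inception", "Christopher Nolan"))
```

The is-a lookup is transitive, so "Inception" also comes out as less general than "creative works" in this toy data. Note that the film/director pair ends up as "related" rather than as a generality relation, which matches the distinction between cases (2) and (3).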

For your first two examples, you can probably use semantic knowledge bases to solve the problem. But your third example will probably need syntactic parsing before the difference between the two phrases can be understood. For example, consider these phrases:

"men"

"all men"

"tall men"

"men in black"

"men in general"

Solving your problem requires logical understanding. However, based on the economy of language, you can reason that adding more words to a phrase usually makes it less general: longer phrases tend to be less general than shorter ones. This doesn't give you a precise tool to solve the problem, but it can help to analyse phrases that don't contain special words such as "all", "general" or "every".
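A rough sketch of that heuristic follows. The function name, the whitespace tokenisation, and the `SPECIAL_WORDS` list are assumptions made for illustration, and the rule is exactly as imprecise as described above.

```python
# Words that change the scope of a phrase rather than narrowing it.
# This short list is an assumption; extend it as needed.
SPECIAL_WORDS = {"all", "every", "general"}


def compare_by_added_words(a, b):
    """Guess how phrase `a` stands relative to phrase `b` by added words."""
    tokens_a = a.lower().split()
    tokens_b = b.lower().split()
    if tokens_a == tokens_b:
        return "same"
    # Give up when a special word is present: "all men" is not
    # narrower than "men", so the heuristic does not apply.
    if SPECIAL_WORDS & (set(tokens_a) | set(tokens_b)):
        return "undecided"
    if set(tokens_a) < set(tokens_b):   # b only adds words to a
        return "more general"
    if set(tokens_b) < set(tokens_a):   # a only adds words to b
        return "less general"
    return "undecided"


print(compare_by_added_words("service tax", "service tax 2015"))
print(compare_by_added_words("men", "tall men"))
print(compare_by_added_words("men", "all men"))
```

Treating phrases as bags of words ignores word order and syntax, which is why the answer above suggests syntactic parsing for anything beyond the simplest cases.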