Text to Tag similarity word2vec


Our users will give a 2 to 3 sentence description about their profession. Example user A (profile description): I am a data scientist living in Berlin, I like Japanese food and I am also interested in arts.

Then they also give a description about what kind of person they are looking for. Example user B (looking for description): I am looking for a data scientist, sales guy and an architect for my new home.

We want to match these on the basis that user A is a data scientist and user B is looking for a data scientist.

At first we required the user to hand-select the tags they wanted to be matched on. An example of the kind of tags we provided:

Environmental Services
Events Services
Executive Office
Facilities Services
Human Resources
Information Services
Management Consulting
Outsourcing/Offshoring
Professional Training & Coaching
Security & Investigations
Staffing & Recruiting
Supermarkets
Wholesale
Energy & Mining
Mining & Metals
Oil & Energy
Utilities
Manufacturing
Automotive
Aviation & Aerospace
Chemicals
Defense & Space
Electrical & Electronic Manufacturing
Food Production
Industrial Automation
Machinery
Japanese Food
...

This system kind of works, but we have a lot of tags and we want to capture more 'distant' relations.

So we need:

  • to know which parts are important; we could use POS tagging for this, to extract phrases like 'data science' and 'japanese food'?
  • and then compare the vectors of each part; e.g. 'data science' and 'statistics' is a good match, and 'japanese food' and 'asian food' is a good match.
  • and set a similarity threshold.
  • and this should result in a more convenient way of matching, right?
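A minimal sketch of those three steps, where the hand-written 3-d vectors stand in for real word2vec embeddings (the phrases, numbers, and the 0.9 threshold are all illustrative assumptions):

```python
import math

# Toy phrase vectors; in practice these would come from a trained
# word2vec/GloVe model after POS tagging has extracted the phrases.
VECTORS = {
    "data science":  [0.9, 0.1, 0.0],
    "statistics":    [0.8, 0.2, 0.1],
    "japanese food": [0.1, 0.9, 0.2],
    "asian food":    [0.2, 0.8, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def match(phrase_a, phrase_b, threshold=0.9):
    """Step 2 + 3: compare the vectors, then apply a similarity threshold."""
    return cosine(VECTORS[phrase_a], VECTORS[phrase_b]) >= threshold

print(match("data science", "statistics"))     # similar professions -> True
print(match("data science", "japanese food"))  # unrelated phrases   -> False
```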

There are 3 answers

Ivica On

To improve tag-based matching with a large set of tags, you can use part-of-speech (POS) tagging to identify essential keywords within tags. These keywords, like "data science" or "Japanese food," serve as the focal points for matching. Convert these keywords into vector representations using techniques like Word2Vec, which capture semantic meaning (TF-IDF is a simpler, purely lexical alternative).

Next, compare the vectors of different tags to measure their similarity. Common similarity metrics like cosine similarity can quantify the relatedness of tags. Set a similarity threshold to determine which tags are considered relevant matches. Fine-tune this threshold to control the granularity of matches.

When users select tags, compare their chosen tags with others in your database. Present potential matches whose similarity scores exceed the threshold. Additionally, handle variations in tags using techniques like synonym mapping or stemming to ensure robust matching.
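The "synonym mapping or stemming" step might look like this rough sketch; the synonym table and the plural-stripping "stemmer" are toy stand-ins for a curated mapping and a real stemmer such as Porter's:

```python
# Toy synonym map; a real system would curate this (or learn it).
SYNONYMS = {
    "oil & energy": "energy & mining",
    "hr": "human resources",
}

def normalize(tag):
    """Map synonyms to a canonical tag, then crudely stem plurals."""
    tag = tag.lower().strip()
    tag = SYNONYMS.get(tag, tag)
    # naive "stemming": strip trailing 's' from longer words
    return " ".join(w.rstrip("s") if len(w) > 3 else w for w in tag.split())

print(normalize("Supermarkets"))  # -> "supermarket"
print(normalize("HR"))            # -> "human resource" (mapped, then stemmed)
```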

This approach allows for more nuanced and distant tag relations, resulting in a flexible and accurate matching system. While it may require computational resources, it greatly enhances the user experience by providing better tag-based recommendations.

inverted_index On

It's essential to first clarify what "importance" means in this context. From the given example, it appears that matching based on job title is the goal, but there could be other criteria like location, interests, etc. To extract relevant phrases or entities from the text, you could employ POS (part-of-speech) tagging, Named Entity Recognition (NER), or even relation extraction techniques (like what the OpenIE package does).
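As a rough stand-in for proper POS/NER extraction (which would need spaCy or a similar toolkit plus a trained model), even chunking on stopwords shows the shape of this step; the stopword list below is improvised for the example sentence:

```python
import re

# Improvised stopword list; real systems use POS tags or NER instead.
STOPWORDS = {"i", "am", "a", "an", "the", "in", "and", "also", "for",
             "my", "looking", "like", "interested", "new", "living"}

def extract_phrases(text):
    """Keep runs of non-stopwords as candidate phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(" ".join(current))
    return phrases

print(extract_phrases("I am a data scientist living in Berlin, "
                      "I like Japanese food and I am also interested in arts."))
# -> ['data scientist', 'berlin', 'japanese food', 'arts']
```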

The subsequent step involves matching instances based on the significant phrases or entities extracted. For this, semantic matching methods like Cosine Similarity can be used. However, before applying Cosine Similarity, you'll need to convert these phrases into vector representations. Starting with Word2Vec (W2V) or GloVe embeddings is a good idea, and you may also explore modern contextualized models like BERT or RoBERTa, which currently represent the state-of-the-art in representation learning.
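Before cosine similarity can be applied, each extracted phrase needs a single vector; with W2V or GloVe a common baseline is to average the phrase's word vectors. The 2-d vectors below are invented purely for illustration:

```python
import math

# Invented word vectors; a real system would load pretrained embeddings.
WORD_VECS = {
    "data":       [0.7, 0.1],
    "scientist":  [0.8, 0.0],
    "statistics": [0.9, 0.1],
    "architect":  [0.1, 0.9],
}

def phrase_vector(phrase):
    """Average the word vectors of a phrase (a simple W2V baseline)."""
    vecs = [WORD_VECS[w] for w in phrase.lower().split() if w in WORD_VECS]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

print(cosine(phrase_vector("data scientist"), phrase_vector("statistics")))  # high
print(cosine(phrase_vector("data scientist"), phrase_vector("architect")))   # low
```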

For aspects like thresholding, a trial-and-error approach could be beneficial. Begin with a predefined similarity threshold, and then adjust this value based on the outcomes of your testing and the quality of matches observed. This iterative adjustment can help fine-tune the matching process to achieve better results.
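The trial-and-error thresholding can be made slightly more systematic by sweeping candidate thresholds over a small hand-labeled set of pairs; the similarity scores and labels below are made up:

```python
# (similarity_score, was_it_actually_a_good_match) -- invented labels
labeled = [(0.95, True), (0.88, True), (0.72, False), (0.40, False), (0.81, True)]

def accuracy(threshold):
    """Fraction of labeled pairs the threshold classifies correctly."""
    return sum((score >= threshold) == label for score, label in labeled) / len(labeled)

# sweep 0.50, 0.55, ..., 0.95 and keep the best-performing threshold
best = max((t / 100 for t in range(50, 100, 5)), key=accuracy)
print(best, accuracy(best))  # -> 0.75 1.0
```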

gojomo On

Your conjectures are reasonable, but: you have to test them in a real system, subject to your data, goals, and other choices.

The quality of your results could vary quite a bit depending on your choices of how to define 'entities', and on other data prep/enhancement.

Ultimately you're asking a sort of classification or ranking question:

"Given this [free-text-description-of-wants], how likely is another [free-text-description-of-offering] to make some user satisfied?"

(It's classification if you're focusing on: would downstream evaluation consider it binary 'good-enough' or 'no-good'. It's ranking/scoring if you want to report some sense of relative-appropriateness.)

Something simple like various kinds of mere semantic similarity between the two texts might be a valid way to go from nothing to something: bootstrap a little advantage.

But it's likely the true relationship to successful downstream matching/recommendation is more complicated than mere textual similarity. (For example, the best matches may conventionally be described with different sets of words than are used for free-form specifications, in a relationship that people understand, and that may be learnable, but isn't mere word correlation.)

Thus you might want to enhance your texts with extra calculated features, and generically train a system to score candidate pairs of (need-text-with-all-features), (offer-text-with-all-features) as better or worse, based on your other (ad hoc or formally-acquired) 'gold standard' examples of what you want the system to do.
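One minimal version of "generically train a system to score candidate pairs": turn each (need, offer) pair into a feature vector (e.g. text cosine, shared-tag count, same city), then fit a classifier on gold-standard judgements. Everything here, features and labels included, is invented, and a tiny hand-rolled logistic regression stands in for a real ML library:

```python
import math

# (features, label) -- made-up gold-standard judgements, where features
# are [text_cosine, shared_tag_count, same_city] for a (need, offer) pair
pairs = [
    ([0.9, 2, 1], 1), ([0.8, 1, 0], 1),
    ([0.3, 0, 1], 0), ([0.2, 0, 0], 0),
]

w = [0.0, 0.0, 0.0]
b = 0.0
for _ in range(500):                       # stochastic gradient descent
    for x, y in pairs:
        p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
        g = p - y                          # logistic-loss gradient
        w = [wi - 0.1 * g * xi for wi, xi in zip(w, x)]
        b -= 0.1 * g

def score(features):
    """Probability-like score that a candidate pair is a good match."""
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, features)) + b)))

print(score([0.85, 2, 1]) > 0.5)   # pair resembling the positives -> True
print(score([0.2, 0, 0]) > 0.5)    # pair resembling the negatives -> False
```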

There are boundless ways to iteratively try & improve such a system - what makes sense depends on your data, & effective budgets of skills/attention/compute/etc.