Alternatives to TF-IDF and Cosine Similarity (comparing documents with different formats)


I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows (a rough sketch of these steps is included after the list):

1) Process the text of each job listing to extract skills that are mentioned in the listing

2) For each career (e.g. "Data Analyst"), combine the processed text of the job listings for that career into one document

3) Calculate the TF-IDF of each skill within the career documents
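
For concreteness, here is a minimal sketch of steps 2 and 3 using scikit-learn; the listings_by_career dictionary (and the assumption that step 1 has already reduced each listing to a list of skill tokens) is just an illustration, not my actual data:

```
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed example input: step 1 has already reduced each listing to a list
# of extracted skill tokens, grouped by career title.
listings_by_career = {
    "Data Analyst": [["sql", "excel", "python"], ["sql", "tableau"]],
    "Web Developer": [["javascript", "html", "css"], ["javascript", "react"]],
}

# Step 2: combine the processed listings for each career into one document
careers = list(listings_by_career)
career_docs = [
    " ".join(skill for listing in listings_by_career[c] for skill in listing)
    for c in careers
]

# Step 3: TF-IDF of each skill within the career documents
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(career_docs)  # (n_careers, n_skills)
skill_index = vectorizer.vocabulary_                  # skill -> column index
```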

After this, I'm not sure which method I should use to rank careers based on a list of a user's skills. The most popular method that I've seen would be to treat the user's skills as a document as well, then to calculate the TF-IDF for the skill document, and use something like cosine similarity to calculate the similarity between the skill document and each career document.
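
Continuing the sketch above, that popular approach would look roughly like this (the user_skills list is just an assumed example):

```
from sklearn.metrics.pairwise import cosine_similarity

# Treat the user's skills as one more "document" and project it into the
# TF-IDF space fitted on the career documents.
user_skills = ["python", "sql", "statistics"]
user_vec = vectorizer.transform([" ".join(user_skills)])

# Cosine similarity between the user vector and every career document
scores = cosine_similarity(user_vec, tfidf_matrix).ravel()
print(sorted(zip(careers, scores), key=lambda x: x[1], reverse=True))
```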

This doesn't seem like the ideal solution to me, since cosine similarity is best used when comparing two documents of the same format. For that matter, TF-IDF doesn't seem like the appropriate metric to apply to the user's skill list at all. For instance, if a user adds additional skills to their list, the TF of each existing skill drops. In reality, I don't care what the frequencies of the skills in the user's list are -- I just care that the user has those skills (and maybe how well they know those skills).

It seems like a better metric would be to do the following (a rough sketch follows the list):

1) For each skill that the user has, calculate the TF-IDF of that skill in the career documents

2) For each career, sum the TF-IDF scores for all of the user's skills

3) Rank careers based on the above sum
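
A rough sketch of that scoring, reusing the fitted vectorizer and matrix from the first snippet (score_career is just a name I made up):

```
def score_career(career_idx, user_skills):
    """Sum the TF-IDF weights of the user's skills in one career document."""
    row = tfidf_matrix[career_idx]
    total = 0.0
    for skill in user_skills:
        col = skill_index.get(skill)  # skills unseen in the corpus are skipped
        if col is not None:
            total += row[0, col]
    return total

user_skills = ["python", "sql", "statistics"]
ranking = sorted(
    ((career, score_career(i, user_skills)) for i, career in enumerate(careers)),
    key=lambda x: x[1],
    reverse=True,
)
print(ranking)
```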

Am I thinking along the right lines here? If so, are there any algorithms that work along these lines, but are more sophisticated than a simple sum? Thanks for the help!


There are 2 answers

Alikbar (Best Answer)

The second approach you described will work, but there are better ways to solve this kind of problem. First, learn a little about language models and move beyond the vector space model. Second, since your problem is close to expert finding/profiling, study a baseline language-modeling framework for that task. You can implement "A language modeling framework for expert finding" with small changes so that its formulas fit your problem. Reading "On the assessment of expertise profiles" will also give you a better understanding of expert profiling with that framework. You can find more good ideas, resources and projects on expert finding/profiling on Balog's blog.
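
To make the idea concrete, here is a rough sketch of the kind of scoring a language-modeling approach uses. It is not the exact model from the paper, just a plain unigram model with Jelinek-Mercer smoothing; the variable names, the toy data and the lambda value are only illustrative:

```
import math
from collections import Counter

def lm_score(user_skills, career_tokens, collection_tokens, lam=0.5):
    """log p(user skills | career) under a smoothed unigram language model:

    p(skill | career) = (1 - lam) * tf(skill, career) / |career|
                        + lam * tf(skill, collection) / |collection|
    """
    career_tf, coll_tf = Counter(career_tokens), Counter(collection_tokens)
    career_len, coll_len = sum(career_tf.values()), sum(coll_tf.values())

    score = 0.0
    for skill in user_skills:
        p = ((1 - lam) * career_tf[skill] / career_len
             + lam * coll_tf[skill] / coll_len)
        if p == 0:
            return float("-inf")  # skill never seen anywhere in the corpus
        score += math.log(p)
    return score

# Toy example: one token list per career, plus the whole collection
careers = {
    "Data Analyst": "sql excel python sql tableau".split(),
    "Web Developer": "javascript html css javascript react".split(),
}
collection = [t for tokens in careers.values() for t in tokens]
user_skills = ["python", "sql"]
print(sorted(careers, key=lambda c: lm_score(user_skills, careers[c], collection),
             reverse=True))
```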

hardyVeles

I would take the SSRM [1] approach and expand the query (job documents) using WordNet (extracted database [2]) as a semantic lexicon, so you are not constrained to direct word-vs-word matches. SSRM has its own similarity measure (I believe the paper is open access; if not, check this: http://blog.veles.rs/document-similarity-computation-models-literature-review/, which lists many similarity computation models). Alternatively, if your corpus is big enough, you might try LSA/LSI [3,4] (also covered on that page) without using an external lexicon. But if your data is in English, WordNet's semantic graph is really rich in all directions (hyponyms, synonyms, hypernyms... concepts/SynSets).
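
As a very rough illustration of the expansion idea (not SSRM itself, which has its own WordNet-based similarity measure), here is a sketch using NLTK's WordNet interface; it assumes nltk is installed and the wordnet corpus has been downloaded:

```
# pip install nltk, then run nltk.download("wordnet") once
from nltk.corpus import wordnet as wn

def expand_skill(skill):
    """Return the skill plus the lemma names of its WordNet synsets."""
    terms = {skill}
    for synset in wn.synsets(skill):
        for lemma in synset.lemma_names():
            terms.add(lemma.lower().replace("_", " "))
    return terms

print(expand_skill("programming"))
# The expanded terms can then be matched against the career documents,
# instead of relying on exact word-vs-word matches.
```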

The bottom line: I would avoid simple SVM/TF-IDF for such a concrete domain. I measured a really serious margin for SSRM over TF-IDF/VSM (macro-averaged F1, 5-class single-label classification, narrow domain).

[1] A. Hliaoutakis, G. Varelas, E. Voutsakis, E.G.M. Petrakis, E. Milios, Information Retrieval by Semantic Similarity, Int. J. Semant. Web Inf. Syst. 2 (2006) 55–73. doi:10.4018/jswis.2006070104.

[2] J.E. Petralba, An extracted database content from WordNet for Natural Language Processing and Word Games, in: 2014 Int. Conf. Asian Lang. Process., 2014: pp. 199–202. doi:10.1109/IALP.2014.6973502.

[3] P.W. Foltz, Latent semantic analysis for text-based research, Behav. Res. Methods, Instruments, Comput. 28 (1996) 197–202. doi:10.3758/BF03204765.

[4] A. Kashyap, L. Han, R. Yus, J. Sleeman, T. Satyapanich, S. Gandhi, T. Finin, Robust semantic text similarity using LSA, machine learning, and linguistic resources, Springer Netherlands, 2016. doi:10.1007/s10579-015-9319-2.