I am building a Machine Learning recommendation system for matching candidates with job postings.

I have two data sets: one of job postings and one of candidates. The job postings were originally retrieved in Swedish from the Swedish Unemployment Agency, and I wrote a Python script to translate them into English. Each posting has a title and a description, which is free text of anywhere from one to 20 sentences. The description field covers everything in the posting, from responsibilities to required skills.

The candidate data set contains age, education, previous experience, knowledge, and skills for each candidate. Each candidate has up to six skills. I collected all skills that occur in the data set and one-hot encoded them: there is one column per possible skill, set to 1 if the candidate has that skill and 0 otherwise.
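As an illustration only, the encoding described above looks roughly like the sketch below. The candidate IDs and skill names are invented for the example and are not taken from the actual data set.

```python
# Hypothetical candidates, each with a short list of skills (invented data).
candidate_skills = {
    "cand_1": ["python", "sql"],
    "cand_2": ["welding"],
    "cand_3": ["sql", "excel", "sales"],
}

# Collect every skill that occurs anywhere; each one becomes a column.
all_skills = sorted({s for skills in candidate_skills.values() for s in skills})

# One row per candidate: 1 if the candidate has the skill, 0 otherwise.
one_hot = {
    cand: [1 if skill in skills else 0 for skill in all_skills]
    for cand, skills in candidate_skills.items()
}

for cand, row in one_hot.items():
    print(cand, dict(zip(all_skills, row)))
```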

Now I need to prepare the data for training the model. I have already split the candidates into a training set and a test set. Next I need a way to extract keywords from the job descriptions and compare them to the candidates' skills. Do you have any ideas on how to do that, from extracting and defining keywords to cross-checking each candidate against each job posting?

Any help would be much appreciated!


1 Answer

Ashargin On

You want to build a recommender model.

I am going to assume that you have target data, that is, candidates and job postings that you know are linked. Without it I can't see how you could do this with machine learning; all you could do is use your own knowledge to write rules by hand. Your brain has data from your life experience, but the algorithm does not.

This is probably going to be a matrix factorisation problem. I recommend you try a WNMF (weighted non-negative matrix factorisation) model.
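For reference, here is a minimal NumPy sketch of weighted NMF using the standard multiplicative updates for the weighted squared error. The function name `wnmf`, the matrix names, and the convention that a weight of 1 marks an observed candidate/job pair and 0 an unknown one are my assumptions for illustration, not part of any particular library.

```python
import numpy as np


def wnmf(V, W, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise V ~ P @ Q with non-negative P, Q, weighting each entry by W.

    V: (candidates x jobs) target matrix, e.g. 1 = matched, 0 = not matched.
    W: same shape, non-negative weights (assumed: 1 = observed, 0 = unknown).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    P = rng.random((n, rank)) + 0.1  # candidate factors, strictly positive start
    Q = rng.random((rank, m)) + 0.1  # job factors, strictly positive start
    WV = W * V
    for _ in range(n_iter):
        # Multiplicative updates keep P and Q non-negative.
        P *= (WV @ Q.T) / ((W * (P @ Q)) @ Q.T + eps)
        Q *= (P.T @ WV) / (P.T @ (W * (P @ Q)) + eps)
    return P, Q


def weighted_error(V, W, P, Q):
    """Weighted squared reconstruction error, ignoring entries with weight 0."""
    return float(np.sum(W * (V - P @ Q) ** 2))
```

The product `P @ Q` then gives a score for every candidate/job pair, including the pairs whose weight was 0 during fitting.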

To do that, there are three steps:

Try embedding layers on your candidates' characteristics (one for each characteristic). Add those vectors together; the sum is the representation of the candidate in the latent space.

Find a way to convert your job postings to vectors of the same length. You may want to look at doc2vec for that. This is by far the hardest step, because it is difficult to transform text into a vector while preserving its information. That is why it may also be a good idea to build the function that maps each document to a vector yourself, even if it is not machine learning: does the document include the word "computer"? Does it require a lot of experience? Work out which features are important and build a vector from them.

Compute the dot product of the candidate and job posting vectors to get your prediction. Compare it to your target (1 if the candidate was linked to the job, 0 if not). If you train this as a regression, the prediction can be read as something comparable to the probability that the candidate and the job are matched.
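The three steps above can be sketched in plain NumPy as follows. This is only a rough illustration under several assumptions: the keyword list, the table sizes, the use of just two characteristics (education and skills), and the helper names are all made up, and the job vector is the hand-built keyword variant rather than doc2vec. Keyword matching here is a naive substring test, so "java" would also match "javascript".

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2 features: hypothetical keywords, one vector dimension per keyword.
KEYWORDS = ["python", "sql", "java", "excel", "driving", "nursing", "welding", "sales"]
DIM = len(KEYWORDS)

N_EDUCATION = 4  # assumed number of education levels
N_SKILLS = 10    # assumed number of one-hot skill columns

# Step 1: one embedding table per candidate characteristic.
education_emb = rng.normal(scale=0.1, size=(N_EDUCATION, DIM))
skill_emb = rng.normal(scale=0.1, size=(N_SKILLS, DIM))


def candidate_vector(education_id, skill_onehot):
    """Sum the education embedding and the embeddings of the skills held."""
    return education_emb[education_id] + skill_onehot @ skill_emb


def job_vector(description):
    """Hand-built vector: 1.0 where the keyword occurs in the text, else 0.0."""
    text = description.lower()
    return np.array([1.0 if kw in text else 0.0 for kw in KEYWORDS])


def match_score(cand_vec, job_vec):
    """Step 3: the prediction is the dot product of the two vectors."""
    return float(cand_vec @ job_vec)


def train_step(education_id, skill_onehot, job_vec, target, lr=0.05):
    """One gradient-descent step on the squared error (score - target)**2."""
    error = match_score(candidate_vector(education_id, skill_onehot), job_vec) - target
    education_emb[education_id] -= lr * 2 * error * job_vec
    skill_emb[...] -= lr * 2 * error * np.outer(skill_onehot, job_vec)
    return error ** 2


# Usage: one candidate known to be linked to one posting (target = 1).
skills = np.zeros(N_SKILLS)
skills[[0, 3]] = 1.0  # candidate holds skills 0 and 3
job = job_vector("We need a Python developer with SQL experience.")
for _ in range(30):
    loss = train_step(2, skills, job, target=1.0)
print(match_score(candidate_vector(2, skills), job), loss)
```

In practice you would loop over many (candidate, job, target) triples, including unlinked pairs with target 0, and a deep learning framework's embedding layers would replace the hand-written gradient step.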