I want to calculate text similarity using only the words with specific POS tags. Currently I calculate similarity with the cosine method, but it does not take POS tags into account.
library(quanteda)  # provides corpus(), dfm(), docnames() and ndoc()
A <- data.frame(name = c(
"X-ray right leg arteries",
"consultation of gynecologist",
"x-ray leg arteries",
"x-ray leg with 20km distance"
), stringsAsFactors = F)
B <- data.frame(name = c(
"X-ray left leg arteries",
"consultation (inspection) of gynecalogist",
"MRI right leg arteries",
"X-ray right leg arteries with special care"
), stringsAsFactors = F)
corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")
docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")
dtm3 <- rbind(dfm(corp1, ngrams=2), dfm(corp2, ngrams=2))
cosines <- lapply(docnames(corp2),
function(x) textstat_simil(dtm3[c(x, docnames(corp1)), ],
method = "cosine",
selection = x)[-1, , drop = FALSE])
do.call(cbind, cosines)
In the above example, "X-ray right leg arteries" should not be mapped to "MRI right leg arteries", as these are two different categories of services. Unfortunately, I don't have an explicit categorization of services; I only have the service text. Is it possible to use POS tagging to assign more importance to words such as "X-ray", "consultation", "leg" and "arteries"? The services in the code are just a sample. In reality, I have more than 10K services. I explored the udpipe package for POS tagging but didn't have much success.
To do POS tagging with udpipe, you can proceed as follows (based on your example data A and B).
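The original code for this step is not shown here, so the following is only a minimal sketch of what the tagging step could look like. It assumes the pre-trained English model that udpipe downloads by default; the names `txt` and `anno` and the `A.1` / `B.1` document ids are choices made for this illustration.

```r
library(udpipe)

# Example data from the question
A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)
B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = FALSE)

# Stack both sets and give every service a document id (A.1, ..., B.4)
txt <- rbind(A, B)
txt$id <- c(paste("A", seq_len(nrow(A)), sep = "."),
            paste("B", seq_len(nrow(B)), sep = "."))

# Download the English model (needs internet access the first time) and load it
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# Annotate the text: the result has one row per token,
# including the lemma and the universal POS tag (upos)
anno <- udpipe_annotate(ud_model, x = txt$name, doc_id = txt$id)
anno <- as.data.frame(anno)

anno[, c("doc_id", "token", "lemma", "upos")]
```

The `upos` column can then be used to keep only the word classes you care about, for example `subset(anno, upos %in% c("NOUN", "PROPN"))`. How short, hyphenated terms such as "X-ray" are tokenized and tagged depends on the model, so it is worth inspecting the output before relying on it.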
If you want to calculate similarities based on a document-term matrix of the lemmas, do as follows (uses sim2 from the text2vec R package).

If you also want to bring n-grams of nouns into the game, do as follows: extract nouns that follow one another, create a document-term matrix of these new compound terms, and combine it with the existing document-term matrix so that document similarities are easy to run.
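The two steps above could be sketched roughly as below. This is an illustration under assumptions, not the original code: the choice of POS tags to keep (NOUN, PROPN, ADJ), the `services`, `single` and `compounds` names, and joining two adjacent nouns with an underscore are all choices made for this example. Instead of building two matrices and binding them, it stacks the single lemmas and the noun compounds into one long table and builds a single document-term matrix from it, which avoids having to align rows.

```r
library(udpipe)
library(text2vec)
library(Matrix)

services <- data.frame(
  doc_id = c(paste0("A.", 1:4), paste0("B.", 1:4)),
  text = c("X-ray right leg arteries", "consultation of gynecologist",
           "x-ray leg arteries", "x-ray leg with 20km distance",
           "X-ray left leg arteries", "consultation (inspection) of gynecalogist",
           "MRI right leg arteries", "X-ray right leg arteries with special care"),
  stringsAsFactors = FALSE
)

ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
anno <- as.data.frame(udpipe_annotate(ud_model, x = services$text,
                                      doc_id = services$doc_id))
anno$term <- tolower(anno$lemma)

# Single terms: lemmas of the POS tags we want to keep (assumed selection)
keep_pos <- c("NOUN", "PROPN", "ADJ")
single <- anno[anno$upos %in% keep_pos, c("doc_id", "term")]

# Compound terms: two nouns directly following one another
# within the same document and sentence
n <- nrow(anno)
is_noun <- anno$upos %in% c("NOUN", "PROPN")
idx <- which(is_noun[-n] & is_noun[-1] &
               anno$doc_id[-n] == anno$doc_id[-1] &
               anno$sentence_id[-n] == anno$sentence_id[-1])
compounds <- data.frame(
  doc_id = anno$doc_id[idx],
  term = paste(anno$term[idx], anno$term[idx + 1], sep = "_"),
  stringsAsFactors = FALSE
)

# One document-term matrix over single lemmas plus noun compounds
dtf <- document_term_frequencies(rbind(single, compounds),
                                 document = "doc_id", term = "term")
dtm <- document_term_matrix(dtf)

# Cosine similarity of every A service (rows) against every B service (columns)
rows_a <- grepl("^A\\.", rownames(dtm))
sims <- sim2(dtm[rows_a, , drop = FALSE], dtm[!rows_a, , drop = FALSE],
             method = "cosine", norm = "l2")
sims
```

Leaving out the `compounds` rows gives the plain lemma-based similarity. Because a compound such as two adjacent nouns adds an extra shared term, services that agree on their noun sequence score higher than services that only share individual words. A document with no token of the selected POS tags will not appear in `dtm`, which is why the A/B split is taken from `rownames(dtm)` rather than from `services`.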