I am attempting to train a Support Vector Machine to help detect similarity between strings. My training data consists of two text fields and a third field containing 0 or 1 to indicate similarity. This last field was calculated with the help of an edit distance operation. I know that I need to convert the two text fields to numeric values before continuing. What is the best method to achieve this?
The training data looks like:
ID         MAKTX_Keyword      PH_Level_04_Keyword  Result
266325638  AMLODIPINE         AMLODIPINE           0
724712821  IRBESARTANHCTZ     IRBESARTANHCTZ       0
567428641  RABEPRAZOLE        RABEPRAZOLE          0
137472217  MIRTAZAPINE        MIRTAZAPINE          0
175827784  FONDAPARINUX       ARIXTRA              1
456372747  VANCOMYCIN         VANCOMYCIN           0
653832438  BRUFEN             IBUPROFEN            1
917575539  POTASSIUM          POTASSIUM            0
222949123  DIOSMINHESPERIDIN  DIOSMINHESPERIDIN    0
892725684  IBUPROFEN          IBUPROFEN            0
I have been experimenting with the text2vec library, using this useful vignette as a guide. In doing so, I can presumably represent one of the fields in vector space.
- But how can I use this library to manage both text fields at the same time?
- Should I concatenate the two string fields into a single field?
- Is text2vec the best approach to take?
The code I am using to handle one of the fields:
library(text2vec)
library(data.table)
preproc_func = tolower
token_func = word_tokenizer
it_train = itoken(Train_PRDHA_String.df$MAKTX_Keyword,
                  preprocessor = preproc_func,
                  tokenizer = token_func,
                  ids = Train_PRDHA_String.df$ID,
                  progressbar = TRUE)
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
t1 = Sys.time()
dtm_train = create_dtm(it_train, vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))
dim(dtm_train)
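# The column is named ID (not id); row names are character, so convert before comparing
identical(rownames(dtm_train), as.character(Train_PRDHA_String.df$ID))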
One way to embed docs into the same space is to learn vocabulary from both columns:
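The snippet that belonged here is not shown, so the following is only a minimal sketch of the idea, assuming a toy data frame `df` (a hypothetical stand-in for Train_PRDHA_String.df, built from three rows of the sample data). The vocabulary is learned once from the union of both columns, and each column is then vectorized separately with that shared vectorizer.

```r
library(text2vec)

# Hypothetical toy stand-in for Train_PRDHA_String.df (rows taken from the sample data)
df = data.frame(
  ID = c(175827784, 653832438, 892725684),
  MAKTX_Keyword = c("FONDAPARINUX", "BRUFEN", "IBUPROFEN"),
  PH_Level_04_Keyword = c("ARIXTRA", "IBUPROFEN", "IBUPROFEN"),
  stringsAsFactors = FALSE
)

preproc_func = tolower
token_func = word_tokenizer

# Learn a single vocabulary from the union of both text columns
union_txt = c(df$MAKTX_Keyword, df$PH_Level_04_Keyword)
it_union = itoken(union_txt,
                  preprocessor = preproc_func,
                  tokenizer = token_func,
                  progressbar = FALSE)
vocab = create_vocabulary(it_union)
vectorizer = vocab_vectorizer(vocab)

# Vectorize each column separately, reusing the shared vectorizer
it1 = itoken(df$MAKTX_Keyword,
             preprocessor = preproc_func,
             tokenizer = token_func,
             ids = as.character(df$ID),
             progressbar = FALSE)
dtm_1 = create_dtm(it1, vectorizer)

it2 = itoken(df$PH_Level_04_Keyword,
             preprocessor = preproc_func,
             tokenizer = token_func,
             ids = as.character(df$ID),
             progressbar = FALSE)
dtm_2 = create_dtm(it2, vectorizer)

# Both matrices have the same columns in the same order,
# so row i of dtm_1 and row i of dtm_2 live in the same space
identical(colnames(dtm_1), colnames(dtm_2))
```

Because both matrices come from the same vectorizer, a given term always maps to the same column index in each of them.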
And after that you can combine them into a single matrix:
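The combining code is also not shown. One plausible reading is a column-wise bind of the two document-term matrices, sketched below. The matrices `dtm_1` and `dtm_2` are hypothetical stand-ins built by hand with the Matrix package; in practice they would come from `create_dtm()` with a shared vectorizer. The `f1_`/`f2_` prefixes are my own addition so the two fields stay distinguishable after binding.

```r
library(Matrix)

# Hypothetical stand-ins for the two document-term matrices (one per text column)
ids = c("175827784", "653832438")
terms = c("arixtra", "brufen", "fondaparinux", "ibuprofen")
dtm_1 = sparseMatrix(i = c(1, 2), j = c(3, 2), x = 1, dims = c(2, 4),
                     dimnames = list(ids, terms))
dtm_2 = sparseMatrix(i = c(1, 2), j = c(1, 4), x = 1, dims = c(2, 4),
                     dimnames = list(ids, terms))

# Prefix the column names so features from the two fields do not collide
colnames(dtm_1) = paste0("f1_", colnames(dtm_1))
colnames(dtm_2) = paste0("f2_", colnames(dtm_2))

# One row per record: columns of field 1 followed by columns of field 2
dtm_train = cbind(dtm_1, dtm_2)
dim(dtm_train)
```

The result stays sparse and keeps one row per record, so it can be paired row by row with the Result column as the label.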
However, if you want to solve the problem of duplicate detection, I suggest using char_tokenizer with ngram > 1 (say ngram = c(3, 3)). Also check out the great stringdist package. I assume you obtained Result with some manual human work, because if it is just edit distance, the algorithm will learn at most how edit distance works.
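To make both suggestions concrete, here is a rough sketch rather than a tuned solution. It assumes two toy vectors `a` and `b` taken from the sample rows, builds character trigram features with text2vec, and computes a few pairwise distances with stringdist. Which distance methods to use (`lv`, `jw`, `cosine`) is my own choice for illustration.

```r
library(text2vec)
library(stringdist)

# Toy pairs taken from the sample data
a = c("FONDAPARINUX", "BRUFEN", "IBUPROFEN")
b = c("ARIXTRA", "IBUPROFEN", "IBUPROFEN")

# Character trigrams: char_tokenizer splits strings into single characters,
# and ngram = c(3, 3) makes the vocabulary consist of runs of three characters
it = itoken(c(a, b),
            preprocessor = tolower,
            tokenizer = char_tokenizer,
            progressbar = FALSE)
vocab = create_vocabulary(it, ngram = c(3L, 3L))
dtm_chars = create_dtm(it, vocab_vectorizer(vocab))

# Pairwise string distances as ready-made numeric features, one row per pair
features = data.frame(
  lv   = stringdist(a, b, method = "lv"),             # Levenshtein edit distance
  jw   = stringdist(a, b, method = "jw"),             # Jaro-Winkler family distance
  cos3 = stringdist(a, b, method = "cosine", q = 3)   # cosine distance on trigram profiles
)
features
```

Distance columns like these can be fed to the SVM directly, either on their own or next to the n-gram matrix.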