How to convert text fields into numeric/vector space for a SVM in R Studio?

Question

481 views Asked by UbuntuNewbie At 03 July 2017 at 21:21

I am attempting to train a Support Vector Machine to help detect similarity between strings. My training data consists of two text fields and a third field that contains 0 or 1 to indicate similarity. This last field was calculated with the help of an edit distance operation. I know that I need to convert the two text fields to numeric values before continuing. What is the best method to achieve this?

The training data looks like:

ID          MAKTX_Keyword       PH_Level_04_Keyword   Result
266325638   AMLODIPINE          AMLODIPINE              0
724712821   IRBESARTANHCTZ      IRBESARTANHCTZ          0
567428641   RABEPRAZOLE         RABEPRAZOLE             0
137472217   MIRTAZAPINE         MIRTAZAPINE             0
175827784   FONDAPARINUX        ARIXTRA                 1
456372747   VANCOMYCIN          VANCOMYCIN              0
653832438   BRUFEN              IBUPROFEN               1
917575539   POTASSIUM           POTASSIUM               0
222949123   DIOSMINHESPERIDIN   DIOSMINHESPERIDIN       0
892725684   IBUPROFEN           IBUPROFEN               0

I have been experimenting with the text2vec library, using this useful vignette as a guide. In doing so, I can presumably represent one of the fields in vector space.

But how can I use this library to manage both text fields at the same time?
Should I concatenate the two string fields into a single field?
Is text2vec the best approach to take?

The code that will be used to manage one of the fields:

library(text2vec)
library(data.table)

preproc_func = tolower
token_func = word_tokenizer

it_train = itoken(Train_PRDHA_String.df$MAKTX_Keyword, 
                  preprocessor = preproc_func, 
                  tokenizer = token_func, 
                  ids = Train_PRDHA_String.df$ID, 
                  progressbar = TRUE)
vocab = create_vocabulary(it_train)

vectorizer = vocab_vectorizer(vocab)
t1 = Sys.time()
dtm_train = create_dtm(it_train, vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))

dim(dtm_train)
# The column is named ID (not id), and row names are character strings
identical(rownames(dtm_train), as.character(Train_PRDHA_String.df$ID))

There is 1 answer

**Dmitriy Selivanov** · Accepted Answer · 2017-07-05T12:36:32+00:00

One way to embed docs into the same space is to learn vocabulary from both columns:

preproc_func = tolower
token_func = word_tokenizer
union_txt = c(Train_PRDHA_String.df$MAKTX_Keyword, Train_PRDHA_String.df$PH_Level_04_Keyword)
# No ids here: union_txt is twice as long as the ID column, and ids are not
# needed just to learn the vocabulary
it_train = itoken(union_txt, 
                  preprocessor = preproc_func, 
                  tokenizer = token_func, 
                  progressbar = TRUE)
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)

it1 = itoken(Train_PRDHA_String.df$MAKTX_Keyword, preproc_func, 
             token_func, ids = Train_PRDHA_String.df$ID)
dtm_train_1 = create_dtm(it1, vectorizer)

it2 = itoken(Train_PRDHA_String.df$PH_Level_04_Keyword, preproc_func, 
             token_func, ids = Train_PRDHA_String.df$ID)
dtm_train_2 = create_dtm(it2, vectorizer)

And after that you can combine them into a single matrix:

dtm_train = cbind(dtm_train_1, dtm_train_2)
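As a minimal sketch of the steps above on toy data: the four rows are copied from the question's sample, while the names `df`, `make_dtm` and `dtm_pair` are made up for illustration. It assumes a reasonably recent text2vec and only shows the shape of the combined matrix, one block of columns per text field.

```r
library(text2vec)
library(Matrix)  # sparse matrix classes and the cbind method used below

# Toy rows copied from the question's sample data
df <- data.frame(
  ID = c(266325638, 175827784, 653832438, 892725684),
  MAKTX_Keyword = c("AMLODIPINE", "FONDAPARINUX", "BRUFEN", "IBUPROFEN"),
  PH_Level_04_Keyword = c("AMLODIPINE", "ARIXTRA", "IBUPROFEN", "IBUPROFEN"),
  Result = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Learn one vocabulary from both columns so they share the same term space
it_all <- itoken(c(df$MAKTX_Keyword, df$PH_Level_04_Keyword),
                 preprocessor = tolower,
                 tokenizer = word_tokenizer,
                 progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it_all))

# Hypothetical helper: vectorize one column with the shared vectorizer
make_dtm <- function(txt) {
  it <- itoken(txt,
               preprocessor = tolower,
               tokenizer = word_tokenizer,
               ids = as.character(df$ID),
               progressbar = FALSE)
  create_dtm(it, vectorizer)
}

dtm_1 <- make_dtm(df$MAKTX_Keyword)
dtm_2 <- make_dtm(df$PH_Level_04_Keyword)

# One row per pair; columns are the vocabulary repeated once per field
dtm_pair <- cbind(dtm_1, dtm_2)
dim(dtm_pair)
```

Note that both blocks carry the same column names after `cbind`, so it may be worth prefixing them if the downstream model or any inspection code relies on unique names.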

However, if you want to solve the problem of duplicate detection, I suggest using char_tokenizer with n-grams longer than 1 (say ngram = c(3, 3)). Also check out the great stringdist package. I assume the Result column came from some manual human work, because if it is just edit distance, the algorithm will learn at most how edit distance works.
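A rough sketch of both suggestions, under a few assumptions: the vectors `a` and `b` are toy values taken from the question's sample, `ngram` is passed to `create_vocabulary` (where text2vec takes it), and the three distance methods are just examples of what stringdist offers, not a recommendation from the answer.

```r
library(text2vec)
library(stringdist)

# Toy pairs taken from the question's sample data
a <- c("AMLODIPINE", "FONDAPARINUX", "BRUFEN", "IBUPROFEN")
b <- c("AMLODIPINE", "ARIXTRA", "IBUPROFEN", "IBUPROFEN")

# Hypothetical helper: iterator over single characters
char_it <- function(txt) {
  itoken(txt, preprocessor = tolower, tokenizer = char_tokenizer,
         progressbar = FALSE)
}

# Character trigrams: tokenize into characters, then let create_vocabulary
# build n-grams of length exactly 3 from both columns
vocab <- create_vocabulary(char_it(c(a, b)), ngram = c(3L, 3L))
vectorizer <- vocab_vectorizer(vocab)

dtm_a <- create_dtm(char_it(a), vectorizer)
dtm_b <- create_dtm(char_it(b), vectorizer)

# Pairwise string distances from stringdist, one row per (a, b) pair
features <- data.frame(
  lv = stringdist(a, b, method = "lv"),                      # Levenshtein
  jw = stringdist(a, b, method = "jw"),                      # Jaro-Winkler family
  jaccard_3gram = stringdist(a, b, method = "jaccard", q = 3)
)
features
```

The distance columns could be fed to the classifier alongside, or instead of, the trigram matrices. If Result itself was derived from an edit distance, a model given `lv` as a feature would mostly just recover the threshold.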
