I need some help with one task using the tm package. I have a csv file with ~300 rows and 42 variables, some of which contain NA values. I want to use tm to clean this file up before I load the data into an NLP application. Specifically, I want to remove stopwords, numbers, and punctuation. Stemming is probably not required. The last five columns are the main ones requiring cleanup. Importantly, the NLP application accepts input as a table, and that is how I would like to have the input structured.
Ideally, I would like to use tm to convert the data frame to a corpus, perform the cleanup, and then return the cleaned text to the structure of the csv file to use as input to the NLP app.
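To make the target concrete, here is a rough sketch of the round trip I have in mind, written in base R only so that it does not depend on any particular tm version. The toy data frame, its column names, the short `stop_words` vector, and the `clean_text` helper are all made up for illustration; my real data has 42 columns, and a fuller stopword list such as tm's `stopwords("english")` would presumably replace the toy one. Unlike my tm call with `preserve_intra_word_dashes = TRUE`, this simplified version turns every punctuation mark, including dashes, into a space.

```r
# Toy stand-in for the real data frame; the column names are hypothetical.
Tiz <- data.frame(
  MetaID  = c(132884, 132885),
  score   = c(2, NA),
  comment = c("The pump failed 3 times -- re-start needed!", NA),
  stringsAsFactors = FALSE
)

# Hypothetical short stopword list, for illustration only.
stop_words <- c("the", "a", "an", "and", "of", "to", "is", "was")

# Clean a character vector element by element, leaving NA entries as NA.
clean_text <- function(x, stop_words) {
  ok <- !is.na(x)
  y <- tolower(x[ok])
  y <- gsub("[[:digit:]]+", " ", y, perl = TRUE)   # drop numbers
  y <- gsub("[[:punct:]]+", " ", y, perl = TRUE)   # drop punctuation
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  y <- gsub(pattern, " ", y, perl = TRUE)          # drop stopwords
  y <- gsub("\\s+", " ", y, perl = TRUE)           # collapse whitespace
  x[ok] <- trimws(y)
  x
}

# Clean only the free-text columns; every other column is left untouched,
# so the table keeps its original rows and columns.
text_cols <- c("comment")
Tiz[text_cols] <- lapply(Tiz[text_cols], clean_text, stop_words = stop_words)

# Write the cleaned table back out as csv (a temporary file in this sketch).
out_file <- tempfile(fileext = ".csv")
write.csv(Tiz, out_file, row.names = FALSE)
```

The point of the sketch is that the cleaning is applied column by column, so the numeric columns and the NA values never get pasted into the text.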
I am testing my ability to do this task with text data in a smaller csv file. It is 7 rows by 42 variables.
Using RStudio, I have done the following:
Tiz.corpus <- Corpus(DataframeSource(Tiz))
inspect(Tiz.corpus)
A corpus with 7 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID
....
At this point I did the following...
Tiz.corpus <- tm_map(Tiz.corpus, tolower) # Make lowercase
Tiz.corpus <- tm_map(Tiz.corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
Tiz.corpus <- tm_map(Tiz.corpus, removeWords, stopwords("english")) # Remove stopwords
So far so good. I then tried...
writeCorpus(Tiz.corpus)
What I get is 7 documents with contents like this:
132884
2
2
2
1
2
na
na
na
3
3
3
2
na
na
na
na
na
na
na
2
1
na
na
2
2
2
2
2
2
2
2
2
2
2
2
na
2
7
4
3
2
I am not sure what to do at this point to recover my text data and have it in the structure of the original csv file.
Is tm the wrong tool for this job?
Jose