A bit of a bizarre question: is there a way to split corpus documents that have been imported using the Corpus function in tm into multiple documents that can then be read back into my corpus as separate documents? For example, if I used
inspect(documents[1])
and had something like
`<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>`
`[[1]]`
`<<PlainTextDocument (metadata: 7)>>`
The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff
and I want to split the document after each occurrence of the phrase "I want to split after this line!!!" (it appears twice in this case), is that possible?
The end result would look like this after using inspect(documents):
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
[[2]]
<<PlainTextDocument (metadata: 7)>>
Hi mom
Purple is my favorite color
I want to split after this line!!!
[[3]]
<<PlainTextDocument (metadata: 7)>>
Words
And stuff
You can use `strsplit` to split your document, then recreate the corpus from the resulting pieces.
To test this, we should first create a reproducible example:
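A minimal sketch of such an example, assuming the whole text is stored as one newline-separated string in a single document. `VCorpus` is called explicitly here (an assumption) so that the result matches the `VCorpus` output shown in the question; the variable name `text` is only illustrative.

```r
library(tm)

# Sample text from the question, collapsed into ONE string so that it
# becomes ONE document (assumption: the real document is stored the same way).
text <- paste(c(
  "The quick brown fox jumped over the lazy dog",
  "I think cats are really cool",
  "I want to split after this line!!!",
  "Hi mom",
  "Purple is my favorite color",
  "I want to split after this line!!!",
  "Words",
  "And stuff"
), collapse = "\n")

# A corpus holding a single document, as in the question.
documents <- VCorpus(VectorSource(text))
length(documents)
```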
Then apply the previous solution:
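The split-and-rebuild step could look roughly like the sketch below. It rebuilds the sample corpus so that it runs on its own; the names `separator`, `pieces`, and `split.docs` are illustrative, and collapsing with `paste` is a precaution in case the document is stored as several lines rather than one string.

```r
library(tm)

separator <- "I want to split after this line!!!"

# Same sample corpus as in the question: one document, one string.
text <- paste(c(
  "The quick brown fox jumped over the lazy dog",
  "I think cats are really cool", separator,
  "Hi mom", "Purple is my favorite color", separator,
  "Words", "And stuff"
), collapse = "\n")
documents <- VCorpus(VectorSource(text))

# Coerce the document to a single character string.
doc_text <- paste(as.character(documents[[1]]), collapse = "\n")

# Split on the literal phrase; fixed = TRUE avoids regex interpretation.
pieces <- strsplit(doc_text, separator, fixed = TRUE)[[1]]

# Each piece becomes its own document in a new corpus.
split.docs <- VCorpus(VectorSource(pieces))
length(split.docs)
```

With this input the new corpus should contain three documents, one per piece.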
Now inspect the result:
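One way to look at the result is sketched here. `inspect` prints tm's own summary, whose layout depends on the installed tm version, so the loop also prints each document as plain text. The pieces are written out by hand (an assumption, to keep the snippet self-contained); they are what splitting the sample text on the phrase yields.

```r
library(tm)

# Pieces as produced by splitting the sample text on the phrase
# (written out by hand here so the snippet runs on its own).
pieces <- c(
  "The quick brown fox jumped over the lazy dog\nI think cats are really cool\n",
  "\nHi mom\nPurple is my favorite color\n",
  "\nWords\nAnd stuff"
)
split.docs <- VCorpus(VectorSource(pieces))

# tm's summary view of the corpus.
inspect(split.docs)

# Plain-text view of every document, independent of tm's print format.
for (i in seq_along(split.docs)) {
  cat("---- document", i, "----\n")
  cat(as.character(split.docs[[i]]), sep = "\n")
}
```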
Note that `strsplit` removes the separator :)
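If the separator line should stay at the end of each piece, as in the desired output in the question, one option is to paste it back after splitting. This is only a sketch under two assumptions: the text does not end with the separator (`strsplit` drops a trailing empty piece), and stripping surrounding whitespace from each piece with `trimws` is acceptable.

```r
library(tm)

separator <- "I want to split after this line!!!"
text <- paste(c(
  "The quick brown fox jumped over the lazy dog",
  "I think cats are really cool", separator,
  "Hi mom", "Purple is my favorite color", separator,
  "Words", "And stuff"
), collapse = "\n")

# Split on the literal phrase, then drop the stray newlines left at the edges.
pieces <- trimws(strsplit(text, separator, fixed = TRUE)[[1]])

# Re-attach the separator to every piece except the last one.
idx <- seq_len(length(pieces) - 1)
pieces[idx] <- paste(pieces[idx], separator, sep = "\n")

split.docs <- VCorpus(VectorSource(pieces))
cat(as.character(split.docs[[1]]), sep = "\n")
```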