Splitting a document from a tm Corpus into multiple documents


A bit of a bizarre question: is there a way to split corpus documents that have been imported using the Corpus function in tm into multiple documents that can then be read back into my Corpus as separate documents? For example, suppose I used inspect(documents[1]) and had something like

`<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>`

`[[1]]`

`<<PlainTextDocument (metadata: 7)>>`

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

Hi mom

Purple is my favorite color

I want to split after this line!!!

Words

And stuff

and I want to split the document after each occurrence of the phrase "I want to split after this line!!!" (twice in this case). Is that possible?

The end result would look like this after using inspect(documents):

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]

<<PlainTextDocument (metadata: 7)>>

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

[[2]]

<<PlainTextDocument (metadata: 7)>>

Hi mom

Purple is my favorite color

I want to split after this line!!!

[[3]]

<<PlainTextDocument (metadata: 7)>>

Words

And stuff

There are 2 answers

agstudy (best answer)

You can use strsplit to split your document, then recreate the corpus:

Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),   ## coerce the document to character
           "I want to split after this line!!!",
           fixed = TRUE)[[1]]))            ## fixed = TRUE matches the separator
                                           ## literally instead of as a regex

To test this, we should first create a reproducible example:

documents <- Corpus(VectorSource(paste(readLines(textConnection("The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff")),collapse='\n')))

Then apply the previous solution:

split.docs <- Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",   
           fixed=TRUE)[[1]]))  

Now inspect the result:

inspect(split.docs)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool


[[2]]
<<PlainTextDocument (metadata: 7)>>

Hi mom
Purple is my favorite color


[[3]]
<<PlainTextDocument (metadata: 7)>>

Words
And stuff

Note that strsplit removes the separator from the resulting pieces :)
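Since the question asks to split *after* the marker line, which suggests keeping it in each piece, here is a minimal base-R sketch of one way to retain it. It assumes the separator always sits on a line of its own. The names `split_after_line`, `txt`, `sep` and `pieces` are illustrative and not part of tm.

```r
## Hypothetical helper (not part of tm): split a text after every line that
## equals `sep`, keeping the separator line at the end of each piece.
## Assumes the separator always occupies a whole line by itself.
split_after_line <- function(text, sep) {
  lines  <- strsplit(text, "\n", fixed = TRUE)[[1]]
  is_sep <- lines == sep
  ## a new group starts on the line that follows each separator line
  group  <- cumsum(c(FALSE, head(is_sep, -1)))
  unname(vapply(split(lines, group), paste, character(1), collapse = "\n"))
}

txt <- paste(c("The quick brown fox jumped over the lazy dog",
               "I think cats are really cool",
               "I want to split after this line!!!",
               "Hi mom",
               "Purple is my favorite color",
               "I want to split after this line!!!",
               "Words",
               "And stuff"), collapse = "\n")

pieces <- split_after_line(txt, "I want to split after this line!!!")
length(pieces)  ## 3
```

The resulting character vector can then be turned into a corpus the same way as above, with Corpus(VectorSource(pieces)).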

Ken Benoit

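Here's an even easier way, using the quanteda package (newer quanteda releases provide char_segment() and corpus_segment() for this in place of segment()):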

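library(quanteda)
## mytext is a character vector holding the text of the document(s)
mytextSegmented <- segment(mytext, what = "other", delimiter = "I want to split after this line!!!")
mytextSegmented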

This produces a list of length 1 (since it is designed to work with multiple documents, if you wish), but you can always unlist() it if you just want a vector.

[[1]]
[1] "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n"
[2] "\n    \nHi mom\n\nPurple is my favorite color\n\n"                               
[3] "\n    \nWords\n\nAnd stuff" 

This can be read back into a quanteda corpus using corpus(mytextSegmented) or a tm corpus for subsequent processing.