Splitting a document from a tm Corpus into multiple documents

Question

Asked by src471 on 17 June 2015 at 20:31

A bit of a bizarre question: is there a way to split documents that were imported with the Corpus function in tm into multiple documents, which can then be read back into my corpus as separate documents? For example, if I used inspect(documents[1]) and had something like

`<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>`

`[[1]]`

`<<PlainTextDocument (metadata: 7)>>`

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

Hi mom

Purple is my favorite color

I want to split after this line!!!

Words

And stuff

and I wanted to split the document after each occurrence of the phrase "I want to split after this line!!!" (twice in this case), is that possible?

The end result would look like this after using inspect(documents):

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]

<<PlainTextDocument (metadata: 7)>>

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

[[2]]

<<PlainTextDocument (metadata: 7)>>

Hi mom

Purple is my favorite color

I want to split after this line!!!

[[3]]

<<PlainTextDocument (metadata: 7)>>

Words

And stuff

Original Q&A

There are 2 answers

Ken Benoit On 18 June 2015 at 21:15

Here's an even easier way, using the quanteda package:

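library(quanteda)
## mytext: character vector holding the document text
## (newer quanteda releases replace segment() with char_segment() / corpus_segment())
segment(mytext, what = "other", delimiter = "I want to split after this line!!!")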

This produces a list of length 1 (since it is designed to work with multiple documents, if you wish), but you can always unlist() it if you just want a vector.

[[1]]
[1] "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n"
[2] "\n    \nHi mom\n\nPurple is my favorite color\n\n"                               
[3] "\n    \nWords\n\nAnd stuff"

If you assign the result to a variable (here called mytextSegmented), it can be read back into a quanteda corpus using corpus(mytextSegmented), or into a tm corpus, for subsequent processing.

**agstudy** · Accepted Answer · 2015-06-17T20:47:44+00:00

You can use strsplit to split your document, then recreate the corpus:

Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",
           fixed = TRUE)[[1]]))           ## fixed = TRUE matches the separator
                                          ## literally rather than as a regex

To test this, we should first create a reproducible example:

documents <- Corpus(VectorSource(paste(readLines(textConnection("The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff")),collapse='\n')))

Then apply the previous solution:

split.docs <- Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",   
           fixed=TRUE)[[1]]))

Now inspect the result:

inspect(split.docs)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool


[[2]]
<<PlainTextDocument (metadata: 7)>>

Hi mom
Purple is my favorite color


[[3]]
<<PlainTextDocument (metadata: 7)>>

Words
And stuff
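
The pieces above still carry the newlines that surrounded the separator, which is why documents 2 and 3 start with a blank line. A minimal base R sketch of trimming those, and of splitting several texts at once, might look like the following. The `texts` vector is a made-up example and `pieces` and `sep` are illustrative names, not part of the original answer.

```r
## Hypothetical input: two raw texts, each containing the separator once
texts <- c("The quick brown fox\nI want to split after this line!!!\nHi mom",
           "Words\nI want to split after this line!!!\nAnd stuff")
sep <- "I want to split after this line!!!"

## Split every text on the literal separator and flatten into one vector
pieces <- unlist(strsplit(texts, sep, fixed = TRUE))

## Strip the leading/trailing whitespace left around the separator,
## then drop any pieces that ended up empty
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]
pieces
```

The cleaned `pieces` vector can then be passed to Corpus(VectorSource(pieces)) in the same way as in the answer.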

Note that strsplit removes the separator from the resulting documents :)
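
If the separator line should stay at the end of each piece, as in the question's desired output, one possible approach is to work line by line and start a new group after every separator line. This is only a sketch in base R, and it assumes the separator always occupies a whole line by itself. The names `lines`, `is_sep`, `group` and `docs` are illustrative.

```r
## Same sample text as in the question
text <- paste(c("The quick brown fox jumped over the lazy dog",
                "I think cats are really cool",
                "I want to split after this line!!!",
                "Hi mom",
                "Purple is my favorite color",
                "I want to split after this line!!!",
                "Words",
                "And stuff"), collapse = "\n")
sep <- "I want to split after this line!!!"

## Break the text into lines and flag the separator lines
## (assumption: the separator is always a complete line)
lines  <- strsplit(text, "\n", fixed = TRUE)[[1]]
is_sep <- lines == sep

## A new group begins on the line *after* each separator line
group <- cumsum(c(FALSE, head(is_sep, -1)))

## Glue each group's lines back together; the separator stays in its group
docs <- unname(vapply(split(lines, group), paste, character(1), collapse = "\n"))
length(docs)
```

As before, Corpus(VectorSource(docs)) would turn the result into a tm corpus with one document per piece.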
