From long to wide format with the same duplicates


I am trying to run this code:

library("spacyr")
library("dplyr", warn.conflicts = FALSE)

mytext <- data.frame(text = c("test text", "section 2 sending"), 
                     id = c(32,41))
df2 <- tidyr::separate_rows(mytext, text)

df3 <- data.frame(text = df2$text, id = df2$id)

dflemma <- spacy_parse(structure(df3$text, names = df3$id),
                       lemma = TRUE, pos = FALSE)  %>%
    mutate(id = doc_id) %>%
    group_by(id) %>%
    summarize(body = paste(lemma, collapse = " "))

The expected output goes from long back to wide format: one row per id, with the lemmatized words of each id merged into a single string separated by spaces. Here is the expected output:

data.frame(text = c("test text", "section 2 send"),
           id = c(32,41))

However, the code produces this error:

Error in process_document(x, multithread) : Docnames are duplicated.

There are 2 answers

ekoam (best answer)

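You get this error because separating each text phrase into words repeats every id once per word, so the names you pass to spacy_parse are no longer unique. You don't need that step: pass the whole phrases, whose ids are already unique. Consider the following code: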

mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
dflemma <- 
  spacy_parse(structure(mytext$text, names = mytext$id), lemma = TRUE, pos = FALSE) %>% 
  group_by(id = doc_id) %>% 
  summarise(text = paste(lemma, collapse = " "))

Output

> dflemma
# A tibble: 2 x 2
  id    text          
  <chr> <chr>         
1 32    test text     
2 41    section 2 send
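
To see where the duplicated names come from, here is a minimal sketch that needs neither spacyr nor tidyr. It assumes that splitting on a single space with strsplit is an adequate stand-in for tidyr::separate_rows on this toy data; the variable names words and docs are only for illustration.

```r
mytext <- data.frame(text = c("test text", "section 2 sending"),
                     id = c(32, 41), stringsAsFactors = FALSE)

# Base R stand-in for tidyr::separate_rows(): one row per word, id repeated
words <- strsplit(mytext$text, " ", fixed = TRUE)
df3 <- data.frame(text = unlist(words),
                  id = rep(mytext$id, lengths(words)),
                  stringsAsFactors = FALSE)

# This is the named vector the question passes to spacy_parse()
docs <- structure(df3$text, names = df3$id)

names(docs)                      # "32" "32" "41" "41" "41"
anyDuplicated(names(docs)) > 0   # TRUE
```

Each name appears once per word, which is presumably what the "Docnames are duplicated" check is complaining about.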

Update

If you have to do the separation, then you need to further modify your id column so that every value in it is unique, for example by appending a per-word counter. Later you can strip that suffix again at the group_by stage to recover the original ids. Consider the following code.

mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
df2 <- tidyr::separate_rows(mytext, text) %>% group_by(id) %>% mutate(id = paste0(id, "-", seq_len(n())))
dflemma <- 
  spacy_parse(structure(df2$text, names = df2$id), lemma = TRUE, pos = FALSE) %>% 
  group_by(id = sub("(.+)-(.+)", "\\1", doc_id)) %>% 
  summarise(text = paste(lemma, collapse = " "))
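
The suffix round-trip can also be checked on its own, without calling spaCy. This is only a sketch in base R: the names within_id, unique_id and restored_id are made up for illustration, and it assumes the suffix is always a hyphen followed by digits.

```r
df3 <- data.frame(
  text = c("test", "text", "section", "2", "sending"),
  id = c(32, 32, 41, 41, 41),
  stringsAsFactors = FALSE
)

# Number the rows within each id: 1, 2, 1, 2, 3
within_id <- ave(seq_along(df3$id), df3$id, FUN = seq_along)

# Append the counter so that every document name is unique
unique_id <- paste0(df3$id, "-", within_id)
anyDuplicated(unique_id) == 0   # TRUE

# Strip the trailing "-<counter>" to get the original id back for grouping
restored_id <- sub("-[0-9]+$", "", unique_id)
restored_id                     # "32" "32" "41" "41" "41"
```

Anchoring the pattern at the end of the string keeps it safe if an original id ever contains a hyphen itself.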

Duck

Try this base R solution on your df3:

#Code
dflemma <- aggregate(text~id,df3,function(x) paste(x,collapse = ' '))

Output:

  id              text
1 32         test text
2 41 section 2 sending