From long to wide format with the same duplicates

Question

126 views Asked by demia At 30 September 2020 at 16:22

I am trying this command:

library("spacyr")
library("dplyr", warn.conflicts = FALSE)

mytext <- data.frame(text = c("test text", "section 2 sending"), 
                     id = c(32,41))
df2 <- tidyr::separate_rows(mytext, text)

df3 <- data.frame(text = df2$text, id = df2$id)

dflemma <- spacy_parse(structure(df3$text, names = df3$id),
                       lemma = TRUE, pos = FALSE)  %>%
    mutate(id = doc_id) %>%
    group_by(id) %>%
    summarize(body = paste(lemma, collapse = " "))

The expected output goes from long format back to wide format: one row per id, with the lemmatized words for that id joined by a space. Here is the expected output:

data.frame(text = c("test text", "section 2 send"), 
                     id = c(32,41))

However, the command produces this error:

Error in process_document(x, multithread) : Docnames are duplicated.

There are 2 answers

Duck On 30 September 2020 at 16:35

Try this base R solution on your df3:

#Code
dflemma <- aggregate(text~id,df3,function(x) paste(x,collapse = ' '))

Output:

  id              text
1 32         test text
2 41 section 2 sending

**ekoam** · Accepted Answer · 2020-09-30T17:42:19+00:00

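You get this error because you split each of your text phrases into separate words. After that, every id is repeated once per word, so the document names you pass to spacy_parse are no longer unique. You shouldn't do that. Consider the following code: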

mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
dflemma <- 
  spacy_parse(structure(mytext$text, names = mytext$id), lemma = TRUE, pos = FALSE) %>% 
  group_by(id = doc_id) %>% 
  summarise(text = paste(lemma, collapse = " "))

Output

> dflemma
# A tibble: 2 x 2
  id    text          
  <chr> <chr>         
1 32    test text     
2 41    section 2 send
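
The duplication behind the error can be checked without running spaCy at all. Here is a quick sketch, assuming tidyr is installed and using the same mytext as in the question:

```r
mytext <- data.frame(text = c("test text", "section 2 sending"),
                     id = c(32, 41))

# One row per word; each id is repeated once per word
df2 <- tidyr::separate_rows(mytext, text)
df2$id
# 32 32 41 41 41

# These ids would become the document names, and they are not unique
any(duplicated(df2$id))
# TRUE
```

If this check returns TRUE for the vector you pass as names, spacy_parse should be expected to complain about duplicated docnames.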

Update

If you have to do the separation, then you need to further modify your id column to ensure that each observation in it is unique. Later you can change those ids back at the group_by stage. Consider the following code.

mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
df2 <- tidyr::separate_rows(mytext, text) %>% 
  group_by(id) %>% 
  mutate(id = paste0(id, "-", seq_len(n())))
dflemma <- 
  spacy_parse(structure(df2$text, names = df2$id), lemma = TRUE, pos = FALSE) %>% 
  group_by(id = sub("(.+)-(.+)", "\\1", doc_id)) %>% 
  summarise(text = paste(lemma, collapse = " "))
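
The id round trip used above (add a suffix to make the names unique, strip it again before grouping) can be tried on its own in base R, without spaCy. This is only a sketch: the ids vector is a hypothetical stand-in for df2$id after the split, and ave() and the stricter pattern "-[0-9]+$" are my substitutions for the group_by/mutate step and the regex in the code above.

```r
# Hypothetical ids as they look after the words have been split apart
ids <- c(32, 32, 41, 41, 41)

# Number the rows within each id (1, 2, 1, 2, 3) and append that as a suffix
unique_ids <- paste0(ids, "-", ave(ids, ids, FUN = seq_along))
unique_ids
# "32-1" "32-2" "41-1" "41-2" "41-3"

# Strip the trailing "-<number>" to get back the original id for grouping
recovered <- sub("-[0-9]+$", "", unique_ids)
recovered
# "32" "32" "41" "41" "41"
```

Note that the recovered ids are character strings, just like the id column in the tibble output above. Wrap them in as.numeric() if you need numeric ids again.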
