Make udpipe_annotate() faster


I am currently working on a text mining project where I want to extract relevant keywords from my text (note that I have a great many text documents).

I am using the udpipe package. A great vignette is available online (http://bnosac.be/index.php/blog/77-an-overview-of-keyword-extraction-techniques). Everything works, but when I run the code, the part

x <- udpipe_annotate(ud_model, x = comments$feedback)

is really, really slow (especially when you have a lot of text). Does anyone have an idea how I can make this part faster? A workaround is of course fine.

library(udpipe)
library(textrank)
## First step: Take the Spanish udpipe model and annotate the text. Note: this takes about 3 minutes

data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback) # This part is really, really slow 
x <- as.data.frame(x)

Many thanks in advance!


There are 3 answers

1
phiver On BEST ANSWER

I'm adding an answer based on the future API. This works independently of which OS (Windows, macOS, or a Linux flavour) you are using.

The future.apply package has parallel alternatives for the whole base *apply family. The rest of the code is based on the answer from @jwijffels. The only difference is that I use data.table in the annotate_splits function.

library(udpipe)
library(data.table)

data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish", overwrite = FALSE)
ud_es <- udpipe_load_model(ud_model)


# returns a data.table
annotate_splits <- function(x, file) {
  ud_model <- udpipe_load_model(file)
  x <- as.data.table(udpipe_annotate(ud_model, 
                                     x = x$feedback,
                                     doc_id = x$id))
  return(x)
}


# load parallel library future.apply
library(future.apply)

# Define cores to be used
ncores <- 3L
# multisession works on every OS; the older multiprocess plan is deprecated in future
plan(multisession, workers = ncores)

# split comments based on available cores
corpus_splitted <- split(comments, seq(1, nrow(comments), by = 100))

annotation <- future_lapply(corpus_splitted, annotate_splits, file = ud_model$file_model)
annotation <- rbindlist(annotation)
3
AudioBubble On

The R package udpipe uses the UDPipe version 1.2 C++ library. Annotation speeds are detailed in the paper (see Table 8 in https://doi.org/10.18653/v1/K17-3009). If you want to speed it up, run it in parallel, as annotation is trivially parallelisable.

The example below parallelises across 16 cores using parallel::mclapply, giving you up to a 16x speedup for large corpora (if you have 16 cores, of course). You can use any parallelisation framework you have; below I used the parallel package. If you are on Windows you would need e.g. parallel::parLapply, but nothing stops you from using other parallel options (snow / multicore / future / foreach /...) to annotate in parallel.

library(udpipe)
library(data.table)
library(parallel)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "fr")
ud_model <- udpipe_download_model(language = "french-partut")

annotate_splits <- function(x, file) {
  model <- udpipe_load_model(file)
  x <- udpipe_annotate(model, x = x$feedback, doc_id = x$id, tagger = "default", parser = "default")
  as.data.frame(x, detailed = TRUE)
}

corpus_splitted <- split(comments, seq(1, nrow(comments), by = 100))
annotation <- mclapply(corpus_splitted, FUN = annotate_splits,
                       file = ud_model$file_model, mc.cores = 16)
annotation <- rbindlist(annotation)
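
On Windows, mclapply cannot fork, so the same idea has to go through a socket cluster. Below is a minimal sketch of that with parallel::parLapply. The names par_chunks and count_rows are hypothetical and only there for illustration, and the toy data frame stands in for comments so the snippet runs without downloading a model.

```r
library(parallel)

# Hypothetical helper: run `fun` on every chunk using a socket (PSOCK) cluster,
# which also works on Windows. `packages` are attached on each worker first.
par_chunks <- function(chunks, fun, ncores = 2L, packages = character(), ...) {
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl), add = TRUE)  # always shut the workers down
  clusterCall(cl, function(pkgs) {
    for (p in pkgs) library(p, character.only = TRUE)
    NULL
  }, packages)
  parLapply(cl, chunks, fun, ...)
}

# Toy stand-ins (assumptions): a small data frame shaped like `comments`
# and a worker function that only counts the rows of its chunk.
toy <- data.frame(id = 1:10, feedback = paste("text", 1:10))
chunks <- split(toy, rep(1:2, length.out = nrow(toy)))
count_rows <- function(chunk) data.frame(rows = nrow(chunk))

res <- par_chunks(chunks, count_rows, ncores = 2L)
```

For the real annotation you would presumably call par_chunks(corpus_splitted, annotate_splits, ncores = 4L, packages = "udpipe", file = ud_model$file_model), so that each worker attaches udpipe and loads the model from the file itself.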

Note that udpipe_load_model also takes some time, so a better strategy is probably to split the corpus into one chunk per core on your machine, so that each core loads the model only once, instead of using chunks of 100 as I showed above.
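
A rough sketch of what splitting per core could look like, using base R only. The helper name split_evenly is hypothetical, and the toy data frame again stands in for comments.

```r
# Hypothetical helper: split a data frame into `n` contiguous chunks of
# near-equal size, so that each worker gets exactly one chunk.
split_evenly <- function(df, n) {
  n <- min(n, nrow(df))                              # never more chunks than rows
  groups <- ceiling(seq_len(nrow(df)) * n / nrow(df))  # chunk index per row
  split(df, groups)
}

# Toy stand-in for `comments` (assumption, for illustration only)
toy <- data.frame(id = 1:10, feedback = paste("text", 1:10))
chunks <- split_evenly(toy, 3)
sizes <- vapply(chunks, nrow, integer(1))
```

With the real data you could then use something like corpus_splitted <- split_evenly(comments, parallel::detectCores()) and pass the result to mclapply as before, so the model is loaded once per core rather than once per 100 documents.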

0
NovaEthos On

You can also accomplish this using the furrr and future libraries, which have the added bonus of a progress bar.

One thing I am confused about in the two other answers is why they call udpipe_load_model inside their functions. You can load the model once, outside the function, so that the function doesn't have to load it each time it runs.

library(udpipe)
library(future)
library(furrr)
data(brussels_reviews)

comments <- subset(brussels_reviews, language %in% "es")
downloaded_model <- udpipe_download_model(language = "spanish", overwrite = FALSE)
model <- udpipe_load_model(downloaded_model)

annotate_splits <- function(text) {
  anno <- udpipe_annotate(model, x = text$feedback, doc_id = text$id, tagger = "default", parser = "default")
  x <- as.data.frame(anno, detailed = TRUE)
  return(x)
}

split_corpus <- split(comments, seq(1, nrow(comments), by = 100))

# Recommended: set workers equal to the number of cores on your computer
plan(multisession, workers = 2)
dfs <- future_map(split_corpus, annotate_splits, .progress = TRUE)

annotated_df <- dplyr::bind_rows(dfs)