How do I identify what is causing thrashing in my R function?


I wrote a function to anonymize names in a data frame given a key. It slows to a crawl once it has anonymized a large number of names, and I don't understand why.

The data frame in question is a set of 4733 tweets collected through the Twitter API, where each row is a tweet with 32 columns of data. The names are to be anonymized regardless of which column they show up in, so I'd like to not limit the function to looking at only a couple of those 32 columns.

The key is a data frame containing 211121 pairs of real and fake names, both real and fake being unique in the data frame. The function slows down immensely after about 100k names are anonymized.

The function looks like the following:

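pseudonymize <- function(df, key) {
  for(name in key$realNames) {
    # Each pass rebuilds the whole data frame and treats `name` as a regular expression
    df <- as.data.frame(apply(df, 2, function(column) gsub(name, key[key$realNames == name, 2], column)))
  }
  df  # return the result; a for loop on its own evaluates to NULL
}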

Is there some obvious thing here that would cause the slowing? I'm not at all experienced with optimizing code for speed.

EDIT1:

Here are a few lines from the data frame to be anonymized.

"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"@jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","indefini","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",0,1917,0,8,8,0,9,9,16,0.143476044852192,0.162056634159209,0.000172947386274259,0,0,"@abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",8366,392,661,"Canada","Nouvelle-Ecosse","indefini","indefini","Shari"
"https://twitter.com/_brookehynes/statuses/821022926287884288","_brookehynes",0,1917,1,6,7,1,7,8,1,1,1,0.000196850793912616,0.00393656926735126,0.200000002980232,"@tdesj3 @belle lol yea doubt it.","lol","adjoint","indefini","anglais","anglais","anglais","non","iPhone, Twitter",1184,87,70,"Canada","Nouvelle-Ecosse","Halifax","indefini","Shari"

Here are a few lines from the key.

"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"

EDIT2:

I've simplified the DF down to only the two columns that would need anonymizing, and this made things much faster, but it still putters out after doing about 155k names.

As requested in the comments, here's the dput() output for the first three lines of the DF that's to be anonymized.

structure(list(
  utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
  texte = c("@EmilyIsPro ik lol", "@NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", "@NikkiErica21 lol yes _Ã\231։")
  ),
  row.names = c(NA, 3L),
  class = "data.frame")

And here's the dput() for the first three lines of the key.

structure(list(
  realNames = c("________", "____________aho", "___________ass"),
  fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker")
  ),
  row.names = c(NA, 3L),
  class = "data.frame")

There is 1 answer

CSJCampbell

Acting on the data as a vector rather than a data.frame will be much more efficient. I ran into some encoding issues, so I converted the text to UTF-8 using iconv; if the names contain non-ASCII characters, this would need some handling.

key1 <- data.frame(
    realNames = c("________", "____________aho", "___________ass", 
        "___Yeliab", "__courtlezz", "NikkiErica21", "EmilyIsPro", "aho"),
    fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker", 
        "A_A", "B_B", "C_C", "D_D", "E_E"),
    stringsAsFactors = FALSE
)

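pseudonymize1 <- function(df, key) {
    # Flatten the data frame to a character vector, remembering its shape
    mat <- as.matrix(df)
    dims <- attr(mat, which = "dim")
    cnam <- colnames(df)
    vec <- iconv(unclass(mat), from = "latin1", to = "UTF-8")
    # One fixed-string (non-regex) replacement pass per row of the key
    for (name in split(key, f = seq_len(nrow(key)))) {
        vec <- gsub(
            vec, 
            pattern = name$realNames, 
            replacement = name$fakeNames, 
            fixed = TRUE)
    }
    # Restore the original dimensions and column names
    mat <- vec
    attr(mat, which = "dim") <- dims
    df <- as.data.frame(mat, stringsAsFactors = FALSE)
    colnames(df) <- cnam
    df
}
# df1 is the three-row data frame from the question's dput() output
pseudonymize1(df1, key1)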
# utilisateur                                                                       texte
# 1         A_A                                                                 @D_D ik lol
# 2         B_B @C_C there was a sighting in sunset ridge too. Keep Winnie and bob safe lol
# 3         B_B                               @C_C lol yes _Ã\u0083\u0099Ã\u0083·Ã\u0083¢

library(microbenchmark)    
microbenchmark(
    pseudonymize(df1, key1),
    pseudonymize1(df1, key1)
)
# Unit: microseconds
#                     expr      min        lq     mean   median        uq      max neval cld
#  pseudonymize(df1, key1) 1842.554 1885.6750 2131.089 1994.755 2294.6850 3007.371   100   b
# pseudonymize1(df1, key1)  287.683  306.1905  333.678  314.950  339.8705  497.301   100  a 
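
To see where the slowdown actually begins, one option is to time the replacement loop on progressively larger slices of the key. The sketch below uses synthetic data and a hypothetical helper, replace_all(); neither comes from the question. If elapsed time grows roughly in proportion to the number of names, the loop is simply doing a lot of work; a sharp jump at some size would instead point towards memory pressure, which could then be examined further with utils::Rprof(memory.profiling = TRUE).

```r
set.seed(1)

# Synthetic stand-ins for the real key and tweets (assumptions for illustration).
n_key <- 2000
key <- data.frame(
    realNames = paste0("user", seq_len(n_key), "_x"),
    fakeNames = paste0("fake", seq_len(n_key)),
    stringsAsFactors = FALSE
)
texts <- paste0("@", sample(key$realNames, 500, replace = TRUE), " lol")

# Hypothetical helper: one fixed-string gsub pass per key row.
replace_all <- function(vec, key) {
    for (i in seq_len(nrow(key))) {
        vec <- gsub(key$realNames[i], key$fakeNames[i], vec, fixed = TRUE)
    }
    vec
}

# Time the same loop on growing prefixes of the key.
sizes <- c(250, 500, 1000, 2000)
elapsed <- vapply(sizes, function(n) {
    system.time(replace_all(texts, key[seq_len(n), ]))[["elapsed"]]
}, numeric(1))

data.frame(n_names = sizes, elapsed = elapsed)
```

On the real data the same idea applies with larger slice sizes, for example every 25k names up to the point where the function stalls.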

A concern I have with 155k names is that when searching by pattern (regular expression or fixed string) you will also match names contained in other names. This could be a true name within another true name (e.g. Emily within EmilyIsPro), or a true name within a previously inserted fake name. You will want to test for this, and consider using a random hash instead of a name-like fake name.
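
The name-within-name problem can be reproduced on a toy example. The sketch below is not the method used above: it first shows how sequential substring replacement depends on key order, and then tries a different technique, extracting whole handle-like tokens and looking each one up in a named vector. The key, the texts and the handle pattern `[A-Za-z0-9_]+` are assumptions made for illustration.

```r
# Toy key in which one real name is a prefix of another (illustrative data).
key <- data.frame(
    realNames = c("EmilyIsPro", "Emily"),
    fakeNames = c("D_D", "F_F"),
    stringsAsFactors = FALSE
)
text <- c("@EmilyIsPro ik lol", "i miss @Emily")

# Sequential substring replacement: processing "Emily" first corrupts
# "EmilyIsPro" before its own key row is reached.
naive <- text
for (i in rev(seq_len(nrow(key)))) {
    naive <- gsub(key$realNames[i], key$fakeNames[i], naive, fixed = TRUE)
}
naive[1]
# "@F_FIsPro ik lol"

# Alternative: replace whole tokens only, via a named lookup vector.
lookup <- setNames(key$fakeNames, key$realNames)
handle_pattern <- "[A-Za-z0-9_]+"   # assumed definition of a handle-like token
m <- gregexpr(handle_pattern, text)
whole <- text
regmatches(whole, m) <- lapply(regmatches(text, m), function(tok) {
    hit <- tok %in% names(lookup)
    tok[hit] <- unname(lookup[tok[hit]])
    tok
})
whole
# "@D_D ik lol" "i miss @F_F"
```

This makes a single pass over the text rather than one pass per name, but it replaces any token that equals a real name, so an ordinary word that is also someone's handle would be rewritten too.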