How to keep duplicate names from being altered with sapply?

I have a text vector with the names of drugs already registered, and another with the names of new drugs. I want to know whether the new drugs look like an already existing drug or not.

For example, if supercure is a drug that can be produced by either firm1 or firm2, and supercure firm1 1000mg and supercure firm2 500mg are already registered, then supercure firm1 500mg should be associated with both of them.

agrep does this kind of approximate matching in R, and sapply applies it to every drug in the new list:

new <- c("supercure firm1 500mg", "randomcure firm2 1000mg", "unknowncure firm2 100mg")
registered <- c("supercure firm1 1000mg", "supercure firm2 500mg", "randomcure firm1 1000mg")
res <- unlist(sapply(new, agrep, x = registered))
res

As expected, supercure gets two matches, randomcure one match, and unknowncure no match (which is what I want). However, sapply appears to have altered the names so that there are no duplicates: supercure firm1 500mg became supercure firm1 500mg1 and supercure firm1 500mg2:

supercure firm1 500mg1   supercure firm1 500mg2 randomcure firm2 1000mg 
                    1                       2                       3 

This is a problem because it prevents me from selecting the matched drugs from the new list:

new[new %in% names(res)] only catches randomcure (because supercure's name has been altered).

I can think of ways to fix this with some rather graceless text processing, but is there a cleverer way to get the list of new drugs that found a match?

The ideal output would be:

supercure firm1 500mg   supercure firm1 500mg randomcure firm2 1000mg 
                    1                       2                       3 

There are 2 answers

moodymudskipper (BEST ANSWER)

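sapply didn't alter the names; unlist did. When a named list element holds more than one value, unlist appends a sequence number to its name so that the resulting names stay distinct. Rebuilding the names with rep gives the desired output: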

x <- sapply(new, agrep, x = registered)
setNames(unlist(x), rep(names(x), lengths(x)))
#  supercure firm1 500mg   supercure firm1 500mg randomcure firm2 1000mg 
#                      1                       2                       3
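
If the goal is only the list of new drugs that found a match, the unnamed list returned by sapply may already be enough, without going through names at all. A minimal sketch, assuming the same `new` and `registered` vectors as in the question (`matched` is just an illustrative variable name):

```r
new <- c("supercure firm1 500mg", "randomcure firm2 1000mg", "unknowncure firm2 100mg")
registered <- c("supercure firm1 1000mg", "supercure firm2 500mg", "randomcure firm1 1000mg")

# sapply cannot simplify results of different lengths, so x is a named list
# with one element per new drug, holding the indices of its matches
x <- sapply(new, agrep, x = registered)

# Keep the new drugs whose match vector is non-empty
matched <- new[lengths(x) > 0]

# Named vector with the original (duplicated) names, as in the answer above
res <- setNames(unlist(x, use.names = FALSE), rep(names(x), lengths(x)))

# With intact names, the selection from the question works again
matched_by_name <- new[new %in% names(res)]
```

Passing `use.names = FALSE` to unlist skips the name mangling altogether, since the names are rebuilt by rep anyway.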
Sotos

You can turn it into a data frame, stack it, and use setNames to make it a named vector (note that data.frame replaces the spaces in the names with dots), i.e.

d1 <- unique(stack(data.frame(Filter(length, sapply(new, agrep, x = registered)))))
#  values                     ind
#1      1   supercure.firm1.500mg
#2      2   supercure.firm1.500mg
#3      3 randomcure.firm2.1000mg

setNames(d1$values, d1$ind)
#  supercure.firm1.500mg   supercure.firm1.500mg randomcure.firm2.1000mg 
#                      1                       2                       3
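
As a possible variation on this approach, stack can also be applied to the list directly, skipping the data.frame step. That step recycles the shorter match vectors (hence the unique call) and rewrites the names with dots. The sketch below assumes the same `new` and `registered` vectors as in the question; `d2` and `res2` are illustrative names only.

```r
new <- c("supercure firm1 500mg", "randomcure firm2 1000mg", "unknowncure firm2 100mg")
registered <- c("supercure firm1 1000mg", "supercure firm2 500mg", "randomcure firm1 1000mg")

# Named list of match indices, one element per new drug
x <- sapply(new, agrep, x = registered)

# Drop the drugs without a match, then stack the list into a
# two-column data frame: values (match index) and ind (drug name)
d2 <- stack(Filter(length, x))

# ind is a factor, so convert it to character before using it as names
res2 <- setNames(d2$values, as.character(d2$ind))
```

Because no intermediate data frame is built from the list, the names keep their spaces and `new[new %in% names(res2)]` should select the matched drugs as intended.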