Creating a new column when two columns satisfy certain conditions in R

62 views Asked by At

My data is like this:

country    supporter1   supporter2   supporter3  supporter4    supporter5    
USA           Albania     Germany        USA           NA           NA
France        USA         France         NA            NA           NA
UK            UK          Chile          Peru          NA           NA
Germany       USA         Iran           Mexico        India        Pakistan
USA           China       Spain          NA            NA           NA
Cuba          Cuba        UK             Germany       South Korea  NA
China         Russia      NA             NA            NA           NA

What I want to do is to create a new variable when the country column and one of the remaining supporter columns (supporter 1, supporter 2, supporter 3, supporter 4, and supporter 5) are the same (for instance country France and supporter2 France are the same). In this case, the new variable should take 1, 0 otherwise.

I expect to have this:

country    supporter1   supporter2   supporter3  supporter4    supporter5      new variable  
USA          Albania     Germany        USA           NA           NA               1
France       USA         France         NA            NA           NA               1
UK           UK          Chile          Peru          NA           NA               1
Germany      USA         Iran           Mexico        India        Pakistan         0
USA          China       Spain          NA            NA           NA               0         
Cuba         Cuba        UK             Germany       South Korea  NA               1
China        Russia      NA             NA            NA           NA               0
3

There are 3 answers

2
TarJae On BEST ANSWER

Update dplyr only solution Using if_any:

library(dplyr)
df %>% 
  rowwise() %>% 
  mutate(new_var = as.integer(as.logical(if_any(starts_with("supporter"), ~ . %in% country))))
  country supporter1 supporter2 supporter3 supporter4  supporter5 new_var
  <chr>   <chr>      <chr>      <chr>      <chr>       <chr>        <int>
1 USA     Albania    Germany    USA        NA          NA               1
2 France  USA        France     NA         NA          NA               1
3 UK      UK         Chile      Peru       NA          NA               1
4 Germany USA        Iran       Mexico     India       Pakistan         0
5 USA     China      Spain      NA         NA          NA               0
6 Cuba    Cuba       UK         Germany    South Korea NA               1
7 China   Russia     NA         NA         NA          NA               0

First answer: also correct: Here is one possible solution:

  1. calculate rowwise
  2. check in cols supporter1 to supporter5 if country is included
  3. unite all new columns to one and with an ifelse statement take 1 or 0
library(dplyr)
library(stringr)
library(tidyr)

df %>% 
  rowwise() %>% 
  mutate(across(supporter1:supporter5, ~ifelse(. %in% country, 1,0), .names = "new_{col}")) %>% 
  unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ') %>% 
  mutate(New_Col = ifelse(str_detect(New_Col,  "1"), 1,0))
  country supporter1 supporter2 supporter3 supporter4  supporter5 New_Col
  <chr>   <chr>      <chr>      <chr>      <chr>       <chr>        <dbl>
1 USA     Albania    Germany    USA        NA          NA               1
2 France  USA        France     NA         NA          NA               1
3 UK      UK         Chile      Peru       NA          NA               1
4 Germany USA        Iran       Mexico     India       Pakistan         0
5 USA     China      Spain      NA         NA          NA               0
6 Cuba    Cuba       UK         Germany    South Korea NA               1
7 China   Russia     NA         NA         NA          NA               0
0
CIAndrews On

Another solution is checking per row whether country is present in one of the columns:

df <- data.frame(country=c("USA","France","UK","Germany","USA","Cuba","China"),
supporter1=c("Albania","USA","UK","USA","China","Cuba","Russia"),
supporter2=c("Germany","France","Chile","Iran","Spain","UK","NA"),  
supporter3=c("USA","NA","Peru","Mexico","NA","Germany","NA"),
supporter4=c("NA","NA","NA","India","NA","South Korea","NA"),   
supporter5=c("NA","NA","NA","Pakistan","NA","NA","NA"))

That would give:

df$new <- sapply(seq(1,nrow(df)), function(x) ifelse(df$country[x] %in% df[x,2:6],1,0))
> df$new
[1] 1 1 1 0 0 1 0
0
Rui Barradas On

Here is a base R solution.
First mapply checks for equality of suporter* and country. NA's are considered to return FALSE. Then as.integer/rowSums transforms rows with at least one TRUE into 1, otherwise 0.

eq <- mapply(\(x, y){x == y & !is.na(x)}, df1[-1], df1[1])
as.integer(rowSums(eq) != 0)
#[1] 1 1 1 0 0 1 0

df1$new_variable <- as.integer(rowSums(eq) != 0)

Data

df1 <- read.table(text = "
country    supporter1   supporter2   supporter3  supporter4    supporter5    
USA           Albania     Germany        USA           NA           NA
France        USA         France         NA            NA           NA
UK            UK          Chile          Peru          NA           NA
Germany       USA         Iran           Mexico        India        Pakistan
USA           China       Spain          NA            NA           NA
Cuba          Cuba        UK             Germany       'South Korea'  NA
China         Russia      NA             NA            NA           NA
", header = TRUE)