Data Management and Coding in R

I have two questions. The first is about data management and the second is about creating a new variable. My data is already structured, but I am not sure what the correct R code is.

I am looking at congressional committee data. My unit of analysis is each congressman and the committee they sat on during a congress. For example, if Congressman A sat on Appropriations and Ways and Means for three congresses, that would be a total of 6 observations.

First, I want to create a data set that contains only the committees a member transferred to. To do that, I would like to remove all observations that pertain to a committee the member was assigned at the start of their first term in Congress.

Second, once my data set contains only the committees a member transferred to after their first term in Congress, I need to create a new variable. In this variable, a member should receive a 1 in the observation for the last congress in which they served on that committee. All other observations for that member and committee (that is, those that are not their last congress on it) should receive a 0.

For example, I would like for this:

data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 5L),
  Cong = c(52L, 53L, 54L, 53L, 50L, 50L, 48L, 48L, 48L, 49L, 47L, 48L, 49L, 49L),
  Comm = c(3L, 3L, 3L, 4L, 2L, 7L, 4L, 3L, 7L, 7L, 3L, 6L, 6L, 8L)
)

ID  Cong  Comm
1    52    3
1    53    3
1    54    3
1    53    4
2    50    2
2    50    7
3    48    4
4    48    3
4    48    7
4    49    7
5    47    3
5    48    6
5    49    6
5    49    8

To look like this:

ID  Cong  Comm  Y
1    53    4    1
5    48    6    0
5    49    6    1
5    49    8    1

For example, all of ID 1's observations for Comm 3 were dropped because he was assigned to that committee during his first term in Congress. Y is the new variable I need to create.

ID is the member, Cong is the congress they are serving in, and Comm is the committee they sit on. (BTW, Comm is actually a categorical variable.)

I can probably figure out the new variable (Y) on my own, but I am having trouble creating the new data frame that separates the committees. I apologize for any confusion and greatly appreciate any help.

Answer by alexizydorczyk (best answer)

If I am understanding your question correctly, then here is a potential quick solution with plyr.

library(plyr)

x <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 5L),
  Cong = c(52L, 53L, 54L, 53L, 50L, 50L, 48L, 48L, 48L, 49L, 47L, 48L, 49L, 49L),
  Comm = c(3L, 3L, 3L, 4L, 2L, 7L, 4L, 3L, 7L, 7L, 3L, 6L, 6L, 8L))

result <- ddply(x, "ID", .fun = function(congressman) {

  # Find the congressman's first term (the lowest congress number)
  first_term <- min(congressman$Cong)

  # Find the committees he/she served on during that term
  first_terms_committees <- congressman$Comm[congressman$Cong == first_term]

  # Flag the rows in which those committees appear
  to_remove <- congressman$Comm %in% first_terms_committees

  # Keep only the other rows. A logical index stays correct even when nothing
  # is flagged, whereas df[-which(cond), ] returns zero rows in that case.
  congressman[!to_remove, , drop = FALSE]
})

It splits up your data by congressman. For each congressman, it finds the first term and then all the committees he or she served on during that term. Finally, it removes every row for that congressman in which one of those first-term committees appears.
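The code above covers only the first step of the question. Below is a minimal sketch of both steps, written in base R rather than plyr so that it runs without extra packages. It assumes that "last congress on a committee" means the highest Cong value for that ID and Comm pair in the filtered data. The helper names (pair_key, first_cong, last_cong) are illustrative and do not come from the question.

```r
# Example data from the question
x <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 5L),
  Cong = c(52L, 53L, 54L, 53L, 50L, 50L, 48L, 48L, 48L, 49L, 47L, 48L, 49L, 49L),
  Comm = c(3L, 3L, 3L, 4L, 2L, 7L, 4L, 3L, 7L, 7L, 3L, 6L, 6L, 8L)
)

# Step 1: drop every row for a committee the member held in their first congress
first_cong <- ave(x$Cong, x$ID, FUN = min)        # first congress per member
pair_key <- paste(x$ID, x$Comm)                   # one key per member/committee pair
first_term_keys <- pair_key[x$Cong == first_cong] # pairs held in the first congress
result <- x[!(pair_key %in% first_term_keys), ]
rownames(result) <- NULL

# Step 2: Y = 1 in the last congress a member served on a given committee
# (assumption: "last" is the maximum Cong within each ID/Comm pair)
last_cong <- ave(result$Cong, paste(result$ID, result$Comm), FUN = max)
result$Y <- as.integer(result$Cong == last_cong)

result
```

On the example data this gives the four rows shown in the question, with Y equal to 1, 0, 1, 1. Under this definition, a member who is still on a committee in the final congress covered by the data also gets Y = 1 for that row, so you may need to handle that case separately.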