Just came across a .do file that I need to translate into R because I don't have a Stata license. My Stata is rusty, so can someone confirm that the code is doing what I think it is?
For reproducibility, I'm going to apply it to a data set I found online, specifically the Milk Production dataset (p004) that's part of a textbook by Chatterjee, Hadi and Price.
Here's the Stata code:
collapse (min) min_protein = protein ///
         (mean) avg_protein = protein ///
         (median) median_protein = protein ///
         (sd) sd_protein = protein ///
         if protein > 2.8, by(lactatio)
Here's what I think it's doing in data.table syntax:
library(data.table)
library(foreign)
DT = read.dta("p004.dta")
setDT(DT)
DT[protein > 2.8,
   .(min_protein = min(protein),
     avg_protein = mean(protein),
     median_protein = median(protein),
     sd_protein = sd(protein)),
   keyby = lactatio]
#    lactatio min_protein avg_protein median_protein sd_protein
# 1:        1         2.9    3.162632           3.10  0.2180803
# 2:        2         2.9    3.304688           3.25  0.2858736
# 3:        3         2.9    3.371429           3.35  0.4547672
# 4:        4         2.9    3.231250           3.20  0.3419917
# 5:        5         2.9    3.855556           3.20  1.9086061
# 6:        6         3.0    3.200000           3.10  0.2645751
# 7:        7         3.3    3.650000           3.65  0.4949748
# 8:        8         3.2    3.300000           3.30  0.1414214
Is that correct?
This would be easy to confirm if I had used Stata in the past 18 months or if I had a copy installed--hoping I can bend the ear of someone for whom either of these is true. Thanks.
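For anyone without the p004 file handy, the same filter-then-summarise logic can be sketched in base R on a made-up table whose answers are easy to work out by hand. The `toy` data frame and its values are hypothetical (not taken from the milk data), and the sketch assumes that the `if` condition drops rows before the group statistics are computed:

```r
# Hypothetical toy data, NOT from the p004 milk dataset
toy <- data.frame(
  lactatio = c(1, 1, 1, 2, 2, 2),
  protein  = c(2.5, 3.0, 3.4, 2.9, 3.1, 2.7)
)

# Assumption: rows are filtered first (protein > 2.8), then summarised by group
kept <- subset(toy, protein > 2.8)

# aggregate() with a vector-valued FUN returns a matrix column named "protein"
res <- aggregate(
  protein ~ lactatio,
  data = kept,
  FUN  = function(x) c(min = min(x), avg = mean(x), median = median(x), sd = sd(x))
)

# Flatten the matrix column into ordinary columns:
# protein.min, protein.avg, protein.median, protein.sd
res <- do.call(data.frame, res)
print(res)
```

Running the data.table call above on `toy` (after `setDT(toy)`) should give the same per-group numbers if the translation is right.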
Your intuition is correct.
collapse is the Stata equivalent of R's aggregate function, which produces a new dataset from an input dataset by applying an aggregating function (or multiple aggregating functions, one per variable) to every variable in a dataset.
Here's the output for that Stata command on the example dataset: