I'm running t-tests across many grouping variables (markers), each of which has only two groups (0 or 1). In the complete data there are a million grouping variables, i.e. n_obs = 1e+06 and n_vals = 300, with ~5% NA.
> n_obs = 1e+04 # to simulate grouping matrix
> n_vals = 100
> g = matrix(sample(0:1, n_obs * n_vals, replace=TRUE), n_obs, n_vals)
> row.names(g) = paste("marker", 1:nrow(g), sep="")
> colnames(g) = paste("country", 1:ncol(g), sep="")
> g[1:2, 1:5]
country1 country2 country3 country4 country5
marker1 1 1 1 1 0
marker2 1 0 0 0 0
> vals = rnorm(n_vals); names(vals) = colnames(g) # to simulate values
> head(vals)
country1 country2 country3 country4 country5 country6
-0.4048584 0.2792725 0.4064460 0.9002677 0.2187961 0.2141666
> res = apply(g, 1, function(x) t.test(vals ~ x)) ## applying the t-tests. Quite slow.
> library(broom) ## tidy() is broom::tidy
> tres = do.call(rbind, lapply(res, tidy)) ## tidying the t-tests. Very slow :(
> head(tres)
estimate estimate1 estimate2 statistic p.value parameter conf.low
marker1 -0.03560203 -0.07373907 -0.03813704 -0.17495425 0.8615063 90.52404 -0.4398452
marker2 0.27284988 0.07194537 -0.20090451 1.33127950 0.1863794 92.20240 -0.1341928
Because tidy() is so slow with larger data sets, I was thinking of doing the t-test in separate parts, looping through 'g' row by row and computing each component of the test (group means, variances, and sizes) myself.
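For reference, this is the decomposition I mean for a single marker; a sketch built from Welch's formulas (which is what t.test() uses by default), so the per-row pieces I need are just the group means, variances, and sizes:

> i1 = g[1, ] == 1                              ## group membership for marker 1
> m0 = mean(vals[!i1]); m1 = mean(vals[i1])     ## group means
> v0 = var(vals[!i1]);  v1 = var(vals[i1])      ## group variances
> n0 = sum(!i1); n1 = sum(i1)                   ## group sizes
> se = sqrt(v0/n0 + v1/n1)                      ## Welch standard error
> t_stat = (m0 - m1) / se                       ## should match t.test(vals ~ g[1, ])$statistic
> df = se^4 / ((v0/n0)^2/(n0-1) + (v1/n1)^2/(n1-1))  ## Welch-Satterthwaite df
> p_val = 2 * pt(-abs(t_stat), df)              ## two-sided p-value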
I can 'split' the values for the first marker, and then get the means for each group:
> mysplit = split(vals, g[1, ])
> lapply(mysplit, mean)
$`0`
[1] -0.07373907
$`1`
[1] -0.03813704
How can I 'loop' through all of the rows of 'g', getting the sums of 'vals' for each group, then the standard deviations, and so on?
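So far the furthest I've got is to exploit the 0/1 structure of 'g' with a matrix product, which gives the group sums (and sums of squares) for every row at once; a sketch, which I haven't checked at full scale:

> s1 = as.vector(g %*% vals)          ## per-marker sum of vals where g == 1
> s0 = sum(vals) - s1                 ## per-marker sum where g == 0
> n1 = rowSums(g); n0 = ncol(g) - n1  ## group sizes per marker
> m1 = s1/n1; m0 = s0/n0              ## group means per marker
> q1 = as.vector(g %*% vals^2)        ## group-1 sums of squares, for the sds
> v1 = (q1 - n1*m1^2)/(n1 - 1)        ## group-1 variances
> v0 = (sum(vals^2) - q1 - n0*m0^2)/(n0 - 1)  ## group-0 variances

But I'm not sure whether this is the right approach, or whether a plain row-by-row loop would be cleaner or faster.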
I'm trying to keep functions simple for speed.