Logistic Lasso on large gene dataset specifically through the Knockoff package in R


This question is perhaps in an uncanny valley between CrossValidated and StackOverflow, as I'm trying to understand the methodology of functions in an R package, in the context of executing them properly.

The data are gene expression measurements with thousands of variables. The outcome is a binary variable.

I have managed to get a logistic lasso from glmnet but I've been asked to do it again with the knockoff package: https://cran.r-project.org/web/packages/knockoff/index.html

The problem is, if I've understood the vignettes correctly, the choices in the package are (a) assume the response variable is normal (not true lol) or (b) pre-specify the distribution, mu, and sigma of the predictors. Perhaps it is my inexperience showing, but I don't feel confident that this dataset fits either of those options. Those who have tasked me with this cryptically insist that the package works for logistic lasso, though.

Am I missing something? How would one go about doing a logistic lasso on a massive dataset using knockoff?


There is 1 answer

Answer by rw2

I found this quite tricky to work out, but I think the code below should work with a logistic lasso model. The knockoff package has a stat.lasso_coefdiff_bin function, which fits a cross-validated logistic lasso (through glmnet) on the original predictors together with their knockoff copies. It then computes one knockoff statistic per variable as the difference between the absolute coefficient of the original predictor and that of its knockoff.

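library(knockoff)
library(glmnet)   # used internally by stat.lasso_coefdiff_bin
library(mlbench)  # provides the example dataset

set.seed(1)  # cross-validation folds are random, so fix the seed

# Prepare the example data: drop rows with missing values
data("PimaIndiansDiabetes2", package = "mlbench")
df <- na.omit(PimaIndiansDiabetes2)
X <- as.matrix(df[, -ncol(df)])          # predictors (every column except the last)
y <- ifelse(df$diabetes == "pos", 1, 0)  # binary outcome coded as 0/1

# Select variables with the knockoff filter and a logistic lasso statistic
selected_vars <- knockoff.filter(X, y,
                                 knockoffs = create.fixed,
                                 statistic = stat.lasso_coefdiff_bin,
                                 fdr = 0.5)
print(selected_vars)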
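To see what the knockoff.filter call above does under the hood, it can be unpacked into three steps: build the knockoffs, compute the statistics W, and threshold them. This is only a sketch of my understanding, reusing the same Pima data; the variable names (ko, W, thres, selected) are my own, and the seed is arbitrary.

```r
library(knockoff)
library(mlbench)

set.seed(1)  # arbitrary seed; cross-validation folds are random

data("PimaIndiansDiabetes2", package = "mlbench")
df <- na.omit(PimaIndiansDiabetes2)
X <- as.matrix(df[, -ncol(df)])
y <- ifelse(df$diabetes == "pos", 1, 0)

# Step 1: build fixed-X knockoff copies of the predictors.
# create.fixed returns a list holding the (normalized) X and its knockoffs Xk.
ko <- create.fixed(X)

# Step 2: one statistic per variable,
# W_j = |coefficient of X_j| - |coefficient of its knockoff|,
# taken from a cross-validated logistic lasso on cbind(X, Xk).
W <- stat.lasso_coefdiff_bin(ko$X, ko$Xk, y)

# Step 3: data-dependent threshold for the target FDR, then select.
thres <- knockoff.threshold(W, fdr = 0.5, offset = 1)
selected <- which(W >= thres)

print(W)
print(selected)
```

Looking at W directly is handy for diagnosing an empty selection: if few statistics are clearly positive, the threshold comes out as Inf and nothing is selected.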

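With the Pima example data I could only get knockoff.filter to select any variables by setting the target false discovery rate (fdr) quite high. I'm not sure what an appropriate value is. It will probably depend on your data, so I suggest starting with something lower, such as the package default of 0.1. You will also need to choose the function that generates your knockoffs. I've used create.fixed, but that requires more observations than variables, so it will not work with thousands of genes, and as far as I understand its guarantees are derived for a linear model with Gaussian noise rather than a logistic one. The model-based alternatives are create.gaussian, which needs you to supply mu and Sigma for the predictors, and create.second_order, which estimates them from the data and is the default in knockoff.filter.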
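For the setting in the question, with many more variables than samples, a sketch using model-based knockoffs might look like the following. The data here are simulated stand-ins (the dimensions, effect sizes, gene names, and the make_knockoffs wrapper are all my own assumptions), and I have not tried this on real expression data. I pass shrink = TRUE so that the covariance estimate stays positive definite when p > n (this relies on the corpcor package being installed), and method = "equi" because it avoids solving a large optimization problem.

```r
library(knockoff)

set.seed(2024)  # arbitrary seed for reproducibility

# Simulated stand-in for expression data: n samples, p > n features
n <- 100
p <- 300
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("gene", seq_len(p))  # hypothetical feature names

# Binary outcome that depends on the first 10 features only
beta <- c(rep(1.5, 10), rep(0, p - 10))
y <- rbinom(n, 1, as.vector(plogis(X %*% beta)))

# Model-X knockoffs: mean and covariance are estimated from X itself.
# shrink = TRUE uses a shrinkage covariance estimate (needs corpcor).
make_knockoffs <- function(X) {
  create.second_order(X, method = "equi", shrink = TRUE)
}

result <- knockoff.filter(X, y,
                          knockoffs = make_knockoffs,
                          statistic = stat.lasso_coefdiff_bin,
                          fdr = 0.1)

print(result$selected)   # indices (and names) of the selected features
print(result$threshold)  # Inf means nothing passed the threshold
```

On real data the "equi" construction can be conservative, so it may be worth comparing against the default method if the run time is acceptable.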

Hope that helps