Say I have a simple DAG with a confounding variable X = Smoking, a treatment T, and an outcome Y = Death, such that:
T ~ X
Y ~ T + X
Is it possible to produce a synthetic dataset of, say, 1 million observations that follows some specified conditional probabilities? For example:
library(dplyr)

# Pr(smoking):
smoking <- data.frame(
  smoking = c(0, 1),
  proba = c(0.7, 0.3)
)

# Pr(treatment | smoking):
treatment <- expand.grid(
  smoking = c(0, 1),
  treatment = c(0, 1)
) %>% arrange(smoking, treatment)
treatment$proba <- c(0.8, 0.2, 0.45, 0.55)

# Pr(death | treatment, smoking):
death <- expand.grid(
  treatment = c(0, 1),
  smoking = c(0, 1),
  dead = c(0, 1)
) %>%
  arrange(treatment, smoking, dead)
death$proba <- c(0.9, 0.1, 0.2, 0.8, 0.89, 0.11, 0.5, 0.5)
I can do this manually here because it's a very basic DAG, but I was wondering whether it can be done in a more scalable way, using something like bnlearn.
Current solution:

# Draw smoking status: Pr(smoking = 1) = 0.3
db <- data.frame(
  smoking = rbinom(n = 1000000, size = 1, prob = 0.3)
)

# Draw treatment conditional on smoking:
# Pr(treatment = 1 | smoking = 0) = 0.2, Pr(treatment = 1 | smoking = 1) = 0.55
db$treatment[db$smoking == 0] <- rbinom(n = sum(db$smoking == 0), size = 1, prob = 0.2)
db$treatment[db$smoking == 1] <- rbinom(n = sum(db$smoking == 1), size = 1, prob = 0.55)

# Draw death conditional on treatment and smoking
db$dead[db$treatment == 0 & db$smoking == 0] <- rbinom(
  n = sum(db$treatment == 0 & db$smoking == 0),
  size = 1, prob = 0.1
)
db$dead[db$treatment == 0 & db$smoking == 1] <- rbinom(
  n = sum(db$treatment == 0 & db$smoking == 1),
  size = 1, prob = 0.8
)
db$dead[db$treatment == 1 & db$smoking == 0] <- rbinom(
  n = sum(db$treatment == 1 & db$smoking == 0),
  size = 1, prob = 0.11
)
db$dead[db$treatment == 1 & db$smoking == 1] <- rbinom(
  n = sum(db$treatment == 1 & db$smoking == 1),
  size = 1, prob = 0.5
)
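As a quick sanity check (this uses only base R and the db simulated above), the empirical conditional frequencies can be compared against the specified CPTs:

# Empirical Pr(treatment | smoking); expect roughly 0.2 and 0.55 for treatment = 1
prop.table(table(db$smoking, db$treatment), margin = 1)

# Empirical Pr(dead | treatment, smoking);
# expect roughly 0.1, 0.8, 0.11, 0.5 for dead = 1
prop.table(table(db$treatment, db$smoking, db$dead), margin = c(1, 2))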
It will be easier to let existing packages do this for you, like bnlearn. You can use custom.fit to specify the DAG and the CPTs, and then use rbn to draw samples from it. An example:
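Below is a sketch of what that could look like for this DAG. The object names (cpt_smoking, cpt_treatment, cpt_dead) and the "0"/"1" factor levels are my own choices; the probabilities are the ones specified above, with the node's own dimension listed first in each CPT, as custom.fit expects for discrete nodes:

library(bnlearn)

# DAG structure: smoking -> treatment, {smoking, treatment} -> dead
dag <- model2network("[smoking][treatment|smoking][dead|treatment:smoking]")

# Pr(smoking): 0.7 / 0.3
cpt_smoking <- matrix(c(0.7, 0.3), ncol = 2,
                      dimnames = list(NULL, c("0", "1")))

# Pr(treatment | smoking): one column per smoking level
cpt_treatment <- matrix(c(0.8, 0.2, 0.45, 0.55), ncol = 2,
                        dimnames = list(treatment = c("0", "1"),
                                        smoking = c("0", "1")))

# Pr(dead | treatment, smoking): the node's own dimension comes first,
# and values are filled with the first index varying fastest
cpt_dead <- array(c(0.9, 0.1,    # treatment = 0, smoking = 0
                    0.89, 0.11,  # treatment = 1, smoking = 0
                    0.2, 0.8,    # treatment = 0, smoking = 1
                    0.5, 0.5),   # treatment = 1, smoking = 1
                  dim = c(2, 2, 2),
                  dimnames = list(dead = c("0", "1"),
                                  treatment = c("0", "1"),
                                  smoking = c("0", "1")))

# Attach the CPTs to the DAG, then draw 1 million samples
fit <- custom.fit(dag, dist = list(smoking = cpt_smoking,
                                   treatment = cpt_treatment,
                                   dead = cpt_dead))
db <- rbn(fit, n = 10^6)

Printing fit shows the CPT attached to each node, which is an easy way to confirm the arrays were filled in the intended order before sampling.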