When running Bioconductor packages in my Snakemake pipelines on our compute cluster, I create bespoke Singularity containers with the Bioconductor packages I need. This has worked smoothly so far for processes that are not resource-hungry.
I'm currently using a Bioconductor package called tradeSeq, which uses BiocParallel to parallelise jobs for the functions fitGAM() and evaluateK(). Here is a snippet of the relevant R code I use to run this:
# Set resources
message('Setting seed and workers ...')
set.seed(10) # fitGAM is stochastic
message('Cores available on Hawk: ', parallel::detectCores())
BPPARAM <- BiocParallel::bpparam() # currently registered (default) back-end
BPPARAM$workers <- 20 # use 20 cores
BPPARAM # lists current options
# Get Knots
message('Getting knots ...')
aicMat <- evaluateK(counts = counts(shi_sce), pseudotime = pseudotime, cellWeights = cellWeights,
                    k = 3:20, nGenes = 500, verbose = TRUE, plot = TRUE, parallel = TRUE)
# Run fitGAM
message('Running fitGAM ...')
shi_sce <- fitGAM(counts = counts(shi_sce), pseudotime = pseudotime, cellWeights = cellWeights,
                  nknots = 7, verbose = FALSE, parallel = TRUE, genes = var_genes)
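Ideally I'd like the worker count to simply track whatever Slurm allocates to the job. A minimal sketch of what I have in mind (untested, and assuming the SLURM_CPUS_PER_TASK environment variable is actually visible inside the container) would be:
# Untested sketch: derive the worker count from the Slurm allocation
# (assumes SLURM_CPUS_PER_TASK is set inside the Singularity container)
n_cpus <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
BPPARAM <- BiocParallel::MulticoreParam(workers = max(1L, n_cpus - 1L))
shi_sce <- fitGAM(counts = counts(shi_sce), pseudotime = pseudotime, cellWeights = cellWeights,
                  nknots = 7, genes = var_genes, parallel = TRUE, BPPARAM = BPPARAM)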
My issue is that it's not clear to me how to align the BiocParallel resource assignment with the resources I have specified via Slurm / Snakemake. When running my R script I get the following error:
Error in reducer$value.cache[[as.character(idx)]] <- values :
wrong args for environment subassignment
Calls: evaluateK ... .bploop_impl -> .collect_result -> .reducer_add -> .reducer_add
Here is my Snakemake rule:
rule slingshot:
    input: "../results/01R_objects/seurat_shi_bc.rds",
    output: "../results/01R_objects/sce_shi_bc.rds",
    singularity: "../resources/containers/slingshot_latest.sif"
    resources: tasks = 1, mem_mb = 100000, threads = 21, nodes = 1
    params: results_dir = "../results/"
    log: "../results/00LOG/06trajectory_inference/slingshot.log"
    script:
        "../scripts/snRNAseq_GE_slingshot.R"
And my snakemake profile config:
snakefile: Snakefile
cores: 1
#use-conda: True
use-singularity: True
keep-going: True
jobs: 10
rerun-incomplete: True
restart-times: 1
cluster:
  mkdir -p ../results/00LOG/smk-logfiles &&
  sbatch
    --qos=maxjobs500
    --ntasks={resources.tasks}
    --mem={resources.mem_mb}
    --time={resources.time}
    --cpus-per-task={resources.threads}
    --job-name=smk-{rule}
    --output=../results/00LOG/smk-logfiles/{rule}.%j.out
    --error=../results/00LOG/smk-logfiles/{rule}.%j.err
    --account=scw1641
default-resources:
- ntasks=1
- mem_mb=5000
- time="3-00:00:00"
I have tried using MulticoreParam() and SnowParam() as the BiocParallel back-end.
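Concretely, the params I tried looked roughly like this (a sketch from memory; the exact arguments may have differed):
# Sketch of the kind of params I tried (from memory; arguments may have differed)
library(BiocParallel)
BPPARAM <- SnowParam(workers = 20, type = "SOCK")  # also tried MulticoreParam(workers = 20)
register(BPPARAM)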
The other option I've explored is BiocParallel's BatchtoolsParam(), which builds on the batchtools package, using the following in my R script:
BatchtoolsParam(workers = 20, cluster = "slurm", template = tmpl)
But I'm not sure how to obtain a path for the Slurm script that Snakemake spawns so that I can pass it to the template argument of BatchtoolsParam(), or whether this is the best course of action.
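I'm also unsure whether template is even meant to point at that job script, or rather at a batchtools brew template such as the example one shipped with the batchtools package, along these lines (untested sketch; the template file name is my assumption):
# Untested sketch; "slurm-simple.tmpl" is assumed to be the example Slurm
# template shipped in the batchtools package
tmpl <- system.file("templates", "slurm-simple.tmpl", package = "batchtools")
BPPARAM <- BiocParallel::BatchtoolsParam(workers = 20, cluster = "slurm", template = tmpl)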
When I run BatchtoolsParam(workers = 20, cluster = "slurm") with the template argument omitted altogether, in place of the BPPARAM$workers line in the R code above, I get the following error:
Setting seed and workers ...
Cores available on Hawk: 40
Error in BatchtoolsParam(workers = 20, cluster = "slurm") :
'slurm' supported but not available on this machine
This suggests that, although Snakemake is scheduling the R script onto the Slurm cluster, Slurm is not being detected from within R while the script is running (presumably because the Slurm client commands such as sbatch are not visible inside the container), and perhaps that the template file is necessary.
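A quick way to check that assumption from inside the job would presumably be something like:
# Hypothetical check: an empty result would mean sbatch is not on the PATH
# inside the container
message("sbatch found at: '", Sys.which("sbatch"), "'")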
I've also tried running the script interactively within the container, bypassing Snakemake altogether and using batchtools. The fitGAM() function runs, but the job does not appear to be parallelised properly.
Any suggestions on how to efficiently parallelise Bioconductor processes within a container using Snakemake / Slurm would be greatly appreciated.