I have an R script in my workflow that needs the number of entries in several CSV files (a.csv, b.csv, c.csv; formatted with headers) as a value. Since every entry line contains the string "something", I thought I could write the rule as follows:
configfile: config.yaml
WILDCARD = config['wildcard']
TEMP_DIR = "~/temp"

rule all:
    input:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"

rule combine_geno_pheno_data_sibs:
    input:
        f"{TEMP_DIR}/file.txt",
        f"{TEMP_DIR}/folder/{{WILDCARD}}/another_file.txt",
        f"config['file']",
    output:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
    params:
        n_lines = shell(
            "grep -c something ../resources/{{WILDCARD}}.csv | xargs"
        )
    script:
        "scripts/use_lines.R"

config.yaml contains
wildcard:
- a
- b
- c
and n_lines is called in R as snakemake@params$n_lines.
However, the expansion inside shell() is interpreted as grep -c something ../resources/a b c.csv. How do I get it to interpret the wildcard one value at a time, e.g. as grep -c something ../resources/a.csv, and return the result to n_lines correctly?
Thanks in advance
Given the config.yaml, WILDCARD contains a list of string values, ['a', 'b', 'c'].

By applying f-string formatting to the input of rule all, you are really requesting the file ~/temp/folder/{WILDCARD}/output.txt: the doubled braces are escaped to a literal {WILDCARD} placeholder and are not filled in with any value from the list.

It's not clear if you are then requesting specific files from the command line or if the code in the question is slightly inconsistent, but to fix your rule, consider defining a utility function that counts the number of lines containing "something".
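A minimal sketch of such a utility function, assuming the CSV files live in ../resources as in the question. The names count_matching_lines, count_lines and RESOURCES_DIR are hypothetical and chosen only for illustration. Snakemake calls a function given in params with the job's wildcards, so the count is computed for one file at a time:

```python
# Assumption: same relative location as in the question's grep command.
RESOURCES_DIR = "../resources"


def count_matching_lines(path, needle="something"):
    """Count the lines in `path` that contain `needle` (similar to `grep -c`)."""
    with open(path) as handle:
        return sum(needle in line for line in handle)


def count_lines(wildcards):
    """Params function: receives the wildcards of a single job."""
    # wildcards.WILDCARD is one value (e.g. "a"), not the whole list.
    return count_matching_lines(f"{RESOURCES_DIR}/{wildcards.WILDCARD}.csv")
```

Because the function is evaluated per job, each output file gets the count for its own CSV, and no shell call is needed.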
The rule combine_geno_pheno_data_sibs can then pass this function as the n_lines param in place of the shell() call.
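A sketch of how the rule might look, assuming a helper named count_lines (a hypothetical name) that takes wildcards and returns the number of matching lines for that job's CSV file. This is a Snakefile fragment, not standalone Python:

```snakemake
rule combine_geno_pheno_data_sibs:
    input:
        f"{TEMP_DIR}/file.txt",
        f"{TEMP_DIR}/folder/{{WILDCARD}}/another_file.txt",
        # Plain lookup; the f-string in the question yields the literal text "config['file']".
        config["file"],
    output:
        f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt"
    params:
        # count_lines: hypothetical helper, called by Snakemake with this job's wildcards.
        n_lines=count_lines,
    script:
        "scripts/use_lines.R"
```

In the R script the value is still read as snakemake@params$n_lines. To build the outputs for all three values, rule all would probably need something like expand(f"{TEMP_DIR}/folder/{{WILDCARD}}/output.txt", WILDCARD=WILDCARD) as its input, so that a, b and c are each requested as concrete files.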