I'm trying to figure out how to extract the read-group lane information from a fastq file, and then use this string within my GATK AddOrReplaceReadGroups Snakemake below (below). I've written a short Python function (at the top of the rule) to do this operation, but can't quite figure out how to actually use it in the script such that I can access the string output (e.g., "ABCHGSX35:2") of the function on a given fastq file in my GATK shell command. Basically, I want to feed {params.PathToReadR1} into the extract_lane_info function, but can't figure out how to integrate them correctly.
Open to any solutions to the problem posed, or if there's an entirely more efficiently and different way to achieve the same result (getting the read-group lane info as a param value), that'd be great, too. Thanks for any help.
def extract_lane_info(file_path):
# Run the bash command using subprocess.run()
elements = subprocess.run(["zcat", file_path], stdout=subprocess.PIPE).stdout.split(b"\n")[0].split(b":")[2:4]
# Extract the lane information from the output of the command using a regular expression
read_group = elements[0].decode().strip("'")
string_after = elements[1].decode().strip("'")
elements = read_group + ":" + string_after
return(elements)
rule AddOrReplaceReadGroups:
input:
"results/{species}/bwa/merged/{read}_pese.bam",
output:
"results/{species}/GATK/AddOrReplace/{read}.pese.bwa_mem.addRG.bam",
threads: config["trimmmomatic"]["cpu"]
log:
"results/{species}/GATK/AddOrReplace/log/{read}_AddOrReplace.stderrs"
message:
"Running AddOrReplaceReadGroups on {wildcards.read}"
conda:
config["CondaEnvs"]
params:
ReadGroupID = lambda wildcards: final_Prefix_ReadGroup[wildcards.read],
PathToReadR1 = lambda wildcards: final_Prefix_RawReadPathR1[wildcards.read],
LIBRARY = lambda wildcards: final_Prefix_ReadGroupLibrary[wildcards.read],
shell:"gatk AddOrReplaceReadGroups -I {input} -O {output} -ID {params.ReadGroupID}.1 -LB {params.LIBRARY} -PL illumina -PU {input.lane_info}:{params.ReadGroupID} -SM {params.ReadGroupID} --SORT_ORDER 'coordinate' --CREATE_INDEX true 2>> {log}"
Basically, {params.PathToReadR1} would be "path/to/file.fastq.gz", and I want this file to be inputted into the extract_lane_info function, and then the output of this function to be used in the -PU section of the shell command (e.g., {params.lane_info}. I keep getting all types of errors as I've messed around with it, and am unsure how to move forward.