Troubleshooting a GATK Spark Pipeline Error in an AWS ParallelCluster Environment

I'm attempting to run the GATK (v4.4.0.0) pipeline PathSeqPipelineSpark for metagenomic analysis, but I'm hitting an error when running it in an AWS ParallelCluster environment. The specific error is:

12:56:43.768 INFO  PathSeqPipelineSpark - Spark verbosity set to INFO (see --spark-verbosity argument)
12:56:43.808 INFO  AbstractConnector - Stopped Spark@75ed636c{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
12:56:43.810 INFO  SparkUI - Stopped Spark web UI at http://compute-128-dy-r6a4xlarge-1.hpc-ireland.pcluster:4040
12:56:43.819 INFO  MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
12:56:43.837 INFO  MemoryStore - MemoryStore cleared
12:56:43.837 INFO  BlockManager - BlockManager stopped
12:56:43.840 INFO  BlockManagerMaster - BlockManagerMaster stopped
12:56:43.842 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
12:56:43.853 INFO  SparkContext - Successfully stopped SparkContext
12:56:43.853 INFO  PathSeqPipelineSpark - Shutting down engine
[October 27, 2023 at 12:56:43 PM UTC] org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=103079215104
***********************************************************************

A USER ERROR has occurred: Failed to read bam header from output/BAM/Pt0-GSM3454529-unaligned.bam
 Caused by:failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
        at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:67)
        at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:134)
        at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
        at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:679)
        at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:677)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
        at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:677)
        at java.base/javax.security.auth.login.LoginContext.login(LoginContext.java:587)
        at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:2065)
        at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1975)
        at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:719)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:669)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:579)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3746)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3736)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3520)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:288)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:524)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.disq_bio.disq.impl.file.HadoopFileSystemWrapper.isDirectory(HadoopFileSystemWrapper.java:101)
        at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:148)
        at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:127)
        at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:188)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:575)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:554)
        at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:544)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)


***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
12:56:43.856 INFO  ShutdownHookManager - Shutdown hook called
12:56:43.857 INFO  ShutdownHookManager - Deleting directory /mnt/efs/clusterfcs/apps/CSI-microbies/CSI-Microbes-identification/test-10x-v2/lscratch/4219/tmp/spark-7eff556e-f790-4a8e-868f-5e0161da833c
Using GATK jar /mnt/efs/clusterfcs/apps/GATK/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar

It seems that Hadoop (via its UserGroupInformation login) is trying to resolve the operating-system username of the process, and the lookup is returning null.
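If I read the stack trace correctly, com.sun.security.auth.module.UnixLoginModule resolves the username from the process UID via the OS (a getpwuid-style lookup) and throws when that comes back null. A quick sanity check I can run inside the sbatch job on the compute node (my own diagnostic sketch, not part of the pipeline):

id -u                      # numeric UID the job runs as
id -un                     # errors out if the UID has no name
whoami                     # the same lookup the JVM relies on
getent passwd "$(id -u)"   # does this node's passwd database know the UID?

If getent returns nothing, the compute node cannot map the UID to a name (common when /etc/passwd or the directory service is not synced to dynamically provisioned nodes), which would explain the NullPointerException.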

In this case, Spark is used by the pipeline for parallel processing, as running it serially is not feasible. Since GATK ships with a self-contained local Spark runner, further configuration theoretically shouldn't be necessary.
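For reference, the tool is invoked following the usual GATK Spark pattern, roughly like this (a simplified sketch: the microbe reference arguments are omitted, the non-input paths are placeholders, and the Spark arguments after the lone -- are the stock local-mode settings, nothing custom):

gatk PathSeqPipelineSpark \
    --input output/BAM/Pt0-GSM3454529-unaligned.bam \
    --filter-bwa-image host.fasta.img \
    --kmer-file host.hss \
    --taxonomy-file taxonomy.db \
    --output pathseq-output.bam \
    --scores-output pathseq-scores.txt \
    -- --spark-runner LOCAL --spark-master 'local[*]'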

Currently, I'm working on AWS in a ParallelCluster environment, with several Slurm partitions backed by different EC2 instance types, and I launch the jobs using sbatch.
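The submission looks roughly like the sketch below (partition name and resource values are illustrative, not my exact settings):

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --cpus-per-task=16
#SBATCH --mem=120G

# run the self-contained GATK Spark tool on the allocated node
gatk PathSeqPipelineSpark ... -- --spark-runner LOCAL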

In an attempt to address the issue, I've tried setting the following environment variables before launching the job:

export HADOOP_USER_NAME=user
export SPARK_USER=user

This was done in an effort to supply a "dummy" username and avoid the null error, but the error persists.
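If my reading of the trace is right, this workaround cannot take effect here: HADOOP_USER_NAME is consulted by Hadoop's own login module during the JAAS commit phase, whereas the failure happens earlier, in UnixLoginModule.login(), which asks the OS for the username and never looks at environment variables. One way to confirm that the OS-level lookup itself returns null, independent of GATK (a hypothetical check, assuming jshell from the same JDK is on the PATH):

# prints "null" on a node where the UID has no passwd entry
echo 'System.out.println(new com.sun.security.auth.module.UnixSystem().getUsername());' | jshell -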

Is there anything else I can try to resolve this issue? Thank you!
