Azure batch job start tasks failed


I'm using the Azure Batch Python API. When I create a new job, I see exit code 128 (image attached). How can I find out the reason for it?

[Image: exit code error]

I'm creating the new job with this code:

def wrap_commands_in_shell(commands):
    return "/bin/bash -c 'set -e; set -o pipefail; {}; wait'".format(';'.join(commands))

job_tasks = ['cd /mnt/batch/tasks/shared/ && git clone https://github.com/cryptobiu/OSPSI.git',
             'cd /mnt/batch/tasks/shared/OSPSI && git checkout cloud',
             'cd /mnt/batch/tasks/shared/OSPSI && cmake CMake',
             'cd /mnt/batch/tasks/shared/OSPSI && mkdir -p assets'
             ]

job_creation_information = batch.models.JobAddParameter(
    job_id,
    batch.models.PoolInformation(pool_id=pool_id),
    job_preparation_task=batch.models.JobPreparationTask(
        command_line=wrap_commands_in_shell(job_tasks),
        run_elevated=True,
        wait_for_success=True
    )
)

There is 1 answer

fpark (best answer)

To diagnose this, look at the stderr.txt and stdout.txt files of the job preparation task that failed. You can view them in the Azure Portal, in Azure Batch Explorer, or retrieve them in code through an SDK. Find the node that ran the job preparation task, navigate to that node, then to the job directory. Under the job directory you should see a jobpreparation directory, which contains stderr.txt and stdout.txt.
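If you want to pull those files in code, something like the following might work. It is only a sketch: it assumes batch_client is an already authenticated azure.batch.BatchServiceClient, that the log files sit directly under the task root directory reported by the service, and the helper name fetch_job_prep_logs is made up for this example.

```python
def fetch_job_prep_logs(batch_client, job_id, file_names=("stdout.txt", "stderr.txt")):
    """Collect job preparation task logs from every node that ran the task.

    Assumption: batch_client is an authenticated azure.batch.BatchServiceClient.
    """
    results = []
    # One status entry is returned per node that ran the job preparation task.
    for status in batch_client.job.list_preparation_and_release_task_status(job_id):
        prep = status.job_preparation_task_execution_information
        if prep is None:  # no job preparation task ran on this node
            continue
        logs = {}
        for name in file_names:
            # Assumption: stdout.txt and stderr.txt live directly under the
            # task root directory of the job preparation task.
            path = "{}/{}".format(prep.task_root_directory, name)
            stream = batch_client.file.get_from_compute_node(
                status.pool_id, status.node_id, path)
            # The file is streamed back in chunks of bytes.
            logs[name] = b"".join(stream).decode("utf-8", errors="replace")
        results.append({
            "node_id": status.node_id,
            "exit_code": prep.exit_code,
            "logs": logs,
        })
    return results
```

Printing the "stderr.txt" entry of each result should show the message of whichever command in the job preparation task failed.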

With regard to the exit code, there are a few potential problems that could cause this:

  1. Did you install git, cmake and any other dependencies as part of a start task?
  2. I get a 404 when I try to navigate to: https://github.com/cryptobiu/OSPSI. Does this repo exist? If it's a private repository, are you providing the correct credentials?
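To tell a missing dependency apart from a failed clone, one option is to put a few preflight checks in front of the real commands, so that stderr.txt names the missing tool. This is just a sketch: require_tools is a hypothetical helper, and the exit status 127 is only the usual shell convention for "command not found", not something Batch requires.

```python
def require_tools(tools):
    """Return one shell command per tool that fails loudly if the tool is missing.

    The commands use only double quotes, so they stay valid inside the
    single-quoted string built by wrap_commands_in_shell from the question.
    """
    template = ('command -v {0} >/dev/null 2>&1 || '
                '{{ echo "missing dependency: {0}" >&2; exit 127; }}')
    return [template.format(tool) for tool in tools]


# Example: the checks that would be prepended to job_tasks.
preflight = require_tools(["git", "cmake"])
```

With this in place, job_tasks could become preflight + job_tasks, and a node without git would fail with "missing dependency: git" in stderr.txt instead of a bare exit code.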

A few notes about your job_tasks array:

  • You should not hardcode the path /mnt/batch/tasks/shared. The path to the "shared" directory may differ between Linux distributions. Use the environment variable $AZ_BATCH_NODE_SHARED_DIR instead. The Azure Batch documentation has a full list of the environment variables that Batch pre-defines.
  • You do not need to cd into the directory for each command; you only need to do it once. You can rewrite job_tasks as: ['cd $AZ_BATCH_NODE_SHARED_DIR', 'TODO: INSERT YOUR COMMANDS TO SETUP AUTH WITH GITHUB FOR PRIVATE REPO', 'git clone https://github.com/cryptobiu/OSPSI.git', 'cd OSPSI', 'cmake CMake', 'mkdir -p assets']
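Putting both notes together, the command line could be built roughly like this. It reuses wrap_commands_in_shell from the question unchanged, keeps the question's git checkout cloud step (an assumption that you still want that branch), and leaves the private-repo authentication as a comment because those commands depend on your setup.

```python
def wrap_commands_in_shell(commands):
    # Same wrapper as in the question: run all commands in a single bash
    # process and stop at the first failure.
    return "/bin/bash -c 'set -e; set -o pipefail; {}; wait'".format(';'.join(commands))


job_tasks = [
    # cd once; the commands that follow run in the same shell.
    'cd $AZ_BATCH_NODE_SHARED_DIR',
    # Commands that set up GitHub auth for a private repo would go here.
    'git clone https://github.com/cryptobiu/OSPSI.git',
    'cd OSPSI',
    'git checkout cloud',
    'cmake CMake',
    'mkdir -p assets',
]

command_line = wrap_commands_in_shell(job_tasks)
print(command_line)
```

Python leaves $AZ_BATCH_NODE_SHARED_DIR untouched in the string, so bash expands it on the node when the task runs. Because the wrapper uses single quotes, none of the commands may contain a single quote themselves.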