Invalid data binding expression when running AzureML pipeline


I'm running an AzureML pipeline using the command line where the sole job (for now) is a sweep.

When I run run_id=$(az ml job create -f path_to_pipeline/pipeline.yaml --query name -o tsv -g grp_name -w ws-name), I get the following error:

ERROR: Met error <class 'Exception'>:{
  "result": "Failed",
  "errors": [
    {
      "message": "Invalid data binding expression: inputs.data, outputs.model_output, search_space.batch_size, search_space.learning_rate",
      "path": "command",
      "value": "python train.py --data_path ${{inputs.data}} --output_path ${{outputs.model_output}} --batch_size ${{search_space.batch_size}} --learning_rate ${{search_space.learning_rate}}"
    }
  ]
}

The pipeline.yaml looks like this:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: pipeline_with_hyperparameter_sweep
description: Tune hyperparameters
settings:
  default_compute: azureml:compute-name  # sub with your compute name
jobs:
  sweep_step:
    type: sweep
    inputs:
      data:
        type: uri_file
        path: azureml:code_train_data:1  #data store I created
    outputs:
      model_output:
    sampling_algorithm: random
    search_space:
      batch_size:
        type: choice
        values: [1, 5, 10, 15]
      learning_rate:
        type: loguniform
        min_value: -6.90775527898 # ln(0.001)
        max_value: -2.30258509299 # ln(0.1)
    trial:
      code: ../src
      command: >-
        python train.py 
        --data_path ${{inputs.data}} 
        --output_path ${{outputs.model_output}} 
        --batch_size ${{search_space.batch_size}} 
        --learning_rate ${{search_space.learning_rate}}
      environment: azureml:env_finetune_component:1
    objective:
      goal: maximize
      primary_metric: bleu_score
    limits:
      max_total_trials: 5
      max_concurrent_trials: 3
      timeout: 3600
      trial_timeout: 720

For the train.py file, note that I of course have a lot of actual code in the main function, but I stubbed it out with pass to check whether that makes a difference, and the error is the same. So the problem is upstream in the bindings, not in anything inside train.py:

import argparse

def main(args):
    pass

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_arguments("--data_path")
    parser.add_arguments("--output_path")
    parser.add_arguments("--batch_size", type=int)
    parser.add_arguments("--learning_rate", type=float)
    args = parser.parse_args()

    return args


if __name__ == "__main__":

    args = parse_args()

    main(args)

If helpful, here's the output when I run az version:

{
  "azure-cli": "2.53.0",
  "azure-cli-core": "2.53.0",
  "azure-cli-telemetry": "1.1.0",
  "extensions": {
    "ml": "2.20.0"
  }
}

There are 2 answers

matsuo_basho (Best Answer)

I found the solution. In pipeline.yaml, the trial field should simply point to a separate command-component YAML file, i.e. trial: filename.yaml:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: codegen_sweep
description: Tune hyperparameters
settings:
  default_compute: azureml:roma2
jobs:
  sweep_step:
    type: sweep
    inputs:
      data_path:
        type: uri_file
        path: azureml:code_train_data:1
    outputs:
      model_output:
    sampling_algorithm: random
    search_space:
      batch_size:
        type: choice
        values: [1, 5, 10, 15]
      learning_rate:
        type: loguniform
        min_value: -6.90775527898 # ln(0.001)
        max_value: -2.30258509299 # ln(0.1)
    trial: ./train.yaml
    objective:
      goal: maximize
      primary_metric: eval_bleu_score # how mlflow outputs in other models
    limits:
      max_total_trials: 5
      max_concurrent_trials: 3
      timeout: 3600 # 1 hour
      trial_timeout: 720 # 12 minutes

There was another problem. In train.yaml, my source directory sits alongside the YAML file's directory rather than inside it, so I needed to reference it as ../src:

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command

name: train_model
display_name: train_model
version: 1

inputs:
  data_path:
    type: uri_file
  batch_size:
    type: integer
  learning_rate:
    type: number

outputs:
  model_output:
    type: mlflow_model

code: ../src

environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest

command: >-
  python train.py
  --data_path ${{inputs.data_path}}
  --output_path ${{outputs.model_output}}
  --batch_size ${{inputs.batch_size}}
  --learning_rate ${{inputs.learning_rate}}

Note that I simplified the arguments just to focus on getting this working. I also fixed the parser.add_arguments calls (the method is parser.add_argument), as pointed out in one of the comments; a corrected sketch is below.
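For reference, here is a minimal sketch of what the corrected train.py looks like: parser.add_argument (singular) for each argument, plus logging of the metric the sweep optimizes. The MLflow call and the placeholder BLEU value are illustrative only (and assume mlflow is available in the job environment); the real script computes the score during evaluation.

import argparse

import mlflow  # assumed available in the job environment; used to log the sweep's primary metric


def parse_args():
    parser = argparse.ArgumentParser()
    # add_argument (singular), not add_arguments
    parser.add_argument("--data_path")
    parser.add_argument("--output_path")
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    return parser.parse_args()


def main(args):
    # ... real training / evaluation code goes here ...
    bleu = 0.0  # placeholder; compute the actual BLEU score
    # the metric name must match objective.primary_metric in the sweep (eval_bleu_score above)
    mlflow.log_metric("eval_bleu_score", bleu)


if __name__ == "__main__":
    main(parse_args())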

Pavan V Parekh

The error message "Invalid data binding expression" typically indicates an issue with how the inputs and outputs are referenced in your AzureML pipeline YAML file.

Looking at your YAML file, everything seems to be structured correctly. However, there's one small mistake in the outputs section of the sweep_step job. It's missing the type property.

Here's the corrected section:

outputs:
  model_output:
    type: azureml:artifact

Make sure to add the type property with the value azureml:artifact as shown above. This specifies that model_output is an output artifact.

If you still encounter the error after making this change, double-check all your references and paths in the pipeline YAML file. Ensure that the inputs and outputs are correctly defined and referenced throughout the file.