Azure ML Pipelines user_identity on parallel job


I'm using `user_identity` to read data from Azure Data Lake and save it to a datastore. I then want to use that datastore as input to a parallel job, but I keep running into this error:

Please specify a intermediate datastore for Parallel Run Step run-time when credential passthrough is enabled. Parallel Run Step will use your user identity to acceess the datastore. Warning: Please carefully control the scale of access to prevent intermediate data leak!

Here's my job definition for reference:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: model
description: a model
identity:
  type: user_identity

jobs:

  copy_model:
    type: command
    compute: azureml:default-compute
    environment: azureml:default-env
    inputs:
      input_path:
        type: uri_folder
        path: azureml://datastores/default/paths/
        mode: ro_mount
    outputs:
      output_path:
        type: uri_folder
        path: azureml://datastores/${{default_datastore}}/paths/model
    command: |
      rsync -ah --progress ${{inputs.input_path}}/model ${{outputs.output_path}}

  copy_data:
    type: command
    compute: azureml:default-compute
    environment: azureml:default-env
    inputs:
      input_path:
        type: uri_folder
        path: azureml://datastores/default/paths/
        mode: ro_mount
      input_folder: folder
    outputs:
      output_path:
        type: uri_folder
        path: azureml://datastores/${{default_datastore}}/paths/data
    command: |
      rsync -ah --progress ${{inputs.input_path}}/${{inputs.input_folder}} ${{outputs.output_path}}

  model:
    type: parallel
    compute: azureml:default-compute
    inputs:
      score_model:
        type: uri_folder
        path: ${{parent.jobs.copy_model.outputs.output_path}}
        mode: ro_mount
      job_data_path:
        type: uri_folder
        path: ${{parent.jobs.copy_data.outputs.output_path}}
        mode: ro_mount
    outputs:
      output_path: 
        type: uri_file
        path: azureml://datastores/${{default_datastore}}/paths/results/output.csv
        mode: rw_mount

    mini_batch_size: "1"
    resources:
      instance_count: 1
    mini_batch_error_threshold: 5
    logging_level: "DEBUG"
    input_data: ${{inputs.job_data_path}}
    max_concurrency_per_instance: 2
    retry_settings:
      max_retries: 2
      timeout: 60
    
    task:
      type: run_function
      code: ./src
      entry_script: model.py
      environment: azureml:default-env
      program_arguments: >-
        --model-path ${{inputs.score_model}}
      append_row_to: ${{outputs.output_path}}
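One thing worth checking: the YAML above references `${{default_datastore}}` in several output paths but never declares a pipeline-level default datastore. In the v2 pipeline job schema, that default can be set under a top-level `settings:` block. A minimal sketch of what that might look like is below; the datastore name `workspaceblobstore` is an assumption and should be replaced with a registered datastore in your workspace:

```yaml
# Hypothetical sketch: declare a pipeline-level default datastore so the
# ${{default_datastore}} references elsewhere in the pipeline can resolve.
# "workspaceblobstore" is an assumed datastore name, not taken from the
# question -- substitute the name of a datastore registered in your workspace.
settings:
  default_datastore: azureml:workspaceblobstore
```

Since the error mentions an intermediate datastore for Parallel Run Step under credential passthrough, a pipeline-level default is one place the runtime may look for it, though I have not confirmed this resolves the error.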