Using Azure ML components and pipelines, how can I split a larger-than-disk (PGN) file into shards and save the output files to a designated uri_folder in blob storage? Best practices for achieving this are welcome.
I set up a component and a pipeline with the following YAML configuration files:
Component
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: split_file_to_shards
display_name: Split file to shards
version: 0.0.9
type: command
inputs:
  input_data_file:
    type: uri_file
    mode: ro_mount
outputs:
  output_data_dir:
    type: uri_folder
    mode: rw_mount
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
code: ./
command: >-
  split -u -n r/100 --verbose ${{inputs.input_data_file}} ${{outputs.output_data_dir}}
Pipeline
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
experiment_name: sample-experiment
compute: azureml:vm-cluster-cpu
inputs:
  input_data_file:
    type: uri_file
    path: azureml:larger-than-disk-file@latest
outputs:
  output_data_dir:
    type: uri_folder
    path: azureml://datastores/<blob_storage_name>/paths/<path_to_folder>/
jobs:
  split_pgn_to_shards:
    type: command
    component: azureml:split_file_to_shards@latest
    inputs:
      input_data_file: ${{parent.inputs.input_data_file}}
    outputs:
      output_data_dir: ${{parent.outputs.output_data_dir}}
Run commands
> az ml component create -f component.yml
> az ml job create -f pipeline.yml
I expect Azure ML to mount the input file with ro_mount and to write the processed files through rw_mount. As I understand it, the remaining modes, download and upload, copy the file to the VM's local disk before processing and upload the files to the mount after processing, respectively, which is not what I want.
The -u argument to split requests unbuffered writes to the output.
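To make the intended behavior concrete, here is a minimal Python sketch of what I understand split -u -n r/100 to do: read the input as a stream and hand out lines round-robin to a fixed number of shard files, so that memory use stays bounded regardless of input size. The function name split_round_robin and the shard_NNN file naming are hypothetical and only for illustration; this is not what the component runs. It also says nothing about how a mounted datastore caches data on the node. Note that, like r/100, it distributes single lines, so a multi-line PGN game would end up spread across shards.

```python
from pathlib import Path


def split_round_robin(src_path, out_dir, num_shards, prefix="shard_"):
    """Stream src_path and distribute its lines round-robin across num_shards files.

    Hypothetical helper for illustration; names and file layout are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{prefix}{i:03d}" for i in range(num_shards)]
    # buffering=0 requests unbuffered binary writes, similar in spirit to `split -u`.
    handles = [open(p, "wb", buffering=0) for p in paths]
    try:
        with open(src_path, "rb") as src:
            # Iterating the file object yields one line at a time,
            # so the whole input is never held in memory.
            for i, line in enumerate(src):
                handles[i % num_shards].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

In the component above, src_path and out_dir would correspond to the mounted ${{inputs.input_data_file}} and ${{outputs.output_data_dir}} paths.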
In the job's Network I/O monitoring, however, I see the file being downloaded to disk, which I did not expect. In addition, the component fails with the following error:
Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU.
Total space: 6958 MB, available space: 1243 MB (under AZ_BATCH_NODE_ROOT_DIR).