I have created a monitoring schedule to monitor predictions from a Batch Transform job. The schedule runs fine when the input dataset_format in BatchTransformInput is csv. However, my batch job is part of a workflow that takes gz-format input.
The documentation suggests that MonitoringDatasetFormat only supports csv, json and parquet. Can I define it as gz?
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor import CronExpressionGenerator
from sagemaker.model_monitor import BatchTransformInput
from sagemaker.model_monitor import MonitoringDatasetFormat
from time import gmtime, strftime

my_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

my_monitor.create_monitoring_schedule(
    monitor_schedule_name=mon_schedule_name,
    # Inputs to run the monitoring schedule on the batch transform
    batch_transform_input=BatchTransformInput(
        data_captured_destination_s3_uri=s3_capture_upload_path,
        destination="/opt/ml/processing/input",
        dataset_format=MonitoringDatasetFormat.csv(header=False),
    ),
    output_s3_uri=s3_report_path,
    statistics=statistics_path,
    constraints=constraints_path,
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)
The default model monitor supports only these formats (csv, json and parquet). I think you can use post-processing to convert the data from gz to one of these formats. Please refer to the link below for details on pre- and post-processing: https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-pre-and-post-processing.html
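As a rough illustration of the conversion step, here is a minimal sketch that decompresses a gzip file into a plain file using only the Python standard library. It assumes the .gz file is simply a gzip-compressed CSV. The function name gunzip_file and the file paths are hypothetical, and where this step runs in your workflow (for example, before the data lands in the S3 location the monitor reads from) is up to you.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src: str, dst: str) -> Path:
    """Decompress the gzip file at `src` into a plain file at `dst`.

    Hypothetical helper: assumes `src` is a gzip-compressed CSV, so the
    output can be read with MonitoringDatasetFormat.csv(...).
    """
    dst_path = Path(dst)
    # Create the output directory if it does not exist yet.
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    # Stream the data in chunks so large files do not need to fit in memory.
    with gzip.open(src, "rb") as compressed, open(dst_path, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dst_path
```

For example, gunzip_file("predictions.csv.gz", "converted/predictions.csv") would write the decompressed CSV to converted/predictions.csv, which could then be uploaded to the location the monitoring schedule reads from.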