Match individual records during batch predictions with a Vertex AI pipeline

I have a custom model in Vertex AI and a table that stores the model's features along with a record_id column.
I am building a pipeline component for batch prediction and facing a critical issue: when I submit the batch prediction, I have to exclude the record_id from the job input, but then how can I map the results back to the records if the record_id is not in the output?

from google.cloud import bigquery
from google.cloud import aiplatform

aiplatform.init(project=project_id)
client = bigquery.Client(project=project_id)

query = '''
SELECT * except(record_id) FROM `table`
'''
df = client.query(query).to_dataframe()  # drop record_id and load the features into another table
job = client.load_table_from_dataframe(df, "table_wo_id")
job.result()  # wait for the load job to complete

clf = aiplatform.Model(model_name='custom_model')
clf.batch_predict(job_display_name='custom model batch prediction',
                  bigquery_source='bq://table_wo_id',
                  instances_format='bigquery',
                  bigquery_destination_prefix='bq://prediction_result_table',
                  predictions_format='bigquery',
                  machine_type='n1-standard-4',
                  max_replica_count=1
                  )

As in the example above, there is no record_id column in prediction_result_table, so there is no way to map each result back to its record.

There is 1 answer.

Answer by Juan Egas:

I found a REST API solution to exclude fields from a Vertex AI batch prediction job (sample notebook linked at the end).

I also noticed that when the output format is jsonl, the output of the batch_predict method preserves the order of the input instances, so you can match each prediction to its record_id by index (as a temporary workaround); a sketch of that matching follows.
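
For example, a minimal sketch of the index-based matching (the table name, the file name predictions.jsonl, and the id query are illustrative assumptions; in practice both the feature query and the id query should share the same ORDER BY so the row order is deterministic):

import json

from google.cloud import bigquery

# record_ids must be in the same order as the instances submitted to
# batch_predict; predictions.jsonl is the downloaded jsonl output.
client = bigquery.Client(project=project_id)
record_ids = [
    row["record_id"]
    for row in client.query("SELECT record_id FROM `table` ORDER BY record_id")
]

with open("predictions.jsonl") as f:
    predictions = [json.loads(line) for line in f]

# Pair each prediction with its id by position.
matched = dict(zip(record_ids, predictions))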

I am still wondering whether someone has found a solution using the Python SDK. Meanwhile, the REST API approach works as follows.

Create the configuration:

import json

# Placeholder values -- illustrative only; replace with your own resources.
BATCH_JOB_NAME = "custom-model-batch-prediction"
MODEL_URI = "projects/PROJECT_ID/locations/REGION/models/MODEL_ID"
INPUT_FORMAT = "bigquery"
INPUT_URI = "bq://project.dataset.table"  # source table, record_id included
OUTPUT_FORMAT = "bigquery"
OUTPUT_URI = "bq://project.dataset"       # destination dataset
MACHINE_TYPE = "n1-standard-4"
EXCLUDED_FIELDS = ["record_id"]           # fields not sent to the model

request_with_excluded_fields = {
    "displayName": f"{BATCH_JOB_NAME}-excluded_fields",
    "model": MODEL_URI,
    "inputConfig": {
        "instancesFormat": INPUT_FORMAT,
        "bigquerySource": {"inputUri": INPUT_URI},
    },
    "outputConfig": {
        "predictionsFormat": OUTPUT_FORMAT,
        "bigqueryDestination": {"outputUri": OUTPUT_URI},
    },
    "dedicatedResources": {
        "machineSpec": {
            "machineType": MACHINE_TYPE,
        }
    },
    "instanceConfig": {"excludedFields": EXCLUDED_FIELDS},
}

with open("request_with_excluded_fields.json", "w") as outfile:
    json.dump(request_with_excluded_fields, outfile)

Then you send the request:

!curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @request_with_excluded_fields.json \
  https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/batchPredictionJobs
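
The same request can also be sent from Python with the google-auth and requests libraries instead of curl (a sketch; REGION, PROJECT_ID, and the request dict are the placeholders defined above):

import google.auth
import google.auth.transport.requests
import requests

# Obtain application-default credentials and an access token.
credentials, _ = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())

url = (
    f"https://{REGION}-aiplatform.googleapis.com/v1beta1/"
    f"projects/{PROJECT_ID}/locations/{REGION}/batchPredictionJobs"
)
response = requests.post(
    url,
    headers={"Authorization": f"Bearer {credentials.token}"},
    json=request_with_excluded_fields,
)
response.raise_for_status()
print(response.json()["name"])  # resource name of the created batch prediction job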

Sourced from this sample notebook: https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/custom_batch_prediction_feature_filter.ipynb
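
As an aside on the Python SDK question: newer google-cloud-aiplatform releases appear to accept the instanceConfig fields directly on Model.batch_predict. The sketch below assumes such a release, and the excluded_fields parameter is an assumption to verify against your installed SDK version:

from google.cloud import aiplatform

# Assumes a recent google-cloud-aiplatform release where batch_predict
# exposes excluded_fields (mirroring instanceConfig.excludedFields).
aiplatform.init(project=project_id)
clf = aiplatform.Model(model_name="custom_model")
clf.batch_predict(
    job_display_name="custom model batch prediction",
    bigquery_source="bq://table",  # original table, record_id still present
    instances_format="bigquery",
    bigquery_destination_prefix="bq://prediction_result_table",
    predictions_format="bigquery",
    machine_type="n1-standard-4",
    max_replica_count=1,
    excluded_fields=["record_id"],  # assumed parameter; check your SDK version
)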