I am attempting to tune a Vertex AI language model. When I click 'Start Tuning', the page loads for about a second and then stops. I also receive an "internal error occurred" message when attempting to train from the console.
After following Google's recommended troubleshooting tips by running this command:
PROJECT_ID=(I put my project id here)
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://europe-west4-aiplatform.googleapis.com/ui/projects/${PROJECT_ID}/locations/europe-west4/datasets \
-d '{
"display_name": "test-name1",
"metadata_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/metadata/image_1.0.0.yaml",
"saved_queries": [{
"display_name": "saved_query_name",
"problem_type": "IMAGE_CLASSIFICATION_MULTI_LABEL"
}]
}'
and then trying again, it still doesn't work.
I then tried to run the console command:
PROJECT_ID=(I put my project id here)
DATASET_URI=(Here I put my valid dataset uri)
OUTPUT_DIR=(I put my output dir here)
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
"https://europe-west4-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/europe-west4/pipelineJobs?pipelineJobId=tune-large-model-$(date +%Y%m%d%H%M%S)" -d \
$'{
"displayName": "DisplayName",
"runtimeConfig": {
"gcsOutputDirectory": "'${OUTPUT_DIR}'",
"parameterValues": {
"project": "'${PROJECT_ID}'",
"model_display_name": "ModelName",
"dataset_uri": "'${DATASET_URI}'",
"location": "us-central1",
"large_model_reference": "text-bison@001",
"train_steps": 500
}
},
"templateUri": "https://us-kfp.pkg.dev/ml-pipeline/large-language-model-pipelines/tune-large-model/v1.0.0"
}'
This command returned the following output:
{
"error": {
"code": 500,
"message": "Internal error encountered.",
"status": "INTERNAL"
}
}
It turns out I finally figured out what was causing the issue with Vertex AI language model tuning and training.
The problem was that the default Compute Engine service account had been deleted from my project, which caused the tuning job to fail. The account had been deleted more than 30 days earlier, so I could no longer restore it. The deletion affected various operations within Google Cloud Platform and led to the internal error during tuning.
Unfortunately, there was no way to bring back the deleted default service account, so the project was stuck with this problem. I therefore decided to start a new project from scratch, in which the default service account is intact.
For others facing a similar issue, I'd strongly advise caution when handling service accounts and other critical components of your projects. Make sure you don't delete the default service account or make other changes that could have unintended consequences down the road.
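To check whether a project is affected, a rough sketch like the one below can confirm whether the default Compute Engine service account still exists. It assumes gcloud is installed and authenticated, and that the account uses the standard PROJECT_NUMBER-compute@developer.gserviceaccount.com address. The function names default_compute_sa_email and check_default_compute_sa are hypothetical helpers made up for this illustration, not part of gcloud.

```shell
# Sketch only: the function names here are hypothetical, not gcloud commands.

# Build the default Compute Engine service account email from a project number.
default_compute_sa_email() {
  printf '%s-compute@developer.gserviceaccount.com\n' "$1"
}

# Report whether the default Compute Engine service account exists.
# Returns 0 if found, 1 if the describe call fails, 2 if the project lookup fails.
check_default_compute_sa() {
  project_id="$1"
  # Look up the numeric project number that the default account name is based on.
  project_number="$(gcloud projects describe "$project_id" \
    --format='value(projectNumber)')" || return 2
  sa_email="$(default_compute_sa_email "$project_number")"
  if gcloud iam service-accounts describe "$sa_email" \
      --project "$project_id" >/dev/null 2>&1; then
    echo "found: $sa_email"
  else
    echo "missing or not accessible: $sa_email"
    return 1
  fi
}

# Usage (replace the placeholder with your own project ID):
#   check_default_compute_sa "my-project-id"
```

A "missing or not accessible" result can also come from insufficient IAM permissions, so it is worth confirming in the IAM page of the console before concluding that the account was deleted.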
Now, with the new project in place and all necessary service accounts properly configured, I can run the tuning and training processes without any disruptions. I hope this helps anyone else facing the same problem.