INVALID_ARGUMENT error when pushing a new unstructured document into a data store

Hi all, I am new to Google Vertex AI Search. As a hands-on exercise, I am trying to create a new unstructured document (a PDF in my case) and push it to an existing data store that already contains some unstructured documents.

When I call the following API from Postman: https://discoveryengine.googleapis.com/v1alpha/projects/548443691128/locations/global/collections/default_collection/dataStores/kaggle-movie_1698160258041/branches/0/documents

with documentId = 01 added in the query params, I get an INVALID_ARGUMENT error.

Below is the unstructured document that I created, following the format described at:

https://cloud.google.com/discovery-engine/docs/reference/rest/v1alpha/projects.locations.collections.dataStores.branches.documents#content

{
    "name": "projects/548443691128/locations/global/collections/default_collection/dataStores/alphabet-investor_1698161197344/branches/0/documents/01",
    "id": "01",
    "schemaId": "default_schema",
    "structData": {},
    "parentDocumentId": "01",
    "content": {
        "mimeType": "application/pdf",
        "uri": "gs://personal-beta/testing-doc/Global iJobs Policy.pdf"
    }
}

Here, the file Global iJobs Policy.pdf is present in the Google Cloud Storage bucket. I think there might be a problem with the format of the document, but I am not able to figure it out.

I tried following the official documentation but did not find any hint about what could have gone wrong.

There are 2 answers

Holt Skinner (best answer)

Here's the REST API sample for how to import data from Cloud Storage after creating a data store using the Cloud Console.

You will need to use the import method of the projects.locations.collections.dataStores.branches.documents resource.

https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#discoveryengine_v1_generated_DocumentService_ImportDocuments_sync-drest
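
As a rough illustration of what calling the import method (rather than creating a single document) might look like from Postman, the sketch below only builds the URL and JSON body for a documents:import call. The helper name `build_import_call` is hypothetical, the project and data store IDs are copied from the question, and the v1alpha version is assumed from the question's URL. It does not send anything or handle authentication.

```python
import json

API_ROOT = "https://discoveryengine.googleapis.com"


def build_import_call(project, data_store, gcs_uris, api_version="v1alpha"):
    """Return (url, body) for a documents:import request on branch 0.

    Hypothetical helper for illustration; it only assembles the request.
    """
    parent = (
        f"projects/{project}/locations/global/collections/default_collection"
        f"/dataStores/{data_store}/branches/0"
    )
    url = f"{API_ROOT}/{api_version}/{parent}/documents:import"
    body = {
        "gcsSource": {
            "inputUris": list(gcs_uris),
            # "content" treats each matched file (for example a PDF) as one document.
            "dataSchema": "content",
        },
        "reconciliationMode": "INCREMENTAL",
    }
    return url, body


url, body = build_import_call(
    "548443691128",
    "kaggle-movie_1698160258041",
    ["gs://personal-beta/testing-doc/Global iJobs Policy.pdf"],
)
print("POST", url)
print(json.dumps(body, indent=2))
```

The printed URL and body can be pasted into Postman as a POST request with a valid OAuth bearer token in the Authorization header.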

Meet Vora

The default data_schema is document, which expects the input documents in JSON format. From the documentation:

DATA_SCHEMA: Optional. Values are document, custom, csv, and content. The default is document.

https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#discoveryengine_v1_generated_DocumentService_ImportDocuments_sync-drest

So, to upload an unstructured document, change data_schema to content.

Here is a Python code snippet for your use case.

from google.cloud import discoveryengine

# `parent` is the branch resource name of the data store, ending in /branches/0.
if gcs_uri:
    # Cloud Storage source: "content" imports each matched file (e.g. a PDF) as one document.
    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            input_uris=[gcs_uri], data_schema="content"
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )
else:
    # BigQuery source: rows are structured data, so use "document" or "custom"
    # here rather than "content".
    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        bigquery_source=discoveryengine.BigQuerySource(
            project_id=project_id,
            dataset_id=bigquery_dataset,
            table_id=bigquery_table,
            data_schema="custom",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )
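
The snippet above only builds the request. A minimal sketch of submitting it and waiting for the result could look like the following, assuming the google-cloud-discoveryengine package is installed and application default credentials are configured. The function names `branch_path` and `import_unstructured_docs` are hypothetical helpers, not part of the library.

```python
def branch_path(project_id, location, data_store_id, branch="0"):
    """Build the branch resource name used as `parent` in the import request."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/collections/default_collection/dataStores/{data_store_id}"
        f"/branches/{branch}"
    )


def import_unstructured_docs(project_id, location, data_store_id, gcs_uri):
    """Import the files matched by `gcs_uri` and wait for the result."""
    # Imported lazily so the path helper above works without the client library.
    from google.cloud import discoveryengine

    client = discoveryengine.DocumentServiceClient()
    request = discoveryengine.ImportDocumentsRequest(
        parent=branch_path(project_id, location, data_store_id),
        gcs_source=discoveryengine.GcsSource(
            input_uris=[gcs_uri], data_schema="content"
        ),
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )
    # import_documents starts a long-running operation; result() blocks until it finishes.
    operation = client.import_documents(request=request)
    return operation.result()


# Only the path helper runs here; the import itself needs credentials.
print(branch_path("548443691128", "global", "kaggle-movie_1698160258041"))
```

Calling `import_unstructured_docs("548443691128", "global", "kaggle-movie_1698160258041", "gs://personal-beta/testing-doc/Global iJobs Policy.pdf")` would then import the PDF from the question, provided the caller has access to both the bucket and the data store.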