Validate JSON Lines from public YAML Schema — Preparing Data for AutoML Entity Extraction

I am trying to create an entity extraction model with Google's AutoML, and I am already stuck at the data-preparation step. Could someone explain how we are supposed to validate JSON (or JSON Lines) files against a schema that is defined in a YAML file?

Google's documentation specifies that input files for entity extraction must be JSON Lines, and that "the format, field names, and value types for JSON Lines files are determined by a schema file, which are publicly accessible YAML files." This makes a lot of sense to me: the model needs the data prepared in a certain way before it can process it. However, I cannot wrap my head around how to create this JSON Lines file according to the schema using VS Code.
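As I understand it, JSON Lines just means one complete JSON object per line. Something like this minimal, illustrative snippet (file name and record fields are made up) is what I picture:

```python
import json

# Illustrative only: serialize each record as one JSON object per line
records = [{"text": "first tweet"}, {"text": "second tweet"}]
with open("example.jsonl", "w") as fout:
    for record in records:
        fout.write(json.dumps(record) + "\n")
```

The open question for me is what fields each of those objects is supposed to contain according to Google's schema.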

After reading the schema specification, I formatted my data in a way that seemed reasonable and in line with the YAML schema, in which `content` is the only required property:

{"type": "object", "properties": {"type": {"type": "string", "enum": ["textContent"]}, "textContent": {"type": "string", "description": "Use code redacted tweet"}}, "discriminator": {"propertyName": "type"}}
{"type": "object", "properties": {"type": {"type": "string", "enum": ["textContent"]}, "textContent": {"type": "string", "description": "Use code redacted tweet 2"}}, "discriminator": {"propertyName": "type"}}

For reference, here is the public YAML schema I was working from:
title: TextExtraction
description: >
  Import and export format for importing/exporting text together with text
  segment annotations. Can be used in Dataset.import_schema_uri field.
type: object
required:
- content
properties:
  content:
    oneOf:
    - type: object
      properties:
        type:
          type: string
          enum: [textContent]
        textContent:
          type: string
          description: Full length text content. Up to 10MB in size.
    - type: object
      properties:
        type:
          type: string
          enum: [textGcsUri]
        textGcsUri:
          type: string
          description: >
            A Google Cloud Storage URI pointing to a text file. Up to 10MB in
            size. Supported file mime types: `text/plain`.
    discriminator:
      propertyName: type

Sadly, I made this without any schema validation from the IDE, because I have not been able to figure out how to do that. So, it obviously was not accepted by Google.
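One programmatic approach I have been considering (untested against Google's actual importer) is to load the YAML schema with PyYAML and validate each JSONL record against it using the third-party `jsonschema` package. In this sketch the schema is a trimmed, inlined excerpt of the YAML above; normally you would download it first, e.g. with `gsutil cp gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml .`:

```python
import json
import yaml                                        # pip install pyyaml
from jsonschema import validate, ValidationError   # pip install jsonschema

# Trimmed excerpt of the public schema, inlined here for illustration only
SCHEMA_YAML = """
type: object
required:
- content
properties:
  content:
    oneOf:
    - type: object
      properties:
        type:
          type: string
          enum: [textContent]
        textContent:
          type: string
    - type: object
      properties:
        type:
          type: string
          enum: [textGcsUri]
        textGcsUri:
          type: string
"""
schema = yaml.safe_load(SCHEMA_YAML)

def check_line(line: str) -> bool:
    """Return True if one JSONL line parses and conforms to the schema."""
    try:
        validate(instance=json.loads(line), schema=schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```

With this, `check_line('{"content": {"type": "textContent", "textContent": "hi"}}')` should pass while a line missing `content` should fail, but I do not know whether this is the intended way to consume these schemas.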

I tried to edit the settings.json file like this:

{
    "files.autoSave": "afterDelay",
    "markdown-preview-enhanced.previewTheme": "one-dark.css",
    "markdown-preview-enhanced.codeBlockTheme": "one-dark.css",
    "redhat.telemetry.enabled": true,
    "yaml.schemas": {
        "gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml." : ["googleSchema.yaml"]
    },
    "json.schemas": [{
        "url": "gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml"
    }]
}

But neither the JSON Lines file nor the plain JSON file (a second file containing the same data in regular JSON format) shows me the option to add a schema.
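One thing I suspect (unverified) is that VS Code's language services cannot fetch `gs://` URIs at all, and that `json.schemas` entries need a `fileMatch` list plus a schema in JSON format. So perhaps downloading the YAML, converting it to JSON (the local file name below is hypothetical), and pointing at the local copy would work:

```json
{
    "json.schemas": [
        {
            "fileMatch": ["/data.jsonl", "/outputfile.json"],
            "url": "./text_extraction_schema.json"
        }
    ]
}
```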

Image: VS Code not showing any options for schema validation

So, can someone help me understand how we are supposed to use these schemas provided by cloud service providers, how to set them up in VS Code, and/or how to use them to validate my data? I would greatly appreciate any best-practice resources or comments. I just want to learn how to do things the right way and become a better engineer.

EXTRA DETAILS:

CODE:

from secret import secret
import json
import asyncio
from twscrape import API, gather


async def main():
    api = API()

    # add accounts
    await api.pool.add_account(secret.user1, secret.password1, secret.user1email, secret.password1email)

    await api.pool.login_all()

    pm = await gather(api.search("postmates code -delivery -hungry -euro -first -new", limit=100))
    ub = await gather(api.search("uber eats code -delivery -hungry -euro -first -new -hallo", limit=100)) 

    return [(tweet.rawContent, tweet.date) for tweet in pm] + [(tweet.rawContent, tweet.date) for tweet in ub]

# Top-level await is only valid in a notebook/IPython; in a plain
# script, drive the coroutine with asyncio.run instead
res = asyncio.run(main())

res_as_json = [{"text": text, "date": dt.isoformat()} for text, dt in res]

with open('outputfile.json', 'w') as fout:
    json.dump(res_as_json, fout)

# Read the file back with a context manager so it is closed properly
with open('outputfile.json') as f:
    data = json.load(f)

import jsonlines
schema_data = []
for item in data:
    schema_item = {
        "type": "object",
        "properties": {
            "type": {
                "type": "string",
                "enum": ["textContent"]
            },
            "textContent": {
                "type": "string",
                "description": item['text']
            }
        },
        "discriminator":{
            "propertyName": "type"
        }
    }
    schema_data.append(schema_item)

with jsonlines.open("data.jsonl", mode="w") as writer:
    for item in schema_data:
        writer.write(item)

ERRORS:

Error: Index file gs://cloud-ai-platform-8bbb8f31-0e76-4c82-9572-753eeb190ba4/data.jsonl line 105 parsing error: Message type "google.cloud.aiplatform.master.schema.TextExtractionIoFormat" has no field named "type" at "TextExtractionIoFormat". Available Fields(except extensions): "['textContent', 'textGcsUri', 'languageCode', 'textSegmentAnnotations', 'dataItemResourceLabels']" for:

(The same parsing error repeats for lines 54, 57, 70, 97, 187, 193, and 223 of data.jsonl.)
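If I am reading the error correctly, each JSONL line should perhaps be a plain record using the fields the message lists, rather than a copy of the schema definition itself. My guess (unconfirmed) at what a single line should look like:

```python
import json

# My guess, based on the fields the error lists: a plain record with
# fields like textContent and languageCode, not a schema fragment
record = {
    "textContent": "example tweet text",
    "languageCode": "en",
}
line = json.dumps(record)
print(line)
```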

Also, if you read this and have any comments on my project I'd love to chat. I am trying to extract promo codes from tweets using entity extraction.
