Exporting a filtered subset of GCP Natural Language annotator


I am building a training dataset on GCP's Natural Language AutoML Entity Extraction service. I have a fraction of my documents labeled and I want to export them to do some preliminary exploratory data analysis. I can add a filter to display "Labeled" docs, but if I try to export, it exports all my docs.

Is there any way to export only the documents that match the filter criteria? Doing it via the Python API would be fine too.


There is 1 answer

slakov On

Indeed, the Export Data link in the AutoML console always exports the complete dataset. There is no option to export only selected items, but there is an option to delete selected items. As a workaround, you can delete the unwanted items from a copy of the dataset.

I suggest the following steps:

  1. Export the complete dataset (so you don't delete anything from your production dataset).
  2. Create a New Dataset in your AutoML project, by importing the complete dataset from Step 1.
  3. In the new dataset, filter to show only the unlabeled documents.
  4. Select all and delete (this removes the unlabeled data from the copy only, not from the original dataset).

This way, the new dataset contains only the labeled documents, and you can run Export Data on it and use the resulting set for your EDA.
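Since a Python route was also acceptable, here is a minimal sketch of an alternative: filter the full export from Step 1 locally instead of deleting items in a copy. This is not an AutoML API call. It assumes the export consists of JSONL files in which each line is a JSON object with an "annotations" list that is empty or missing for unlabeled documents; check that against your actual export before relying on it. The function names and file paths are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_labeled(lines: Iterable[str], key: str = "annotations") -> Iterator[dict]:
    """Yield parsed JSONL records whose `key` field is present and non-empty.

    Assumption: labeled documents carry a non-empty "annotations" list.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(key):  # missing key or empty list counts as unlabeled
            yield record


def filter_jsonl(src_path: str, dst_path: str) -> int:
    """Write only the labeled records from src_path to dst_path; return the count kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in iter_labeled(src):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

For example, `filter_jsonl("export.jsonl", "labeled_only.jsonl")` (hypothetical paths, run on a local copy of an exported file) would leave you with a file you can load directly for EDA.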

Best regards!