Find error record file while processing too many files in same bucket in apache beam java sdk


I have 20 CSV files in the same bucket. I am able to read all of the files in one go and load them into BigQuery. When there is a data type mismatch, I am able to route that row to invalidDataTag, but I am unable to find the name of the file that the error record came from.

inputFilePattern is gs://bucket-name/*, which picks up all the files present in the bucket. I am reading the files as below:

PCollection<String> sourceData = pipeline.apply(Constants.READ_CSV_STAGE_NAME, TextIO.read().from(options.getInputFilePattern()));

Is there a way I can find the name of the file that contains the error row?


There is 1 answer

Answer from Kenn Knowles:

My suggestion would be to add a column to the BigQuery table that indicates which file each record came from.
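To carry the file name through the pipeline, one known limitation is that `TextIO.read()` does not expose the source file of each line. A common workaround is to match the files with `FileIO` and read each file inside a `DoFn`, emitting the file name alongside every line. The sketch below illustrates this, assuming the same `options.getInputFilePattern()` from the question; the downstream parse step (not shown) could then include the file name in the rows it sends to `invalidDataTag` or write it to the extra BigQuery column.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ReadCsvWithFileName {

  /** Reads every file matching the pattern, pairing each line with its file name. */
  static PCollection<KV<String, String>> readWithFileNames(
      Pipeline pipeline, String inputFilePattern) {
    return pipeline
        // Match all files under the pattern, e.g. gs://bucket-name/*
        .apply("MatchFiles", FileIO.match().filepattern(inputFilePattern))
        // Turn match results into readable file handles
        .apply("ReadMatches", FileIO.readMatches())
        // Read each file line by line, tagging each line with its source file
        .apply("ReadLinesWithFileName",
            ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
              @ProcessElement
              public void process(
                  @Element FileIO.ReadableFile file,
                  OutputReceiver<KV<String, String>> out) throws IOException {
                String fileName = file.getMetadata().resourceId().toString();
                try (BufferedReader reader = new BufferedReader(
                    Channels.newReader(file.open(), "UTF-8"))) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                    // Key: source file name; value: one raw CSV line
                    out.output(KV.of(fileName, line));
                  }
                }
              }
            }));
  }
}
```

With this shape, a parse failure can report both the bad line and its file name, so the record landing in invalidDataTag (or the extra BigQuery column) identifies exactly which of the 20 files produced it. Note that, unlike `TextIO.read()`, each file here is read by a single worker, which can reduce parallelism for very large files.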