Can I run a data pipeline on files selected by name prefix within a directory in Data Fusion?


I'm using Google Cloud Data Fusion.

Assuming that the bucket's path is as follows:

test_buk/...

In the test_buk bucket there are four files:

20190901, 20190902

20191001, 20191002

Let's say there is a directory inside test_buk called dir.

The files whose names start with 201909 (e.g., 20190901, 20190902) form one prefix-based bundle.

The files whose names start with 201910 (e.g., 20191001, 20191002) form another.

I'd like to run the data pipeline for the 201909 and 201910 bundles.

Here's what I've tried:

I ran the data pipeline with the regex path filter set to gs://test_buk/dir//2019.

With this regex path filter in place, no input is read, and consequently there is no output.

How do I set up a Data Fusion pipeline that processes only a specific bundle of files within a directory?

There is 1 answer

rodvictor (best answer)

If you use the raw path (gs://test_buk/dir/) directly in the regex, you might be running into an error when escaping its special characters. That could be why no input file matches your filter and nothing reaches the pipeline.

I suggest that you instead use ".*" to match the initial part of the path (since you are also specifying the path, files in other folders will not match the filter).

Therefore, I would use the following expressions depending on the group of files you want to use (feel free to change the extension of the files):

path = gs://test_buk/dir/

regex path filter = .*201909.*\.csv or .*201910.*\.csv
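To check which files each filter would pick up, here is a minimal Python sketch. It rests on a few assumptions: the filter has to match the whole file path (which is why the leading ".*" is needed), Python's re module stands in for whatever regex engine Data Fusion uses, and the .csv file names are hypothetical. The select helper is only for illustration and is not part of Data Fusion.

```python
import re

# Hypothetical object paths; the .csv extension is an assumption.
paths = [
    "gs://test_buk/dir/20190901.csv",
    "gs://test_buk/dir/20190902.csv",
    "gs://test_buk/dir/20191001.csv",
    "gs://test_buk/dir/20191002.csv",
]


def select(pattern, candidates):
    """Return the candidates whose entire path matches the pattern."""
    compiled = re.compile(pattern)
    return [p for p in candidates if compiled.fullmatch(p)]


# One filter per bundle, as suggested in the answer.
september = select(r".*201909.*\.csv", paths)
october = select(r".*201910.*\.csv", paths)

# A single alternation covers both bundles in one filter.
both = select(r".*2019(09|10).*\.csv", paths)

# The filter from the question, which does not match any full path.
raw_prefix = select(r"gs://test_buk/dir//2019", paths)

print("201909:", september)
print("201910:", october)
print("both:", both)
print("original filter:", raw_prefix)
```

Under these assumptions, the filter from the question selects nothing because it does not cover the full path, while the ".*"-prefixed filters select exactly one bundle each.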

If you would like to know more about the regex used, you can take a look at (1)