I can run CsvExampleGen without an error message, but the outputs (and inputs) of the resulting Examples are always empty.
I am using tfx==0.24.0.
To use CsvExampleGen for reading CSV files, according to the documentation and tutorials (incl. https://www.tensorflow.org/tfx/guide/examplegen ) and the release notes for tfx 0.23.0/0.24.0 ( https://github.com/tensorflow/tfx/releases ), the following lines of code should suffice to read a CSV file:
from tfx.components import CsvExampleGen
example_gen = CsvExampleGen(input_base=data_path)
where "data_path" identifies a directory with CSV files. (Note that the code differs from the official documentation in that it does not use "external_input"; instead it follows the new interface documented in the release notes for 0.23.0.)
From the tutorials I gather that a single, simple CSV file should suffice for testing (though I tried with up to 7 files).
I do not get any error message (except for one which I am told to ignore if I don't have a GPU available); however, the outputs (and inputs) of the resulting structure are empty (empty list and empty set / dict, respectively). I think they should not be empty, however.
The CSV files in question ARE found and touched, because if I introduce an error there (like an additional column in one row), I do get an error message.
I tried this with a stand-alone function as well as inside a pipeline (run with BeamDagRunner, for simplicity). The pipeline does generate a metadata.db, but I cannot find any trace of the CSV data there (like column names). Adding a StatisticsGen to the pipeline didn't help any further.
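As a way to check what the metadata.db actually contains, here is a minimal sketch that lists every table in a SQLite file together with its row count. It uses only the standard library and makes no assumption about the table names; the helper name `summarize_sqlite_tables` is made up for illustration. As far as I understand, the metadata store records artifact and execution metadata (such as output locations) rather than the CSV contents, so column names would not be expected to show up there.

```python
import sqlite3


def summarize_sqlite_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file.

    Hypothetical helper for inspecting a metadata.db; it does not rely on
    any particular schema.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalogue of tables and indexes.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Quote each table name, since it is interpolated into the query.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling `summarize_sqlite_tables(METADATA_PATH)` after a pipeline run should show whether any rows were written at all.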
I tried this with the iris dataset, with and without column headers. I also tried with up to 7 small, artificial CSV files within data_path, alternatively with purely numerical and mixed numerical/categorical data, and alternatively with commas and semicolons as separators. The result is always the same.
Do I have a problem with the code, or maybe with some configuration or libraries?
Here is the full code (as far as possibly relevant):
import os
from typing import List, Optional, Text

from ml_metadata.proto import metadata_store_pb2
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

PIPELINE_NAME = "X-pipeline-iris2"
BASE_PATH = r"C:\***\FX_Experiments"
BASE_PATH_PIPELINE = os.path.join(BASE_PATH, "pipeline")
BASE_PATH_TESTS = os.path.join(BASE_PATH, "tests")
PIPELINE_ROOT = os.path.join(BASE_PATH_PIPELINE, "output")
METADATA_PATH = os.path.join(BASE_PATH_PIPELINE, "tfx_metadata", PIPELINE_NAME, "metadata.db")
DATA_PATH = os.path.join(BASE_PATH_TESTS, "iris2")
ENABLE_CACHE = True


def create_pipeline(
    pipeline_name: Text,
    pipeline_root: Text,
    data_path: Text,
    enable_cache: bool,
    metadata_connection_config: Optional[metadata_store_pb2.ConnectionConfig] = None,
    beam_pipeline_args: Optional[List[Text]] = None,
):
    components = []

    # Read all CSV files found in the data_path directory.
    example_gen = CsvExampleGen(input_base=data_path)
    components.append(example_gen)

    # Compute statistics over the examples produced above.
    stat_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    components.append(stat_gen)

    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        enable_cache=enable_cache,
        metadata_connection_config=metadata_connection_config,
        beam_pipeline_args=beam_pipeline_args,
    )


def run_pipeline():
    this_pipeline = create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_path=DATA_PATH,
        enable_cache=ENABLE_CACHE,
        metadata_connection_config=metadata.sqlite_metadata_connection_config(METADATA_PATH),
    )
    BeamDagRunner().run(this_pipeline)
Also potentially useful, the logger output:
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Component CsvExampleGen depends on [].
INFO:absl:Component CsvExampleGen is scheduled.
INFO:absl:Component StatisticsGen depends on ['Run[CsvExampleGen]'].
INFO:absl:Component StatisticsGen is scheduled.
INFO:absl:Component CsvExampleGen is running.
INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component CsvExampleGen is finished.
INFO:absl:Component StatisticsGen is running.
...
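Since the log reports both components as finished, one way to check whether anything was written is to list the files below the pipeline root. The following is a minimal standard-library sketch; the helper name `list_output_files` is made up for illustration, and it makes no assumption about how TFX lays out its output directories.

```python
import os


def list_output_files(pipeline_root):
    """Return sorted (relative_path, size_in_bytes) pairs for all files below pipeline_root.

    Hypothetical helper: it simply walks the directory tree, so it works
    regardless of how the components organise their outputs.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(pipeline_root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            relative_path = os.path.relpath(full_path, pipeline_root)
            found.append((relative_path, os.path.getsize(full_path)))
    return sorted(found)
```

Calling `list_output_files(PIPELINE_ROOT)` after `run_pipeline()` should show non-empty files if CsvExampleGen actually materialised the examples.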
Felix, if you are following the guides, you are probably running your code in a notebook. If you want to see the results directly, you have to run the components interactively using InteractiveContext:
https://www.tensorflow.org/tfx/api_docs/python/tfx/orchestration/experimental/interactive/interactive_context/InteractiveContext
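A minimal sketch of what that could look like, assuming tfx 0.24.0 is installed and the code runs in a notebook. The function name `run_example_gen_interactively` is made up for illustration, and the tfx imports sit inside the function so that defining it does not itself require tfx. The idea is that the output channel of a component is only filled in once the component has been executed through the interactive context.

```python
def run_example_gen_interactively(data_path):
    """Run CsvExampleGen through InteractiveContext and return its output artifacts.

    Hypothetical helper; data_path is a directory containing CSV files.
    """
    # Local imports: these require tfx to be installed.
    from tfx.components import CsvExampleGen
    from tfx.orchestration.experimental.interactive.interactive_context import (
        InteractiveContext,
    )

    # With no arguments, the context creates its own temporary pipeline root
    # and metadata store.
    context = InteractiveContext()

    example_gen = CsvExampleGen(input_base=data_path)
    context.run(example_gen)

    # After context.run(), the 'examples' channel should hold the produced artifacts.
    return example_gen.outputs['examples'].get()
```

In a notebook, something like `artifacts = run_example_gen_interactively(DATA_PATH)` followed by `print(artifacts[0].uri)` should then point at the directory where the examples were written.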