Context: I am using DataFusion to build a data validator for CSV file input.
Requirement: I want to include the row number where each error occurred in the output report. In pandas, I can add a row index that serves this purpose. Is there a way to achieve a similar result in DataFusion?
There doesn't appear to be an easy way to do this within DataFusion once it has opened the CSV file. Instead, you could open the CSV file directly with arrow, produce a new RecordBatch that incorporates the index column, and then feed that to DataFusion using a MemTable. Here's an example, assuming we are only processing one batch ...
My example.csv looks like this ...
And the output should be ...
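The example assumes a single batch. If the CSV reader yields several batches, the index has to continue from one batch to the next rather than restart at zero. Below is a minimal sketch of that bookkeeping in plain Rust, using only the standard library. The function names `row_index_column` and `index_batches` are hypothetical, the batch sizes are made up for illustration, and the index is zero-based like the default pandas index.

```rust
/// Builds the row-index values for one batch.
/// `offset` is the number of rows already seen in earlier batches.
fn row_index_column(offset: u64, num_rows: usize) -> Vec<u64> {
    (offset..offset + num_rows as u64).collect()
}

/// Assigns a contiguous, zero-based index across all batches,
/// given only the number of rows in each batch.
fn index_batches(batch_sizes: &[usize]) -> Vec<Vec<u64>> {
    let mut offset = 0u64;
    let mut out = Vec::with_capacity(batch_sizes.len());
    for &n in batch_sizes {
        out.push(row_index_column(offset, n));
        // Carry the running row count into the next batch.
        offset += n as u64;
    }
    out
}

fn main() {
    // Three hypothetical batches of 3, 2 and 4 rows.
    for (i, idx) in index_batches(&[3, 2, 4]).iter().enumerate() {
        println!("batch {}: {:?}", i, idx);
    }
}
```

With arrow, each `Vec<u64>` could then be turned into a column with `UInt64Array::from(values)` and added to the corresponding RecordBatch before it goes into the MemTable. If the report should show 1-based line numbers, or account for the header line, adjust the starting offset accordingly.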
Though if you're really just looking for a crate with pandas-like functionality, I'd urge you to check out polars.