I have a CSV file with the following format:
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
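To make the layout explicit, here is a minimal sketch using only Python's standard library. The `sample` string is just the first row copied from above, and `shlex` is used purely for illustration; it is not how pandas or DataFusion parse the file. Each row holds eight numeric fields followed by one quoted car name:

```python
import shlex

# First row of the sample above, copied verbatim.
sample = '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"'

# shlex.split breaks on whitespace but keeps the quoted name as one token.
tokens = shlex.split(sample)

# Everything except the last token is numeric; the last token is the name.
numeric_fields = [float(t) for t in tokens[:-1]]
car_name = tokens[-1]

print(len(tokens))
print(numeric_fields)
print(car_name)
```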
I am able to read the CSV file with pandas as follows:
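import pandas as pd

# Path to the downloaded auto-mpg.data file (adjust as needed).
DATA_PATH = "data/landing/auto-mpg.data"

column_names = [
    'MPG', 'Cylinders', 'Displacement', 'Horsepower',
    'Weight', 'Acceleration', 'Model Year', 'Origin'
]
df = pd.read_csv(
    DATA_PATH,
    names=column_names,
    na_values="?",
    comment='\t',
    sep=" ",
    skipinitialspace=True
)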
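As far as I understand these options, `comment='\t'` cuts each line at the first tab, which drops the trailing car name (this assumes the raw file puts a tab before the name, even though the sample above shows spaces). `sep=" "` with `skipinitialspace=True` tolerates runs of spaces, and `na_values="?"` turns `?` into a missing value. A rough standard-library sketch of that behaviour, where `parse_line` is a hypothetical helper and not part of pandas:

```python
COLUMN_NAMES = [
    "MPG", "Cylinders", "Displacement", "Horsepower",
    "Weight", "Acceleration", "Model Year", "Origin",
]

def parse_line(line: str) -> dict:
    """Hypothetical helper that mimics the read_csv options above."""
    # comment='\t': ignore everything from the first tab onwards.
    data = line.split("\t", 1)[0]
    # sep=" " with skipinitialspace=True: split on runs of spaces.
    fields = data.split()
    # na_values="?": treat "?" as a missing value.
    values = [None if f == "?" else float(f) for f in fields]
    return dict(zip(COLUMN_NAMES, values))

# Illustrative row: the first sample row with the horsepower replaced
# by "?" and an assumed tab before the car name.
row = parse_line('18.0 8 307.0 ? 3504. 12.0 70 1\t"chevrolet chevelle malibu"')

print(row["Horsepower"])
print(row["Weight"])
```

This is why `column_names` lists only eight names: the ninth field never reaches the parser.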
Now I am trying to read the same dataset in DataFusion as follows:
use datafusion::prelude::*;

// Build CSV options for a headerless, space-delimited file.
fn get_csv_option<'a>() -> CsvReadOptions<'a> {
    let mut csv_opt = CsvReadOptions::new();
    csv_opt.has_header = false;
    csv_opt.delimiter = b' ';
    csv_opt
}

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let read_option = get_csv_option();
    let ctx = SessionContext::new();
    let df = ctx.read_csv("data/landing/auto-mpg.data", read_option).await?;
    // Print the inferred schema, then the rows.
    println!("{}", df.schema());
    df.show().await?;
    Ok(())
}
This reads no data at all: the schema is empty and no rows are shown. The full output is:
(mpg-car-pipeline-U1cqCC4U-py3.9) datapsycho@dataops ~/.../mpg-car-pipeline $ cargo run
Compiling mpg-car-pipeline v0.1.0 (/home/datapsycho/RustProjects/mpg-car-pipeline)
Finished dev [unoptimized + debuginfo] target(s) in 10.22s
Running `target/debug/mpg-car-pipeline`
fields:[], metadata:{}
++
++
How can I read the data in DataFusion with column names and a schema? I have looked at the API docs, but there are not enough examples for the CsvReadOptions struct. The data file can be downloaded with the following command:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data