How to read all columns as str with Polars in Rust?


How can I read all columns as strings with Polars in Rust? This is my code:

let df = CsvReader::from_path(filecsv)?
    .has_header(true)
    .finish()?;
Ok(df)

It reads some columns as int64, but I want to read all columns (whatever they are called) as strings. How can I do that?


There are 2 answers

0
BallpointBen On

If you want all columns to be strings, you can simply use infer_schema(Some(0)). This will use 0 rows to infer the schema, which results in all columns being “inferred” as strings (the default, most general type).

For LazyCsvReader the corresponding method would be with_infer_schema_length.

0
yyyz On

Here is my solution. It can only convert some (not all) fields to str, and you need to add the smartstring crate.

use polars::datatypes::DataType::Utf8;
use polars::prelude::*;
use smartstring::SmartString;
use std::sync::Arc;
fn main() {
    let mut schema = Schema::new();
    schema.with_column(SmartString::from("some_columns"), Utf8);
    let df_csv = CsvReader::from_path("some_input.csv")
        .unwrap()
        .infer_schema(None)
        .has_header(true)
        .with_dtypes(Some(Arc::new(schema)))
        .finish()
        .unwrap();
    println!("{}", df_csv);
}

=================================================

After further experimentation, I found another way. It requires the csv crate. First, read the CSV headers with rdr.headers() and map them into an iterator of fields. Then create the schema from that iterator with from_iter.

use polars::datatypes::DataType::Utf8;
use polars::prelude::*;
use std::sync::Arc;

fn main() {
    
    let mut rdr = csv::Reader::from_path("some_input.csv").unwrap();

    let column_names = rdr.headers().unwrap().iter().map(|item| Field::new(item,Utf8));
    let schema = Schema::from_iter(column_names);
    let df_csv = CsvReader::from_path("some_input.csv")
        .unwrap()
        .infer_schema(None)
        .has_header(true)
        .with_dtypes(Some(Arc::new(schema)))
        .finish()
        .unwrap();
    println!("{}", df_csv);
}

The drawback of this approach is that it requires reading the file twice, which can be inefficient. Unfortunately, I haven't found a suitable API to avoid this limitation at the moment.
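The extra pass only needs the header line, so it can at least be kept cheap. Here is a rough sketch of that idea using only the standard library. `read_header_names` and `split_header` are hypothetical helper names, and the naive split assumes a simple header with no commas or newlines inside quoted fields; a real CSV parser is safer for anything more complex.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Splits one header line into column names.
/// Assumption: no quoted field contains a comma or a newline.
fn split_header(line: &str) -> Vec<String> {
    line.trim_end_matches(&['\r', '\n'][..])
        .split(',')
        // Drop surrounding whitespace and plain double quotes around a name.
        .map(|name| name.trim().trim_matches('"').to_string())
        .collect()
}

/// Reads only the first line of the file and returns the column names.
fn read_header_names<P: AsRef<Path>>(path: P) -> io::Result<Vec<String>> {
    let file = File::open(path)?;
    let mut first_line = String::new();
    // `read_line` stops at the first newline, so the rest of the file is not parsed.
    BufReader::new(file).read_line(&mut first_line)?;
    Ok(split_header(&first_line))
}
```

The returned names could then be mapped to `Field::new(name, Utf8)` in the same way as the headers from `rdr.headers()` in the code above.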

Maybe there are better solutions...?