How can I create a new column in a dataframe by using a function in Rust (Polars)?

I have the two functions below: unlevered_beta_f and industry_total_beta_f. The second one uses Polars in Rust to manipulate a DataFrame that is read from a CSV file. I want to create a new column using the first function, but I am not sure how to do it.

pub fn unlevered_beta_f(
    levered_beta: f32,
    de_ratio: f32,
    marginal_tax_rate: Option<f32>,
    effective_tax_rate: f32,
    cash_firm_value: f32,
) -> Option<f32> {
    // Do you want to use marginal or effective tax rates in unlevering betas?
    // If marginal, pass the marginal tax rate to use.

    let tax_rate = tax_rate_f(marginal_tax_rate, effective_tax_rate);
    let mut unlevered_beta = levered_beta / (1.0 + (1.0 - tax_rate) * de_ratio);
    unlevered_beta = unlevered_beta / (1.0 - cash_firm_value);
    return Some(unlevered_beta);
}
pub fn industry_total_beta_f(raw_data: DataFrame) -> DataFrame {
    let df = raw_data
        .clone()
        .lazy()
        .with_columns([unlevered_beta_f(
            col("Average of Beta"),
            col("Sum of Total Debt incl leases (in US $)") / col("Sum of Market Cap (in US $)"),
            marginal_tax_rate = marginal_tax_rate,
            col("Average of Effective Tax Rate"),
            col("Sum of Cash") / col("Sum of Firm Value (in US $)"),
        )
        .alias("Average Unlevered Beta")])
        .with_columns([
            (col("Average Unlevered Beta") / col("Average of Correlation with market"))
                .alias("Total Unlevered Beta"),
            (col("Average of Beta") / col("Average of Correlation with market"))
                .alias("Total Levered Beta"),
        ])
        .select([
            col("Industry Name"),
            col("Number of firms"),
            col("Average Unlevered Beta"),
            col("Average of Beta"),
            col("Average of Correlation with market"),
            col("Total Unlevered Beta"),
            col("Total Levered Beta"),
        ])
        .collect()
        .unwrap();
    return df;
}

I tried the code above, and everything works except for the following section:

.with_columns([unlevered_beta_f(
    col("Average of Beta"),
    col("Sum of Total Debt incl leases (in US $)") / col("Sum of Market Cap (in US $)"),
    marginal_tax_rate = marginal_tax_rate,
    col("Average of Effective Tax Rate"),
    col("Sum of Cash") / col("Sum of Firm Value (in US $)"),
)
.alias("Average Unlevered Beta")])

I want to create a column called "Average Unlevered Beta" whose inputs are the columns shown above, which are read from a CSV file. Elsewhere in the code I successfully create new columns from expressions, but I am not sure how to do it by calling a function.

Answer by jvanbuel (best answer):

A general remark: if you can make use of the polars Expression system, do that instead. It results in much more readable code, and it is also slightly more performant for larger numbers of records (I did some quick benchmarks, see below).

If you can't (for example, because the tax_rate_f function in your example cannot be written as a polars Expression), you can apply a function to a subset of columns by combining as_struct with map, as explained in another SO question. Note that I use a third-party dependency here, itertools, to iterate easily over multiple zipped iterators.

Based on the comments you included in your code, I assumed a very simple implementation of the tax_rate_f function. I also implemented both unlevered_beta_f and tax_rate_f as functions that return polars Expressions, to show the difference in complexity.
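For reference, this is the quantity that both implementations below compute, read directly off the code. The symbols are my own shorthand and are not part of the original question: β_L is the levered beta, D/E is de_ratio (total debt over market cap), c is cash_firm_value (cash over firm value), and t is the tax rate chosen by tax_rate_f.

```latex
\beta_U = \frac{\beta_L}{\bigl(1 + (1 - t)\,\frac{D}{E}\bigr)\,(1 - c)},
\qquad
t =
\begin{cases}
  t_{\text{marginal}}  & \text{if a marginal rate is supplied} \\
  t_{\text{effective}} & \text{otherwise}
\end{cases}
```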

use itertools::izip;
use polars::{
    lazy::dsl::{as_struct, GetOutput},
    prelude::*,
};
use rand::{distributions::Uniform, Rng};

const NUMBER_OF_RECORDS: usize = 10000;

pub fn unlevered_beta_f(
    levered_beta: f32,
    de_ratio: f32,
    marginal_tax_rate: Option<f32>,
    effective_tax_rate: f32,
    cash_firm_value: f32,
) -> Option<f32> {
    // Do you want to use marginal or effective tax rates in unlevering betas?
    // If marginal, pass the marginal tax rate to use.
    let tax_rate = tax_rate_f(marginal_tax_rate, effective_tax_rate);
    let mut unlevered_beta = levered_beta / (1.0 + (1.0 - tax_rate) * de_ratio);
    unlevered_beta /= 1.0 - cash_firm_value;
    Some(unlevered_beta)
}

pub fn tax_rate_f(marginal_tax_rate: Option<f32>, effective_tax_rate: f32) -> f32 {
    match marginal_tax_rate {
        Some(marginal_tax_rate) => marginal_tax_rate,
        None => effective_tax_rate,
    }
}

pub fn tax_rate_f_expr(marginal_tax_rate: Expr, effective_tax_rate: Expr) -> Expr {
    when(marginal_tax_rate.clone().is_not_null())
        .then(marginal_tax_rate)
        .otherwise(effective_tax_rate)
}

pub fn unlevered_beta_f_expr(
    levered_beta: Expr,
    de_ratio: Expr,
    marginal_tax_rate: Expr,
    effective_tax_rate: Expr,
    cash_firm_value: Expr,
) -> Expr {
    let tax_rate = tax_rate_f_expr(marginal_tax_rate, effective_tax_rate);
    let unlevered_beta = levered_beta / (lit(1.0) + (lit(1.0) - tax_rate) * de_ratio);
    unlevered_beta / (lit(1.0) - cash_firm_value)
}

fn main() -> Result<(), PolarsError> {
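    // `get_df` is not shown here; it is assumed to build the input DataFrame
    // with the columns referenced below.
    let df = get_df()?;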

    let enriched_df = df.clone().lazy().with_column(
        as_struct(vec![
            col("Average of Beta"),
            col("Sum of Total Debt incl leases (in US $)"),
            col("Sum of Market Cap (in US $)"),
            col("Average of Effective Tax Rate"),
            col("Sum of Cash"),
            col("Sum of Firm Value (in US $)"),
        ])
        .map(
            |s| {
                let cols = s.struct_()?;
                let avg_beta = cols.field_by_name("Average of Beta")?;
                let avg_beta = avg_beta.f32()?;
                let sum_debt = cols.field_by_name("Sum of Total Debt incl leases (in US $)")?;
                let sum_debt = sum_debt.f32()?;
                let sum_mkt_cap = cols.field_by_name("Sum of Market Cap (in US $)")?;
                let sum_mkt_cap = sum_mkt_cap.f32()?;
                let avg_tax_rate = cols.field_by_name("Average of Effective Tax Rate")?;
                let avg_tax_rate = avg_tax_rate.f32()?;
                let sum_cash = cols.field_by_name("Sum of Cash")?;
                let sum_cash = sum_cash.f32()?;
                let sum_firm_value = cols.field_by_name("Sum of Firm Value (in US $)")?;
                let sum_firm_value = sum_firm_value.f32()?;

                let zipped_iterables = izip!(
                    avg_beta,
                    sum_debt,
                    sum_mkt_cap,
                    avg_tax_rate,
                    sum_cash,
                    sum_firm_value
                );

                let x: ChunkedArray<Float32Type> = zipped_iterables
                    .map(
                        |(
                            avg_beta,
                            sum_debt,
                            sum_mkt_cap,
                            avg_tax_rate,
                            sum_cash,
                            sum_firm_value,
                        )| {
                            if let (
                                Some(avg_beta),
                                Some(sum_debt),
                                Some(sum_mkt_cap),
                                Some(avg_tax_rate),
                                Some(sum_cash),
                                Some(sum_firm_value),
                            ) = (
                                avg_beta,
                                sum_debt,
                                sum_mkt_cap,
                                avg_tax_rate,
                                sum_cash,
                                sum_firm_value,
                            ) {
                                unlevered_beta_f(
                                    avg_beta,
                                    sum_debt / sum_mkt_cap,
                                    None,
                                    avg_tax_rate,
                                    sum_cash / sum_firm_value,
                                )
                            } else {
                                None
                            }
                        },
                    )
                    .collect();

                Ok(Some(x.into_series()))
            },
            GetOutput::from_type(DataType::Float32),
        )
        .alias("Average Unlevered Beta"),
    );

    println!("{:?}", enriched_df.collect());

    let better_df = df
        .clone()
        .lazy()
        .with_column(lit(NULL).alias("Marginal Tax Rate"))
        .with_column(unlevered_beta_f_expr(
            col("Average of Beta"),
            col("Sum of Total Debt incl leases (in US $)") / col("Sum of Market Cap (in US $)"),
            col("Marginal Tax Rate"),
            col("Average of Effective Tax Rate"),
            col("Sum of Cash") / col("Sum of Firm Value (in US $)"),
        ).alias("Average Unlevered Beta"))
        .collect();

    println!("{:?}", better_df);

    Ok(())
}
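
The null handling inside the map closure above can be looked at in isolation, without polars. What follows is only a minimal sketch under stated assumptions: each column is modelled as a slice of Option<f32> (None standing in for a null cell), the tax rate is taken as already chosen, and std's zip is swapped in for itertools' izip!. The names unlever and unlever_rows are hypothetical and do not come from polars or from the question.

```rust
// Polars-free illustration of the row-wise pattern used in the closure above.
// `unlever` and `unlever_rows` are hypothetical names for this sketch only.

/// Same arithmetic as `unlevered_beta_f`, with the tax rate already selected.
fn unlever(levered_beta: f32, de_ratio: f32, tax_rate: f32, cash_firm_value: f32) -> f32 {
    let beta = levered_beta / (1.0 + (1.0 - tax_rate) * de_ratio);
    beta / (1.0 - cash_firm_value)
}

/// Applies `unlever` row by row. A row produces `Some(..)` only when every
/// input cell is present; a single missing input makes the result `None`.
fn unlever_rows(
    betas: &[Option<f32>],
    de_ratios: &[Option<f32>],
    tax_rates: &[Option<f32>],
    cash_ratios: &[Option<f32>],
) -> Vec<Option<f32>> {
    betas
        .iter()
        .zip(de_ratios)
        .zip(tax_rates)
        .zip(cash_ratios)
        // Nested tuples are what chained `zip` calls yield; `izip!` flattens them.
        .map(|(((b, d), t), c)| match (b, d, t, c) {
            (Some(b), Some(d), Some(t), Some(c)) => Some(unlever(*b, *d, *t, *c)),
            _ => None,
        })
        .collect()
}
```

Calling unlever_rows with one complete row and one row whose beta is missing returns a value for the first and None for the second, which is the same behaviour the if let / else branch gives in the polars closure.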

I benchmarked both approaches using the divan crate, and got the following result for 10M records:

[Image: divan benchmark of the two different approaches]

As you can see, the approach using polars' Expression syntax is slightly faster. For smaller numbers of records, it is actually the other way around. I'm not familiar enough with the internals of polars to explain this observation. Do take these benchmarks with a grain of salt: the random DataFrame generation is part of the benchmark, but I assume the time spent on it is similar for both approaches.