How to insert columns in sub-tables for a grouped dataframe in `DataFrames.jl`?

What I want to do is:

  • Group a data-frame by column ("col1") in DataFrames.jl, call it grouped_df

  • Send the sub-tables from the grouped_df to a user defined function

    • The user-defined function will transform an existing column and should add a new column with relevant data

I want to transform an old column and add a new column in the same function because both actions depend on essentially the same computation, which is just recorded slightly differently.

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    
    current_scan::Float64 = calculate_current_scan(df)
    
    scan_difference::Float64 = current_scan - ref_scan
    
    df.scan = df.scan .- scan_difference

    insertcols!(df, ncol(df)+1, :shift .= scan_difference ) # `scan_difference` has already been calculated, so recording it here would be efficient, right?
    
    return df::AbstractDataFrame
end

As you can see, scan_difference is calculated for each group and applied to the existing column scan, but I also want to record it for each group. Intuitively, doing both in the same function seems like the more efficient approach.

But when I call combine(sdf -> apply_scan_difference(sdf, ref_scan), grouped_df) I get this error: MethodError: no method matching ndims(::Type{Symbol})

I tried changing the function to

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    
    current_scan::Float64 = calculate_current_scan(df)
    
    scan_difference::Float64 = current_scan - ref_scan
    
    df.scan = df.scan .- scan_difference

    insertcols!(df, ncol(df)+1, :shift => fill(scan_difference, nrow(df))) # `scan_difference` has already been calculated, so recording it here would be efficient, right?
    
    return df::AbstractDataFrame
end

I then called the function as transform(sdf -> apply_scan_difference(sdf, ref_scan), grouped_df), but that produces this error:

ArgumentError: Column shift is already present in the data frame which is not allowed when makeunique=true

It also applies the new column data to only one group (it looks like the first group).

I do not fully understand what it means to set makeunique=false.

What is the right way to do this in Julia's DataFrames.jl, or does this go against the grain and is it not advisable?

There are 2 answers

BallpointBen On

The groups produced by groupby are merely views into the original DataFrame. If you add a column to one, it necessarily adds a column to the whole DataFrame. Then insertcols! errors when it is called for the next group and tries to insert :shift into a DataFrame that already has it (since makeunique=false is the default).

To fix this you can either copy sdf so that changes aren't propagated to the parent DataFrame, or insert an empty :shift column first and then assign in place with sdf[:, :shift] .= scan_difference.
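
The second option might look like the minimal sketch below. The sample data and the body of calculate_current_scan (here simply the group mean) are assumptions made for illustration, since the question does not show them; only the column names come from the question.

```julia
using DataFrames
using Statistics: mean

# Hypothetical stand-in; the question does not show the real implementation.
calculate_current_scan(df::AbstractDataFrame) = mean(df.scan)

# Made-up sample data with the column names used in the question.
df = DataFrame(col1 = ["a", "a", "b", "b"], scan = [1.0, 3.0, 10.0, 14.0])
ref_scan = 2.0

# Create :shift once on the parent so every group view already has the column.
df.shift = zeros(nrow(df))

for sdf in groupby(df, :col1)
    scan_difference = calculate_current_scan(sdf) - ref_scan
    # Both assignments write through the view into the rows of `df`
    # that belong to this group.
    sdf[:, :scan] .= sdf.scan .- scan_difference
    sdf[:, :shift] .= scan_difference
end
```

Note that this mutates df itself, so it only fits if changing the original table is acceptable.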

Sudoh On

This was answered by Bogumił Kamiński on the Julia Discourse here: https://discourse.julialang.org/t/how-to-insert-columns-in-sub-tables-for-a-grouped-dataframe-in-dataframes-jl/109713

The answer turned out to be

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    current_scan::Float64 = calculate_current_scan(df)
    scan_difference::Float64 = current_scan - ref_scan
    res_df = copy(df) # to dealias a data frame
    res_df.scan = df.scan .- scan_difference
    res_df.shift .= scan_difference # this assumes :shift column is not present in res_df yet
    return res_df
end

Looks like calling copy inside the function is the way to go!
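
For anyone who wants to try it, here is a minimal end-to-end sketch of the copy-based approach. The sample data and the hypothetical calculate_current_scan (which just returns the group mean) are assumptions, since the real function is not shown in the question. The sketch creates the column with res_df[!, :shift] .= ..., which should behave the same as the res_df.shift .= ... form in the function above.

```julia
using DataFrames
using Statistics: mean

# Hypothetical stand-in; the question does not show the real implementation.
calculate_current_scan(df::AbstractDataFrame) = mean(df.scan)

function apply_scan_difference(df::AbstractDataFrame, ref_scan::Float64)
    scan_difference = calculate_current_scan(df) - ref_scan
    res_df = copy(df)                         # independent DataFrame, not a view
    res_df.scan = df.scan .- scan_difference
    res_df[!, :shift] .= scan_difference      # creates :shift on the copy only
    return res_df
end

# Made-up sample data with the column names used in the question.
df = DataFrame(col1 = ["a", "a", "b", "b"], scan = [1.0, 3.0, 10.0, 14.0])
grouped_df = groupby(df, :col1)

# Each group is processed independently and the per-group results are stacked.
result = combine(sdf -> apply_scan_difference(sdf, 2.0), grouped_df)
```

Because each group is copied before it is modified, the original df is left untouched and the parent never gains a :shift column.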