Use new DataFrames.combine returned column as argument in Julia


I am trying to use DataFrames.combine to chain multiple transformations. The desired final DataFrame is the one below.

using DataFrames, Statistics

df = DataFrame(x = repeat([1], 4))

df_2 = combine(df, 
        :x => sum => :sum_x)

df_2.sqrt_sum_x .= sqrt.(df_2.sum_x)

println(df_2)
#1×2 DataFrame
# Row │ sum_x  sqrt_sum_x 
#     │ Int64  Float64
#─────┼───────────────────
#   1 │     4         2.0

I was wondering if there is any way of achieving the previous result with a single call to combine, e.g. by using the new target column :sum_x as a source column in a later transformation (see code below). However, this throws an ArgumentError because combine cannot find the newly computed :sum_x column.

combine(df, 
        :x => sum => :sum_x,
        :sum_x => sqrt => :sqrt_sum_x)
# ERROR: ArgumentError: column name :sum_x not found in the data frame

There is 1 answer

Bogumił Kamiński (best answer)

Currently this is not allowed. The reason is that the order of execution of transformations in combine is undefined. In particular, in some situations these operations are executed in parallel using multi-threading (to improve performance).

Additionally, such an operation could be ambiguous to interpret. For example, if you had written:

combine(df, 
        :x => sum => :sum_x,
        [:x, :sum_x] => (+) => :x_plus_sum_x)

then in the transformation:

[:x, :sum_x] => (+) => :x_plus_sum_x

:x would come from the source data frame df (and have 4 elements), while :sum_x would come from the not-yet-existing target data frame (and have 1 element). Technically it would be possible to make this work, but we decided that it could be confusing.
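
As a possible workaround (a sketch, not part of the original answer), the dependent value can be computed either in a second chained call or inside a single function whose result is expanded into several columns. The names df_chained, df_single and sum_and_sqrt are illustrative only; the sketch assumes a DataFrames.jl 1.x release, where a NamedTuple return value can be combined with the AsTable target.

```julia
using DataFrames

df = DataFrame(x = repeat([1], 4))

# Option 1: two chained calls. The second call can see :sum_x because it
# operates on the already-materialized result of the first call.
df_chained = transform(combine(df, :x => sum => :sum_x),
                       :sum_x => ByRow(sqrt) => :sqrt_sum_x)

# Option 2: a single combine call. One function computes both values and
# returns a NamedTuple; the AsTable target expands it into two columns.
function sum_and_sqrt(x)
    s = sum(x)
    return (sum_x = s, sqrt_sum_x = sqrt(s))
end

df_single = combine(df, :x => sum_and_sqrt => AsTable)
```

Option 2 avoids the ordering question altogether, because both columns come from one transformation rather than from two transformations that depend on each other.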