Julia DataFrames - Create single transformation requiring column name and local variables

Question

Julia DataFrames - Create single transformation requiring column name and local variables

59 views Asked by Mark Norris At 21 September 2023 at 22:22

I need to normalize my data columns so that the top data maps to approximately 1.0 and the bottom data maps to approximately -1.0. The data has been tagged with section column with top, falling, bottom and rising data.

using DataFrames, Statistics

df = DataFrame(
    sec = ["t", "t",  "f", "f", "f",  "b", "b",  "r", "r", "r"], 
    a = [1.01, 0.99,  0.5, 0.0, -0.5,  -0.99, -1.01,  -0.5, 0, 0.5],
    b = 0.5 .+ [1.01, 0.99,  0.5, 0.0, -0.5,  -0.99, -1.01,  -0.5, 0, 0.5],
    c = 0.5 .+ 2.0 .* [1.01, 0.99,  0.5, 0.0, -0.5,  -0.99, -1.01,  -0.5, 0, 0.5]
)

df
10×4 DataFrame
 Row │ sec     a        b        c       
     │ String  Float64  Float64  Float64 
─────┼───────────────────────────────────
   1 │ t          1.01     1.51     2.52
   2 │ t          0.99     1.49     2.48
   3 │ f          0.5      1.0      1.5
   4 │ f          0.0      0.5      0.5
   5 │ f         -0.5      0.0     -0.5
   6 │ b         -0.99    -0.49    -1.48
   7 │ b         -1.01    -0.51    -1.52
   8 │ r         -0.5      0.0     -0.5
   9 │ r          0.0      0.5      0.5
  10 │ r          0.5      1.0      1.5

I ended up with creating additional data structures, highdf and lowdf for intermediate calculations, and then normalizing each column individually using the data stored in the highdf and lowdf for that column, creating the normalized data frame one column at a time, appending it to the resultant dataframe.

highdf = combine(filter(:sec => x -> x == "t", df), Not(:sec) .=> mean, renamecols=false)
lowdf = combine(filter(:sec => x -> x == "b", df), Not(:sec) .=> mean, renamecols=false)
println("highdf = ", highdf) 
println("lowdf = ", lowdf) 

normdf = select(df, :sec)
for (colname, colvals) in pairs(eachcol(df[:, Not(:sec)]))
    normdf[!, colname] = 
        (colvals .* 2.0 .- (highdf[1, colname] + lowdf[1, colname])) .*
            (1.0 / (highdf[1, colname] - lowdf[1, colname]))
end

highdf
1×3 DataFrame
 Row │ a        b        c       
     │ Float64  Float64  Float64 
─────┼───────────────────────────
   1 │     1.0      1.5      2.5

lowdf
1×3 DataFrame
 Row │ a        b        c       
     │ Float64  Float64  Float64 
─────┼───────────────────────────
   1 │    -1.0     -0.5     -1.5


normdf
10×4 DataFrame
 Row │ sec     a        b        c       
     │ String  Float64  Float64  Float64 
─────┼───────────────────────────────────
   1 │ t          1.01     1.01     1.01
   2 │ t          0.99     0.99     0.99
   3 │ f          0.5      0.5      0.5
   4 │ f          0.0      0.0      0.0
   5 │ f         -0.5     -0.5     -0.5
   6 │ b         -0.99    -0.99    -0.99
   7 │ b         -1.01    -1.01    -1.01
   8 │ r         -0.5     -0.5     -0.5
   9 │ r          0.0      0.0      0.0
  10 │ r          0.5      0.5      0.5

I would like to simplify and speed up the code by using a single transformation. However, I ran into two problems creating an anonymous function to apply the normalization to a column.

First - to select the data used in averaging the top and bottom data requires accessing the :sec column of the data to filter for the top and bottom data. Second - storing the high a low averages of the column in local variables so that these values can be used multiple times in the normalization.

How do I write this as a single transformation like the following?

normdf = transform(
    df, 
    Not(:sec) .=> (
        colvals -> 
            local high = mean(filter(df[:, :sec] => (x .-> x == "t"), colvals))
            local lowh = mean(filter(df[:, :sec] => (x .-> x == "b"), colvals))
            (colvals .* 2.0 .- (high + low)) .* (1.0 / (high - low))
    ), 
    renamecols=false
)

Original Q&A

There are 1 answers

**Mark Norris** · Answer 1 · 2023-09-22T21:04:10+00:00

After much searching, I found Julia has two features that enable the above code to be cleanly implemented in a single transformation:

First: Logical Indexing. Julia transform only passes the colvals, not the colnames to the function. In this case, I employ logical indexing into the colvals so I don't need to know which column I'm working on.

Second: Compound Expressions. Julia interprites multiple expressions within a single begin end block as a single expression. Further, the begin end block acts as the frame for variables. New variables within this frame are declared locally to the frame, so "local" is not required.

normdf = transform(
    df, 
    Not(:sec) .=> (
        colvals -> begin
            high = mean(colvals[df[:, :sec] .== "t"])
            low = mean(colvals[df[:, :sec] .== "b"])
            (colvals .* 2.0 .- (high + low)) .* (1.0 / (high - low))
        end
    ), 
    renamecols=false
)

The compiler should note that the logical indexing for computing the mean is constant for all columns and compute the indexing only once.

TechQA.

Julia DataFrames - Create single transformation requiring column name and local variables

There are 1 answers

Related Questions in DATAFRAME

Related Questions in JULIA

Related Questions in DATAFRAMES.JL

Related Questions in JULIA-DATAFRAME

Popular Questions

Trending Questions