I need to normalize my data columns so that the top data maps to approximately 1.0 and the bottom data maps to approximately -1.0. The data has been tagged with section column with top, falling, bottom and rising data.
using DataFrames, Statistics
df = DataFrame(
sec = ["t", "t", "f", "f", "f", "b", "b", "r", "r", "r"],
a = [1.01, 0.99, 0.5, 0.0, -0.5, -0.99, -1.01, -0.5, 0, 0.5],
b = 0.5 .+ [1.01, 0.99, 0.5, 0.0, -0.5, -0.99, -1.01, -0.5, 0, 0.5],
c = 0.5 .+ 2.0 .* [1.01, 0.99, 0.5, 0.0, -0.5, -0.99, -1.01, -0.5, 0, 0.5]
)
df
10×4 DataFrame
Row │ sec a b c
│ String Float64 Float64 Float64
─────┼───────────────────────────────────
1 │ t 1.01 1.51 2.52
2 │ t 0.99 1.49 2.48
3 │ f 0.5 1.0 1.5
4 │ f 0.0 0.5 0.5
5 │ f -0.5 0.0 -0.5
6 │ b -0.99 -0.49 -1.48
7 │ b -1.01 -0.51 -1.52
8 │ r -0.5 0.0 -0.5
9 │ r 0.0 0.5 0.5
10 │ r 0.5 1.0 1.5
I ended up with creating additional data structures, highdf and lowdf for intermediate calculations, and then normalizing each column individually using the data stored in the highdf and lowdf for that column, creating the normalized data frame one column at a time, appending it to the resultant dataframe.
highdf = combine(filter(:sec => x -> x == "t", df), Not(:sec) .=> mean, renamecols=false)
lowdf = combine(filter(:sec => x -> x == "b", df), Not(:sec) .=> mean, renamecols=false)
println("highdf = ", highdf)
println("lowdf = ", lowdf)
normdf = select(df, :sec)
for (colname, colvals) in pairs(eachcol(df[:, Not(:sec)]))
normdf[!, colname] =
(colvals .* 2.0 .- (highdf[1, colname] + lowdf[1, colname])) .*
(1.0 / (highdf[1, colname] - lowdf[1, colname]))
end
highdf
1×3 DataFrame
Row │ a b c
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.0 1.5 2.5
lowdf
1×3 DataFrame
Row │ a b c
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ -1.0 -0.5 -1.5
normdf
10×4 DataFrame
Row │ sec a b c
│ String Float64 Float64 Float64
─────┼───────────────────────────────────
1 │ t 1.01 1.01 1.01
2 │ t 0.99 0.99 0.99
3 │ f 0.5 0.5 0.5
4 │ f 0.0 0.0 0.0
5 │ f -0.5 -0.5 -0.5
6 │ b -0.99 -0.99 -0.99
7 │ b -1.01 -1.01 -1.01
8 │ r -0.5 -0.5 -0.5
9 │ r 0.0 0.0 0.0
10 │ r 0.5 0.5 0.5
I would like to simplify and speed up the code by using a single transformation. However, I ran into two problems creating an anonymous function to apply the normalization to a column.
First - to select the data used in averaging the top and bottom data requires accessing the :sec column of the data to filter for the top and bottom data. Second - storing the high a low averages of the column in local variables so that these values can be used multiple times in the normalization.
How do I write this as a single transformation like the following?
normdf = transform(
df,
Not(:sec) .=> (
colvals ->
local high = mean(filter(df[:, :sec] => (x .-> x == "t"), colvals))
local lowh = mean(filter(df[:, :sec] => (x .-> x == "b"), colvals))
(colvals .* 2.0 .- (high + low)) .* (1.0 / (high - low))
),
renamecols=false
)
After much searching, I found Julia has two features that enable the above code to be cleanly implemented in a single transformation:
First: Logical Indexing. Julia transform only passes the colvals, not the colnames to the function. In this case, I employ logical indexing into the colvals so I don't need to know which column I'm working on.
Second: Compound Expressions. Julia interprites multiple expressions within a single begin end block as a single expression. Further, the begin end block acts as the frame for variables. New variables within this frame are declared locally to the frame, so "local" is not required.
The compiler should note that the logical indexing for computing the mean is constant for all columns and compute the indexing only once.