Data Frame - order consistency when converting to matrix


I have a Deedle Frame<DateTime,string>. The columns contain float values and are dense (no missing values).

I need to build the data frame from a string [] and then:

  • Build a 2D Matrix with the whole data
  • Build a Series<DateTime,Matrix<float,CpuLib>>, collapsing each row into a 1xn matrix

In my case, I am experimenting with FCore by StatFactory, but I may use another linear algebra library in the future.

My concern is that I need to make sure that the order of rows and columns is not changed in the process.

Data Frame Construction

I fetch the data using the following code. I notice that the order of the columns is different from the initial list of tickers. Why is that? Would using Array.Parallel.map change the order?

/// get the selected tickers in a DataFrame from a DataContext  
let fetchTickers tickers joinKind =

    let getTicker ticker = 
        query {
            for row in db.PriceBarsDay do
            where (row.Ticker = ticker)
            select row } 
        |> Seq.map (fun row -> row.DateTime, float row.Close)
        |> dict

    tickers
    |> Array.map (fun ticker -> getTicker ticker)  // returns a dict(DateTime, ClosePrice)
    |> Array.map (fun dictionary -> Series(dictionary))
    |> Array.map2 (fun ticker series -> [ticker => series] |> frame ) tickers
    |> Array.reduce (fun accumFrame frame -> accumFrame.Join(frame, joinKind))

Data frame to 2D matrix

In order to build the matrix I use the code below. Mapping over the array of column names (selectedCols) ensures that the order of columns is not shifted. I ran unit tests on the order of rows using Array.map and everything looks fine, but I would like to know:

  • Is there a consistency check in the library that would guard against the order changing?
  • I suppose Array.Parallel.map would preserve the order of columns. Is that correct?

Here is the code:

/// Build a matrix 
let buildMatrix selectedCols (frame: Frame<DateTime, String>) = 
    let matrix = 
        selectedCols 
        |> Array.map (fun colname -> frame.GetSeries(colname))
        |> Array.map (fun serie -> Series.values serie)
        |> Array.map (fun aSeq -> Seq.map unbox<float> aSeq)
        |> Array.map (fun aSeq -> Matrix(aSeq) )
        |> Array.reduce (fun acc matrix -> acc .| matrix)
    matrix.T

Data Frame to Time Series of Row Matrices

I build the time series of row matrices with the code below.

  • Keeping the data in the Series should ensure that the order of rows is preserved.
  • How can I filter the columns and ensure that the column order is exactly as in the array of column names passed on to the function?

Here is the code:

// Time series of row matrices - it'll be used to run a simulation
let timeSeriesOfMatrix frame = 
    frame
    |> Frame.filterRows (fun day target -> day >= startKalman)   
    |> Frame.mapRowValues ( fun row -> row.Values |> Seq.map unbox<float> )
    |> Series.mapValues( fun row -> Matrix(row) )

Many thanks.

PS: I kept all three scenarios together because I believe that, taken together, the examples above will help other users and myself understand how the library works better than discussing each case separately.

There is 1 answer

Answer by Tomas Petricek (best answer)

To answer the first part: the order changes because you are joining ordered frames (each containing just a single series), and the frame construction preserves the ordering in this case. You can probably replace the last two lines with Frame.ofColumns instead of an explicit join. This always does an outer join, but if you need an inner join, you can then use Frame.dropSparseRows to drop the rows with missing values.
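
A rough sketch of what that replacement could look like, under some assumptions: it is written as an F# script for dotnet fsi with the Deedle NuGet package, getTicker is a hypothetical in-memory stand-in for the database query in the question, and the ticker names, dates and prices are made up. It is not the original poster's code, only an illustration of Frame.ofColumns followed by Frame.dropSparseRows.

```fsharp
#r "nuget: Deedle"   // assumption: F# script run with dotnet fsi

open System
open Deedle

// Hypothetical stand-in for the question's database query:
// returns made-up close prices keyed by date.
let getTicker (ticker: string) : Series<DateTime, float> =
    if ticker = "BBB" then
        series [ DateTime(2014, 1, 2) => 1.0; DateTime(2014, 1, 3) => 2.0 ]
    else
        series [ DateTime(2014, 1, 3) => 3.0 ]

// Build the frame in one step from (ticker, series) pairs
// instead of joining single-column frames one by one.
let fetchTickers (tickers: string[]) : Frame<DateTime, string> =
    Frame.ofColumns [ for ticker in tickers -> ticker => getTicker ticker ]

// Outer join: every date that appears in any series becomes a row.
let outer = fetchTickers [| "BBB"; "AAA" |]

// Inner-join behaviour: keep only rows where every column has a value.
let inner = outer |> Frame.dropSparseRows

printfn "outer rows: %d, inner rows: %d" outer.RowCount inner.RowCount
printfn "column keys: %A" (List.ofSeq outer.ColumnKeys)
```

Printing the column keys is a cheap way to check, on your own data, whether the columns come out in the order of the ticker array.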

In your second sample, everything looks good. You could save some work by getting the data as floats directly:

frame.GetSeries<float>(colname).Values

The third sample also looks good and you can make it a bit shorter:

row.As<float>().Values
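
The Array.Parallel.map part of the question is not covered above. In F# core, Array.Parallel.map may run the function on the elements in any order, but element i of the result is always the function applied to element i of the input, so the column order is not affected. A minimal sketch to check this, using a made-up ticker list and a trivial stand-in (lookup) for the per-ticker work:

```fsharp
// Hypothetical ticker list, deliberately not in alphabetical order
let tickers = [| "MSFT"; "AAPL"; "IBM"; "GOOG" |]

// Trivial stand-in for the per-ticker work (e.g. fetching a series)
let lookup (ticker: string) = ticker.ToLowerInvariant()

let sequential = tickers |> Array.map lookup
let viaParallel = tickers |> Array.Parallel.map lookup

// Arrays use structural equality in F#, so this compares element by element
printfn "same order: %b" (sequential = viaParallel)
```

The function passed to Array.Parallel.map should be safe to call from several threads at once; whether that holds for a database context like the one in the question is a separate concern from ordering.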