How to perform "ranking" (as in SAS or a "dense rank" as in SQL) in Deedle

1.4k views Asked by At

Sas has a procedure called rank that assigns a "rank" to each row in a dataframe according to the position in an ordered set of a variable, kind of; but the rank is not just the position: one has to tell the procedure how many groups use in the ranking. The rank is actually the group to which the row belongs.

In SQL terms, this is called a dense ranking.

Example (the salary variable is included for generality, but it is not used in this example):

Say we have this data frame:

Original

If we rank by age using 4 groups, sas would give us this: ranked

It is easier to understand what happened if we sort the data by the variable we ranked: ranked and sorted

Now we can see why rank gives us the position in an ordered set, kind of.

The rank procedure is very useful and cool, but I could't find in Deedle's doc how to perform it. Is there direct way to do it in Deedle or I need to create my own extension?

I suppose I could do it using these functions:

SortRows(frame, key)
chunk size series
1

There are 1 answers

0
Lay González On BEST ANSWER

I wrote my own extension:

type Frame<'TRowKey, 'TColumnKey
        when 'TRowKey : equality
        and 'TColumnKey : equality> with
    static member denseRank column (groups:int) rankName frame =
        let frameLength = Frame.countRows frame |> float
        let chunkSize =  frameLength / (float groups) |> Math.Ceiling |> int64

        let sorted =
            frame
            |> Frame.sortRows column

        let ranks =
            Frame.getCol column frame
            |> Series.map(fun k _ ->
                int ((Frame.indexForKey k sorted) / chunkSize)
            )

        let clone = frame.Clone()
        clone.AddColumn(rankName, ranks)
        clone

where indexForKey is this other custom extension:

    // index for row with key
    // index starting at 0
    static member indexForKey (key:'K) (frame:Frame<'K,_>) : int64 =
        frame.RowIndex.Locate key
        |> frame.RowIndex.AddressOperations.OffsetOf

I tried this other definition hoping that it would run faster. It is slightly faster, but not by a lot; any comments on performance issues are welcomed:

static member denseRank column (groups:int) rankName frame =
        let frameLength = Frame.countRows frame
        let chunkSize =  (float frameLength) / (float groups) |> Math.Ceiling

        let sorted =
            frame
            |> Frame.sortRows column

        let sortedKeys = Frame.getRowKeys sorted

        let ranksArr = Array.zeroCreate frameLength

        sortedKeys
        |> Seq.iteri (fun index _ -> ranksArr.[index] <- index / (int chunkSize))

        let ranks = Series(sortedKeys, ranksArr)
        let clone = frame.Clone()
        clone.AddColumn(rankName, ranks)
        clone