group by until changed sequence

260 views Asked by At

I have a big Excel file, which i read with Excel Provider in F#. The rows should be grouped by some column. Processing crashes with OutOfMemoryException. Not sure whether the Seq.groupBy call is guilty or excel type provider. To simplify it I use 3D Point here as a row.

type Point = { x : float; y: float; z: float; }
let points = seq {
    for x in 1 .. 1000 do
        for y in 1 .. 1000 do
            for z in 1 .. 1000 ->
                {x = float x; y = float y; z = float z}
    }
let groups = points |> Seq.groupBy (fun point -> point.x)

The rows are already ordered by grouped column, e.g. 10 points with x = 10, then 20 points with x = 20 and so one. Instead of grouping them I need just to split the rows in chunks until changed. Is there some way to enumerate the sequence just once and get sequence of rows splitted, not grouped, by some column value or some f(row) value?

4

There are 4 answers

5
Huusom On

Lets start with the input

let count = 1000

type Point = { x : float; y: float; z: float; }

let points = seq {
    for x in 1 .. count do
        for y in 1 .. count do
            for z in 1 .. count ->
                {x = float x; y = float y; z = float z}
    }

val count : int = 1000
type Point =
  {x: float;
   y: float;
   z: float;}
val points : seq<Point>

If we try to evalute points then we get a OutOfMemoryException:

points |> Seq.toList
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at Microsoft.FSharp.Collections.FSharpList`1.Cons(T head, FSharpList`1 tail)
   at Microsoft.FSharp.Collections.SeqModule.ToList[T](IEnumerable`1 source)
   at <StartupCode$FSI_0011>.$FSI_0011.main@()
Stopped due to error

It might be same reason that groupBy fails, but I'm not sure. But it tells us that we have to use seq and yield to return the groups with. So we get this implementation:

let group groupBy points = 
    let mutable lst = [  ]
    seq { for p in points do match lst with | [] -> lst <- [p] | p'::lst' when groupBy  p' p -> lst <- p::lst | lst' -> lst <- [p]; yield lst' }

val group : groupBy:('a -> 'a -> bool) -> points:seq<'a> -> seq<'a list>

It is not the most easily read code. It takes each point from the points sequence and prepends it to an accumulator list while the groupBy function is satisfied. If the groupBy function is not satisfied then a new accumulator list is generated and the old one is yielded. Note that the order of the accumulator list is reversed.

Testing the function:

for g in group (fun p' p -> p'.x = p.x ) points do
    printfn "%f %i" g.[0].x g.Length

Terminates nicely (after some time).

Other implementation with bug fix and better formatting.

let group (groupBy : 'a -> 'b when 'b : equality) points = 
    let mutable lst = []
    seq { 
        yield! seq { 
                   for p in points do
                       match lst with
                       | [] -> lst <- [ p ]
                       | p' :: lst' when (groupBy p') = (groupBy p) -> lst <- p :: lst
                       | lst' -> 
                           lst <- [ p ]
                           yield (groupBy lst'.Head, lst')
               } 
        yield (groupBy lst.Head, lst)
    }
0
latkin On

Another solution, along the same lines as Kevin's

module Seq = 
    let chunkBy f src = 
        seq { 
            let chunk = ResizeArray()
            let mutable key = Unchecked.defaultof<_>
            for x in src do
                let newKey = f x
                if (chunk.Count <> 0) && (newKey <> key) then 
                    yield chunk.ToArray()
                    chunk.Clear()
                key <- newKey
                chunk.Add(x)
        }

// returns 2 arrays, each with 1000 elements
points |> Seq.chunkBy (fun pt -> pt.y) |> Seq.take 2

Here's a purely functional approach, which is surely slower, and much harder to understand.

module Seq = 
    let chunkByFold f src = 
        src
        |> Seq.scan (fun (chunk, (key, carry)) x -> 
               let chunk = defaultArg carry chunk
               let newKey = f x
               if List.isEmpty chunk then [x], (newKey, None)
               elif newKey = key then x :: chunk, (key, None)
               else chunk, (newKey, Some([x]))) ([], (Unchecked.defaultof<_>, None))
        |> Seq.filter (snd >> snd >> Option.isSome)
        |> Seq.map fst
0
Kevin On

If the rows are already ordered then this chunkify function will return a seq<'a list>. Each list will contain all the points with the same x value.

let chunkify pred s = seq {
  let values = ref []
  for x in s do
    match !values with
    |h::t ->  if pred h x then
                values := x::!values
              else
                yield !values
                values := [x]
    |[] -> values := [x]
  yield !values
}

let chunked = points |> chunkify (fun x y -> x.x = y.x)

Here chunked has a type of

seq<Point list>
0
python_kaa On

Seems there is no one line purely functional solution or already defined Seq method which I have overseen.

Therefore as an alternative here my own imperative solution. Comparable to @Kevin's answer but actually satisfies more my need. The ref cell contains:

  • The group key, which is calculated just once for each row
  • The current chunk list (could be seq to be conform to Seq.groupBy), which contains the elements in the input order for which the f(x) equals to the sored group key (requires equality).

.

let splitByChanged f xs =
    let acc = ref (None,[])
    seq {
        for x in xs do
            match !acc with
            | None,_ ->
                acc := Some (f x),[x]
            | Some key, chunk when key = f x ->
                acc := Some key, x::chunk
            | Some key, chunk -> 
                let group = chunk |> Seq.toList |> List.rev
                yield key, group
                acc := Some (f x),[x]
        match !acc with
        | None,_ -> ()
        | Some key,chunk -> 
            let group = chunk |> Seq.toList |> List.rev
            yield key, group
    }
    points |> splitByChanged (fun point -> point.x)

The function has the following signature:

    val splitByChanged :
      f:('a -> 'b) -> xs:seq<'a> -> seq<'b * 'a list> when 'b : equality

Correctures and even better solutions are welcome