Puzzling performance/output behavior with rank-2 polymorphism in Haskell

The code below (annotated inline with locations) is a minimal example of the puzzling behavior I'm experiencing.

Essentially, why does (2) result in terrible space/time performance while (1) does not?

The code is compiled and run as follows with GHC 8.4.3: ghc -prof -fprof-auto -rtsopts test.hs; ./test +RTS -p

{-# LANGUAGE Rank2Types #-}

import Debug.Trace

-- Not sure how to get rid of the record
data State = State {
    -- (0) If vstate :: Float, the extra "hello"s go away
    vstate :: forall a . (Fractional a) => a
}

step :: State -> State
step s =
    -- (1) one "hello" per step
    -- let vs = trace "hello" (vstate s) in
    -- s { vstate = vs `seq` vstate s }

    -- (2) increasing "hello"s per step
    s { vstate = (trace "hello" (vstate s)) `seq` vstate s }

main :: IO ()
main = do
    let initState = State { vstate = 0 }

    -- (3) step 3 times
    -- let res = step $ step $ step initState
    -- print $ vstate res

    -- (4) step 20 times to profile time/space performance
    let res = iterate step initState
    print $ vstate $ last $ take 20 res

    print "done"

a. With (1) and (3) commented in, compiled without -O2, the code only outputs "hello" three times, as I expect it to.

b. With (2) and (3) commented in, compiled without -O2, the code outputs "hello" eight times. It seems to output one additional "hello" per step. I don't understand why this is happening.

c. With (1) and (4) commented in, compiled without -O2, the code runs extremely fast.

d. With (2) and (4) commented in, compiled without -O2, the code runs very slowly, and the performance report (included below) shows that it makes many more calls to vstate and allocates far more memory than variant c. I also don't understand why this is happening.

e. With (2) and (4) commented in, compiled with -O2, the code behaves the same as variant c. So clearly ghc is able to optimize away whatever pathological behavior is happening in variant d.

Here is the profiling report for variant c (fast):

    Mon Aug 13 15:48 2018 Time and Allocation Profiling Report  (Final)

       partial +RTS -p -RTS

    total time  =        0.00 secs   (0 ticks @ 1000 us, 1 processor)
    total alloc =     107,560 bytes  (excludes profiling overheads)

COST CENTRE MODULE           SRC                        %time %alloc

CAF         GHC.IO.Handle.FD <entire-module>              0.0   32.3
CAF         GHC.IO.Encoding  <entire-module>              0.0    3.1
main        Main             partial.hs:(24,1)-(35,16)    0.0   13.4
main.res    Main             partial.hs:32:9-36           0.0    1.6
step        Main             partial.hs:(15,1)-(18,36)    0.0    1.1
step.vs     Main             partial.hs:17:9-37           0.0   46.1


                                                                                         individual      inherited
COST CENTRE           MODULE                SRC                       no.     entries  %time %alloc   %time %alloc

MAIN                  MAIN                  <built-in>                114          0    0.0    0.6     0.0  100.0
 CAF                  Main                  <entire-module>           227          0    0.0    0.1     0.0   52.2
  main                Main                  partial.hs:(24,1)-(35,16) 228          1    0.0    2.7     0.0   52.1
   vstate             Main                  partial.hs:11:5-10        230         20    0.0    0.0     0.0    0.0
    main.initState    Main                  partial.hs:25:9-40        239          0    0.0    0.0     0.0    0.0
    main.res          Main                  partial.hs:32:9-36        234          0    0.0    0.0     0.0    0.0
     step             Main                  partial.hs:(15,1)-(18,36) 235          0    0.0    0.0     0.0    0.0
   main.initState     Main                  partial.hs:25:9-40        233          1    0.0    0.0     0.0    0.0
   main.res           Main                  partial.hs:32:9-36        231          1    0.0    1.6     0.0   49.4
    step              Main                  partial.hs:(15,1)-(18,36) 232         19    0.0    1.1     0.0   47.8
     step.vs          Main                  partial.hs:17:9-37        236         19    0.0   46.1     0.0   46.7
      vstate          Main                  partial.hs:11:5-10        237        190    0.0    0.0     0.0    0.6
       main.initState Main                  partial.hs:25:9-40        238          0    0.0    0.6     0.0    0.6
 CAF                  Debug.Trace           <entire-module>           217          0    0.0    0.2     0.0    0.2
 CAF                  GHC.Conc.Signal       <entire-module>           206          0    0.0    0.6     0.0    0.6
 CAF                  GHC.IO.Encoding       <entire-module>           189          0    0.0    3.1     0.0    3.1
 CAF                  GHC.IO.Encoding.Iconv <entire-module>           187          0    0.0    0.2     0.0    0.2
 CAF                  GHC.IO.Handle.FD      <entire-module>           178          0    0.0   32.3     0.0   32.3
 CAF                  GHC.IO.Handle.Text    <entire-module>           176          0    0.0    0.1     0.0    0.1
 main                 Main                  partial.hs:(24,1)-(35,16) 229          0    0.0   10.7     0.0   10.7 

Here is the profiling report for variant d (slow; without -O2):

    Mon Aug 13 15:25 2018 Time and Allocation Profiling Report  (Final)

       partial +RTS -p -RTS

    total time  =        1.48 secs   (1480 ticks @ 1000 us, 1 processor)
    total alloc = 1,384,174,472 bytes  (excludes profiling overheads)

COST CENTRE    MODULE    SRC                        %time %alloc

step           Main      partial.hs:(15,1)-(21,60)   95.7   98.8
main.initState Main      partial.hs:25:9-40           3.0    1.2
vstate         Main      partial.hs:11:5-10           1.4    0.0


                                                                                      individual      inherited
COST CENTRE        MODULE                SRC                       no.     entries  %time %alloc   %time %alloc

MAIN               MAIN                  <built-in>                114          0    0.0    0.0   100.0  100.0
 CAF               Main                  <entire-module>           227          0    0.0    0.0   100.0  100.0
  main             Main                  partial.hs:(24,1)-(35,16) 228          1    0.0    0.0   100.0  100.0
   vstate          Main                  partial.hs:11:5-10        230    1048575    1.4    0.0   100.0  100.0
    main.initState Main                  partial.hs:25:9-40        236          0    3.0    1.2     3.0    1.2
    main.res       Main                  partial.hs:32:9-36        234          0    0.0    0.0    95.7   98.8
     step          Main                  partial.hs:(15,1)-(21,60) 235          0   95.7   98.8    95.7   98.8
   main.initState  Main                  partial.hs:25:9-40        233          1    0.0    0.0     0.0    0.0
   main.res        Main                  partial.hs:32:9-36        231          1    0.0    0.0     0.0    0.0
    step           Main                  partial.hs:(15,1)-(21,60) 232         19    0.0    0.0     0.0    0.0
 CAF               Debug.Trace           <entire-module>           217          0    0.0    0.0     0.0    0.0
 CAF               GHC.Conc.Signal       <entire-module>           206          0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding       <entire-module>           189          0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding.Iconv <entire-module>           187          0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Handle.FD      <entire-module>           178          0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Handle.Text    <entire-module>           176          0    0.0    0.0     0.0    0.0
 main              Main                  partial.hs:(24,1)-(35,16) 229          0    0.0    0.0     0.0    0.0

Here are some notes/guesses/questions on why this is happening:

  • What's the relationship of the degrading performance to the increasing "hello" count? The pathological version seems to output one more "hello" with each additional step. Why?
  • I'm aware that polymorphism in Haskell can be slow, as detailed in this StackOverflow question. That might be part of the problem, since the pathological behavior goes away when vstate is monomorphized as vstate :: Float. But I don't see why dropping the let-binding, as at location (2), would cause such bad time/space performance.
  • This is the minimal version of a performance bug in a larger codebase, which we fixed by "monomorphizing" floating-type numbers using realToFrac (the commit is here in case anyone is curious). I know that compiling with -O2 fixes the behavior in the minimal example, but I tried it in the larger codebase and it does not fix the performance issues. (The reason we need rank-2 polymorphism in the larger codebase is to use the ad library for autodiff.) Is there a more principled fix than using realToFrac, such as an inline specialization that can be applied?
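For reference, the realToFrac workaround described above amounts to something like the following sketch on a cut-down version of State (`S`, `getS`, and `stepMono` are illustrative names, not from the actual codebase):

```haskell
{-# LANGUAGE RankNTypes #-}

newtype S = S { getS :: forall a. Fractional a => a }

-- Force the stored value at Double once per step, then re-generalize it
-- with realToFrac so the record field stays polymorphic. The Double is a
-- plain shared value, so it is never recomputed.
stepMono :: S -> S
stepMono (S f) =
  let vs = f :: Double
  in vs `seq` S (realToFrac vs)

main :: IO ()
main = print (getS (stepMono (S 2)) :: Double)  -- prints 2.0
```

This trades a dictionary-polymorphic thunk for one realToFrac conversion per use site, which is why it restores sharing.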
Answer by András Kovács (accepted):

forall a . (Fractional a) => a is a function type.

It has two arguments: a type (a :: *) and an instance of Fractional a. Whenever you see =>, it's operationally a function: it compiles to a function in GHC's Core representation, and sometimes stays a function in machine code as well. The main difference between -> and => is that arguments for the latter cannot be given explicitly by programmers; they are always filled in implicitly by instance resolution.
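The dictionary-passing translation can be pictured with a hand-rolled sketch (`FractionalDict`, `PolyFrac`, and `half` are illustrative names, not GHC's actual Core):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A cut-down stand-in for GHC's Fractional dictionary.
data FractionalDict a = FractionalDict
  { fromRationalD :: Rational -> a
  , divD          :: a -> a -> a
  }

doubleDict :: FractionalDict Double
doubleDict = FractionalDict fromRational (/)

-- `forall a. Fractional a => a` desugars to roughly this function type:
type PolyFrac = forall a. FractionalDict a -> a

-- The body runs again on every call, once per supplied dictionary.
half :: PolyFrac
half dict = divD dict (fromRationalD dict 1) (fromRationalD dict 2)

main :: IO ()
main = print (half doubleDict)  -- prints 0.5
```

Because `half` is a function of the dictionary, nothing about its result is cached between calls; that is exactly the property the rest of the answer turns on.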

Let's see the fast step first:

step :: State -> State
step (State f) =
    let vs = trace "hello" f
    in State (vs `seq` f)

Here, vs has an undetermined Fractional type which defaults to Double. If you turn on the -Wtype-defaults warning, GHC will point this out to you. Since vs :: Double, it's just a numeric value, which is captured by the returned closure. In contrast, vs `seq` f is a function: it has the functional type forall a . (Fractional a) => a, so GHC desugars it to an actual lambda expression, which abstracts over the two arguments, captures vs as a free variable, and passes both arguments along to f.

So, each step creates a new function closure which captures a vs :: Double. If we call step three times, we get three closures with three Doubles inside, each closure referring to the previous one. Then, when we write vstate (step $ step $ step initState), we again default to Double, and GHC calls this closure with the Fractional Double instance. Each vs calls the previous closure with Fractional Double, but every vs is evaluated only once, because they are regular lazy Double values which are not recomputed.
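The closure chain can be sketched with a newtype wrapper, making the defaulting to Double explicit with an annotation (`Poly`, `stepSketch`, and `runAtDouble` are made-up names):

```haskell
{-# LANGUAGE RankNTypes #-}

newtype Poly = Poly (forall a. Fractional a => a)

-- Mirrors the fast `step`: each call allocates one Double and one new
-- closure that captures it and forwards the implicit arguments to `f`.
stepSketch :: Poly -> Poly
stepSketch (Poly f) =
  let vs = f :: Double        -- a plain value, evaluated at most once
  in Poly (vs `seq` f)

-- Instantiate the chain at Double, as defaulting does in the original.
runAtDouble :: Poly -> Double
runAtDouble (Poly f) = f

main :: IO ()
main = print (runAtDouble (stepSketch (stepSketch (stepSketch (Poly 0)))))
```

Three applications of stepSketch build three nested closures, each holding one already-evaluated Double, which is why the trace in the real code fires once per step.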

However, if we enable NoMonomorphismRestriction, vs is generalized to forall a. Fractional a => a, so it becomes a function as well, and its calls are not memoized anymore. Hence, in this case the fast version behaves the same as the slow version.

Now, the slow step:

step :: State -> State
step (State f) = State ((trace "hello" f) `seq` f)

This makes an exponential number of calls in the number of steps, because step calls f twice, and without optimizations no computation is shared, since both calls occur under a lambda. In (trace "hello" f) `seq` f, the first call to f defaults to Fractional Double, and the second call just passes along the implicit Fractional a instance as before.
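The blowup can be modeled directly by counting leaf calls, with the implicit dictionary argument replaced by `()` (a sketch; `Thunk`, `leaf`, and `slowStep` are made-up names):

```haskell
-- Each layer of `slowStep` calls the wrapped "polymorphic value" twice
-- when forced, so n layers perform 2^n leaf calls.
type Thunk = () -> Int

leaf :: Thunk
leaf () = 1

slowStep :: Thunk -> Thunk
slowStep f () = f () + f ()   -- both calls are under a lambda: no sharing

main :: IO ()
main = print ((iterate slowStep leaf !! 10) ())  -- 2^10 = 1024
```

This matches the profile above: 20 steps give on the order of 2^20 entries to vstate (the report shows 1048575 = 2^20 - 1).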

If we switch on optimization, GHC observes that the first f call does not depend on the function parameters, and floats out trace "hello" f to a let-binding, resulting in pretty much the same code as in the fast version.