The below code (annotated inline with locations) gives a minimal example of the puzzling behavior I'm experiencing.
Essentially, why does (2) result in terrible space/time performance while (1) does not?
The below code is compiled and run as follows on ghc version 8.4.3:
ghc -prof -fprof-auto -rtsopts test.hs; ./test +RTS -p
{-# LANGUAGE Rank2Types #-}
import Debug.Trace
-- Not sure how to get rid of the record
data State = State {
-- (0) If vstate :: Float, the extra "hello"s go away
vstate :: forall a . (Fractional a) => a
}
step :: State -> State
step s =
-- (1) one "hello" per step
-- let vs = trace "hello" (vstate s) in
-- s { vstate = vs `seq` vstate s }
-- (2) increasing "hello"s per step
s { vstate = (trace "hello" (vstate s)) `seq` vstate s }
main :: IO ()
main = do
let initState = State { vstate = 0 }
-- (3) step 3 times
-- let res = step $ step $ step initState
-- print $ vstate res
-- (4) step 20 times to profile time/space performance
let res = iterate step initState
print $ vstate $ last $ take 20 res
print "done"
a. With (1) and (3) commented in, compiled without -O2
, the code only outputs "hello" three times, as I expect it to.
b. With (2) and (3) commented in, compiled without -O2
, the code outputs "hello" eight times. It seems to output one additional "hello" per step. I don't understand why this is happening.
c. With (1) and (4) commented in, compiled without -O2
, the code runs extremely fast.
d. With (2) and (4) commented in, compiled without -O2
, the code runs very slowly, and the performance report (included below) shows that makes many more calls to vstate
and uses much more memory than variant c
. I also don't understand why this is happening.
e. With (2) and (4) commented in, compiled with -O2
, the code behaves the same as variant c
. So clearly ghc is able to optimize away whatever pathological behavior is happening in variant d
.
Here is the profiling report for variant c
(fast):
Mon Aug 13 15:48 2018 Time and Allocation Profiling Report (Final)
partial +RTS -p -RTS
total time = 0.00 secs (0 ticks @ 1000 us, 1 processor)
total alloc = 107,560 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
CAF GHC.IO.Handle.FD <entire-module> 0.0 32.3
CAF GHC.IO.Encoding <entire-module> 0.0 3.1
main Main partial.hs:(24,1)-(35,16) 0.0 13.4
main.res Main partial.hs:32:9-36 0.0 1.6
step Main partial.hs:(15,1)-(18,36) 0.0 1.1
step.vs Main partial.hs:17:9-37 0.0 46.1
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 114 0 0.0 0.6 0.0 100.0
CAF Main <entire-module> 227 0 0.0 0.1 0.0 52.2
main Main partial.hs:(24,1)-(35,16) 228 1 0.0 2.7 0.0 52.1
vstate Main partial.hs:11:5-10 230 20 0.0 0.0 0.0 0.0
main.initState Main partial.hs:25:9-40 239 0 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 234 0 0.0 0.0 0.0 0.0
step Main partial.hs:(15,1)-(18,36) 235 0 0.0 0.0 0.0 0.0
main.initState Main partial.hs:25:9-40 233 1 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 231 1 0.0 1.6 0.0 49.4
step Main partial.hs:(15,1)-(18,36) 232 19 0.0 1.1 0.0 47.8
step.vs Main partial.hs:17:9-37 236 19 0.0 46.1 0.0 46.7
vstate Main partial.hs:11:5-10 237 190 0.0 0.0 0.0 0.6
main.initState Main partial.hs:25:9-40 238 0 0.0 0.6 0.0 0.6
CAF Debug.Trace <entire-module> 217 0 0.0 0.2 0.0 0.2
CAF GHC.Conc.Signal <entire-module> 206 0 0.0 0.6 0.0 0.6
CAF GHC.IO.Encoding <entire-module> 189 0 0.0 3.1 0.0 3.1
CAF GHC.IO.Encoding.Iconv <entire-module> 187 0 0.0 0.2 0.0 0.2
CAF GHC.IO.Handle.FD <entire-module> 178 0 0.0 32.3 0.0 32.3
CAF GHC.IO.Handle.Text <entire-module> 176 0 0.0 0.1 0.0 0.1
main Main partial.hs:(24,1)-(35,16) 229 0 0.0 10.7 0.0 10.7
Here is the profiling report for variant d
(slow; without -O2
):
Mon Aug 13 15:25 2018 Time and Allocation Profiling Report (Final)
partial +RTS -p -RTS
total time = 1.48 secs (1480 ticks @ 1000 us, 1 processor)
total alloc = 1,384,174,472 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
step Main partial.hs:(15,1)-(21,60) 95.7 98.8
main.initState Main partial.hs:25:9-40 3.0 1.2
vstate Main partial.hs:11:5-10 1.4 0.0
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 114 0 0.0 0.0 100.0 100.0
CAF Main <entire-module> 227 0 0.0 0.0 100.0 100.0
main Main partial.hs:(24,1)-(35,16) 228 1 0.0 0.0 100.0 100.0
vstate Main partial.hs:11:5-10 230 1048575 1.4 0.0 100.0 100.0
main.initState Main partial.hs:25:9-40 236 0 3.0 1.2 3.0 1.2
main.res Main partial.hs:32:9-36 234 0 0.0 0.0 95.7 98.8
step Main partial.hs:(15,1)-(21,60) 235 0 95.7 98.8 95.7 98.8
main.initState Main partial.hs:25:9-40 233 1 0.0 0.0 0.0 0.0
main.res Main partial.hs:32:9-36 231 1 0.0 0.0 0.0 0.0
step Main partial.hs:(15,1)-(21,60) 232 19 0.0 0.0 0.0 0.0
CAF Debug.Trace <entire-module> 217 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 206 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding <entire-module> 189 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv <entire-module> 187 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD <entire-module> 178 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.Text <entire-module> 176 0 0.0 0.0 0.0 0.0
main Main partial.hs:(24,1)-(35,16) 229 0 0.0 0.0 0.0 0.0
Here are some notes/guesses/questions on why this is happening:
- What's the relationship of the degrading performance to the increasing "hello" count? The pathological version seems to output one more "hello" with each additional step. Why?
- I'm aware that polymorphism in Haskell is slow, as detailed in this StackOverflow question. That might be part of the problem, since the pathological behavior goes away when
vstate
is monomorphized asvstate :: Float
. But I don't see why the lack of a let-binding, as in location (2), would cause such bad time/space performance. - This is the minimal version of a performance bug in a larger codebase, which we fixed by "monomorphizing" floating-type numbers using
realToFrac
(the commit is here in case anyone is curious). I know that compiling with-O2
fixes the behavior in the minimal example, but I tried it in the larger codebase and it does not fix the performance issues. (The reason we need rank-2 polymorphism in the larger codebase is to use thead
library for autodiff.) Is there a more principled fix than usingrealToFrac
, such as an inline specialization that can be applied?
forall a . (Fractional a) => a
is a function type.It has two arguments, a type
(a :: *)
and an instance with typeFractional a
. Whenever you see=>
, it's a function operationally, and compiles to a function in GHC's core representation, and sometimes stays a function in machine code as well. The main difference between->
and=>
is that arguments for the latter cannot be explicitly given by programmers, and they are always filled in implicitly by instance resolution.Let's see the fast
step
first:Here,
vs
has an undeterminedFractional
type which defaults toDouble
. If you turn on-Wtype-defaults
warning, GHC will point this out to you. Sincevs :: Double
, it's just a numeric value, which is captured by the returned closure. Right,vs `seq` f
is a function, since it has a functional typeforall a . (Fractional a) => a
, and it's desugared to an actual lambda expression by GHC. This lambda abstracts over the two arguments, capturesvs
as free variable, and passes along both arguments tof
.So, each
step
creates a new function closure which captures avs :: Double
. If we callstep
three times, we get three closures with threeDouble
s inside, each closure referring to the previous one. Then, when we writevstate (step $ step $ step initState)
, we again default toDouble
, and GHC calls this closure with theFractional Double
instance. All thevs
-es call previous closures withFractional Double
, but everyvs
is evaluated only once, because they are regular lazyDouble
values which are not recomputed.However, if we enable
NoMonomorphismRestriction
,vs
is generalized toforall a. Fractional a => a
, so it becomes a function as well, and its calls are not memoized anymore. Hence, in this case the fast version behaves the same as the slow version.Now, the slow
step
:This has exponential number of calls in the number of steps, because
step f
callsf
twice, and without optimizations no computation is shared, because both calls occur under a lambda. In(trace "hello" f) `seq` f
, the first call tof
defaults toFractional Double
, and the second call just passes along the implicitFractional a
instance as before.If we switch on optimization, GHC observes that the first
f
call does not depend on the function parameters, and floats outtrace "hello" f
to a let-binding, resulting in pretty much the same code as in the fast version.