I took a bit of a long break from playing with Haskell, and I'm starting to get back in to it. I'm definitely still learning my way around the language. I've realized that one of the things that has always made me nervous/uncomfortable when writing Haskell is that I don't have a strong grasp on how to craft algorithms that are both idiomatic and performant. I realize that "premature optimization is the root of all evil", but similarly slow code will have to be dealt with eventually and the I just can't get rid of my preconceived notions about languages that are so high-level being super slow.

So, in that vein, I started playing with test cases. One of them that I was working on was a naïve, straight-forward implementation of the classical 4th Order Runge-Kutta method, applied to the fairly trivial IVP dy/dt = -y; y(0) = 1, which gives y = e^-t. I wrote a completely straight forward implementation in both Haskell and C (which I'll post in a moment). The Haskell version was incredibly succinct and gave me warm fuzzies on the inside when I looked at it, but the C version (which actually wasn't horrible to parse at all) was over twice as fast.

I realize that it isn't 100% fair to compare the performance of 2 different languages; and that until the day we all die C will most likely always hold the crown as the king of performance, especially hand-optimized C code. I'm not trying to get my Haskell implementation to run just as fast as my C implementation. But I'm pretty certain that if I was more cognizant of what I was doing then I could eek a bit more speed out of this particular Haskell implementation.

The Haskell version was compiled with -02 under GHC 7.6.3 on OS X 10.8.4, the C version was compiled with Clang and I gave it no flags. The Haskell version averaged around 0.016 seconds when tracked with time, and the C version around 0.006 seconds.

These timings take in to account the entire running time of the binary, including output to stdout, which obviously accounts for some of the overhead, but I did do some profiling on the GHC binary by recompiling with -prof -auto-all and running with +RTS -p and also looking at the GC stats with +RTS -s. I didn't really understand all that much of what I saw, but it seemed to be that my GC wasn't out of control though could probably get reined in a little bit (5%, Productivity at ~93% User, ~85% total elapsed) and that most of the productive time was spent in the function iterateRK, which I knew would be slow when I wrote it but it wasn't immediately obvious to me how to go about cleaning it up. I realize that I'm probably incurring a penalty in my usage of a List, both in the constant consing and the laziness in dumping the results to stdout.

What exactly am I doing wrong? What library functions or Monadic wizardry am I tragically unaware of that I could be using to clean up iterateRK? What are some good resources for learning how to be a GHC profiling rockstar?

RK.hs

rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' h t y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
  where k1 = y' t y
        k2 = y' (t + h/2) (y + ((h/2) * k1))
        k3 = y' (t + h/2) (y + ((h/2) * k2))
        k4 = y' (t + h) (y + (h * k3))

iterateRK y' h t0 y0 = y0:(iterateRK y' h t1 y1)
  where t1 = t0 + h
        y1 = rk4 y' h t0 y0

main = do
  let y' t y = -y
  let h = 1e-3
  let y0 = 1.0
  let t0 = 0
  let results = iterateRK y' h t0 y0
  (putStrLn . show) (take 1000 results)

RK.c

#include<stdio.h>

#define ITERATIONS 1000

double rk4(double f(double t, double x), double h, double tn, double yn)
{
  double k1, k2, k3, k4;

  k1 = f(tn, yn);
  k2 = f((tn + h/2), yn + (h/2 * k1));
  k3 = f((tn + h/2), yn + (h/2 * k2));
  k4 = f(tn + h, yn + h * k3);

  return yn + (h/6) * (k1 + 2*k2 + 2*k3 + k4);
}

double expDot(double t, double x)
{
  return -x;
}

int main()
{
  double t0, y0, tn, yn, h, results[ITERATIONS];
  int i;

  h = 1e-3;
  y0 = 1.0;
  t0 = 0.0;
  yn = y0;

  for(i = 0; i < ITERATIONS; i++)
  {
    results[i] = yn;

    yn = rk4(expDot, h, tn, yn);
    tn += h;
  }

  for(i = 0; i < ITERATIONS; i++)
  {
    printf("%.10lf", results[i]);
    if(i != ITERATIONS - 1)
      printf(", ");
    else
      printf("\n");
  }

  return 0;
}
3

There are 3 answers

1
bennofs On BEST ANSWER

Using your program with increased size gives a stack overflow:

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

This is probably caused by too much laziness. Looking at the heap profile broken down by type, you get the following:

Heao profile by type

(Note: I modified your program as leftaroundabout pointed out)

This doesn't look good. You shouldn't require linear space for your algorithm. You seem to be holding your Double values longer than required. Makeing the strict solves the issue:

{-# LANGUAGE BangPatterns #-}

iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' !h !t0 !y0 = y0:(iterateRK y' h t1 y1)
  where t1 = t0 + h
        y1 = rk4 y' h t0 y0

With this modification, the new heap profile looks like this:

New heap profile

This looks much better, the memory usage is much lower. -sstderr` also confirms that we only spend 2.5% of the total time in the garbage collector after the modification:

%GC     time       2.5%  (2.9% elapsed)

Now, the haskell version is still about 40% slower than the C one (using user time):

$ time ./tesths; time ./testc     
2.47e-321
./tesths  0,73s user 0,01s system 86% cpu 0,853 total
2.470328e-321
./testc  0,51s user 0,01s system 95% cpu 0,549 total

Increasing the number of iterations and using a heap-allocated array for the result storage in C lowers the difference once more:

time ./tesths; time ./testc
2.47e-321
./tesths  18,25s user 0,04s system 96% cpu 19,025 total
2.470328e-321
./testc  16,98s user 0,14s system 98% cpu 17,458 total

This is only a difference of about 9%.


But we can still do better. Using the stream-fusion package, we can eliminate the list completely while still keeping the decoupling. Here is the full code with that optimization included:

{-# LANGUAGE BangPatterns #-}
import qualified Data.List.Stream as S

rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' !h !t !y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
  where k1 = y' t y
        k2 = y' (t + h/2) (y + ((h/2) * k1))
        k3 = y' (t + h/2) (y + ((h/2) * k2))
        k4 = y' (t + h) (y + (h * k3))

iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' h = curry $ S.unfoldr $ \(!t0, !y0) -> Just (y0, (t0 + h, rk4 y' h t0 y0))

main :: IO ()
main = do
  let y' t y = -y
  let h = 1e-3
  let y0 = 1.0
  let t0 = 0
  let results = iterateRK y' h t0 y0
  print $ S.head $ (S.drop (pred 10000000) results)

I comiled with:

$ ghc -O2 ./test.hs -o tesths -fllvm

Here are the timings:

$ time ./tesths; time ./testc                    
2.47e-321
./tesths  15,85s user 0,02s system 97% cpu 16,200 total
2.470328e-321
./testc  16,97s user 0,18s system 97% cpu 17,538 total

Now we're even a bit faster than C, because we do no allocations. To do a similar transformation to the C program, we have to merge the two loops into one and loose the nice abstraction. Even then, it's only as fast as haskell:

$ time ./tesths; time ./testc
2.47e-321
./tesths  15,86s user 0,01s system 98% cpu 16,141 total
2.470328e-321
./testc  15,88s user 0,02s system 98% cpu 16,175 total
1
leftaroundabout On

The timings of your programs have very little to do with the languages' performance, and everything with terminal IO. Remove the printing of each step (BTW, putStrLn . show ≡≡ print) from your Haskell program, and you'll get

$ time RK-hs
1.0

real 0m0.004s
user 0m0.000s
sys 0m0.000s

... which isn't really significant, though – 1000 steps is far to little. With

main :: IO ()
main = do
    let y' t y = -y
        h = 1e-7
        y0 = 1.0
        t0 = 0
        results = iterateRK y' h t0 y0
    print . head $ drop 10000000 results

you get

$ time RK-hs +RTS -K100M
0.36787944117145965

real 0m0.653s
user 0m0.572s
sys 0m0.076s

while the equivalent in C has

$ time RK-c
Segmentation fault (core dumped)

oh great... ...but as you see, I had to increase the stack size for the Haskell program as well. Omitting the storage of the results in a stack-allocated array, we have

$ time RK-c
0.3678794412

real 0m0.152s
user 0m0.148s
sys 0m0.000s

so this is indeed faster, significantly now, than the Haskell version.

When even C has memory problems storing a whole lot of intermediate results (if you put it on the stack), this is worse in Haskell: each list node has to be heap-allocated seperately, and while allocation is much faster in Haskell's garbage-collected heap than in C's heap, it's still slow.

1
Petr On

I think that in order to make a fair comparison, you should exclude program initialization as well as printing the output (or measure it separately). By default, Haskell uses Strings which are lists of Chars and this makes output quite slow. Also Haskell has a complex runtime whose initialization can bias the results a lot for such a short task. You can use criterion library for that:

import Criterion.Main

-- ...

benchmarkIRK n =
    let y' t y = -y
        h      = 1e-3
        y0     = 1.0
        t0     = 0
    in take n (iterateRK y' h t0 y0)

benchmarkIRKPrint = writeFile "/dev/null" . show . benchmarkIRK

main = defaultMain
        [ bench "rk"      $ nf benchmarkIRK 1000
        , bench "rkPrint" $ nfIO (benchmarkIRKPrint 1000)
        ]

My measurements show that the actual computation takes something around 27 us, computing and printing takes around 350 us and running the whole program (without criterion) takes around 30 ms. So the actual computation takes just 1/1000 of the whole time and printing it just 1/100.

You should also measure your C program similarly, excluding any startup time and distinguishing what portion of time is consumed by computing and printing.