I took a bit of a long break from playing with Haskell, and I'm starting to get back into it. I'm definitely still learning my way around the language. I've realized that one of the things that has always made me nervous/uncomfortable when writing Haskell is that I don't have a strong grasp on how to craft algorithms that are both idiomatic and performant. I realize that "premature optimization is the root of all evil", but slow code will have to be dealt with eventually, and I just can't shake my preconceived notion that languages this high-level must be super slow.
So, in that vein, I started playing with test cases. One of them was a naïve, straightforward implementation of the classical 4th-order Runge-Kutta method, applied to the fairly trivial IVP dy/dt = -y, y(0) = 1, whose solution is y = e^(-t).
I wrote a completely straightforward implementation in both Haskell and C (posted below). The Haskell version was incredibly succinct and gave me warm fuzzies on the inside when I looked at it, but the C version (which actually wasn't horrible to parse at all) was over twice as fast.
I realize that it isn't 100% fair to compare the performance of two different languages, and that until the day we all die C will most likely always hold the crown as the king of performance, especially hand-optimized C code. I'm not trying to get my Haskell implementation to run just as fast as my C implementation. But I'm pretty certain that if I were more cognizant of what I was doing, I could eke a bit more speed out of this particular Haskell implementation.
The Haskell version was compiled with -O2 under GHC 7.6.3 on OS X 10.8.4; the C version was compiled with Clang with no flags. Measured with time, the Haskell version averaged around 0.016 seconds and the C version around 0.006 seconds.
These timings take into account the entire running time of the binary, including output to stdout, which obviously accounts for some of the overhead. Still, I did do some profiling on the GHC binary by recompiling with -prof -auto-all and running with +RTS -p, and I also looked at the GC stats with +RTS -s. I didn't understand all that much of what I saw, but it seemed that my GC wasn't out of control, though it could probably be reined in a little (GC around 5%; productivity ~93% of user time, ~85% of total elapsed), and that most of the productive time was spent in the function iterateRK, which I knew would be slow when I wrote it, though it wasn't immediately obvious to me how to go about cleaning it up. I realize that I'm probably incurring a penalty from my use of a list, both in the constant consing and in the laziness when dumping the results to stdout.
What exactly am I doing wrong? What library functions or monadic wizardry am I tragically unaware of that I could be using to clean up iterateRK? What are some good resources for learning how to be a GHC profiling rockstar?
RK.hs
rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' h t y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
    where k1 = y' t y
          k2 = y' (t + h/2) (y + ((h/2) * k1))
          k3 = y' (t + h/2) (y + ((h/2) * k2))
          k4 = y' (t + h) (y + (h * k3))

iterateRK y' h t0 y0 = y0:(iterateRK y' h t1 y1)
    where t1 = t0 + h
          y1 = rk4 y' h t0 y0

main = do
    let y' t y = -y
    let h = 1e-3
    let y0 = 1.0
    let t0 = 0
    let results = iterateRK y' h t0 y0
    (putStrLn . show) (take 1000 results)
RK.c
#include <stdio.h>

#define ITERATIONS 1000

double rk4(double f(double t, double x), double h, double tn, double yn)
{
    double k1, k2, k3, k4;

    k1 = f(tn, yn);
    k2 = f((tn + h/2), yn + (h/2 * k1));
    k3 = f((tn + h/2), yn + (h/2 * k2));
    k4 = f(tn + h, yn + h * k3);

    return yn + (h/6) * (k1 + 2*k2 + 2*k3 + k4);
}

double expDot(double t, double x)
{
    return -x;
}

int main()
{
    double t0, y0, tn, yn, h, results[ITERATIONS];
    int i;

    h = 1e-3;
    y0 = 1.0;
    t0 = 0.0;
    tn = t0;  /* start time; this was previously left uninitialized */
    yn = y0;

    for(i = 0; i < ITERATIONS; i++)
    {
        results[i] = yn;
        yn = rk4(expDot, h, tn, yn);
        tn += h;
    }

    for(i = 0; i < ITERATIONS; i++)
    {
        printf("%.10lf", results[i]);
        if(i != ITERATIONS - 1)
            printf(", ");
        else
            printf("\n");
    }

    return 0;
}
Running your program with the iteration count increased gives a stack overflow. This is probably caused by too much laziness. Looking at the heap profile broken down by type (a profiling build run with +RTS -hy) shows memory usage growing linearly with the number of steps.
(Note: I modified your program as leftaroundabout pointed out)
This doesn't look good: you shouldn't require linear space for your algorithm. You seem to be holding on to your Double values longer than required. Making them strict solves the issue.
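A minimal sketch of that change, assuming the fix is bang patterns on the arguments of iterateRK (the exact modification may have looked different):

{-# LANGUAGE BangPatterns #-}

-- Forcing t0 and y0 at every step keeps a chain of unevaluated thunks
-- from building up. The time variable is the worst offender: the output
-- never demands it, so without the bang it accumulates as t0 + h + h + ...
iterateRK :: (Double -> Double -> Double) -> Double -> Double -> Double -> [Double]
iterateRK y' h !t0 !y0 = y0:(iterateRK y' h t1 y1)
    where t1 = t0 + h
          y1 = rk4 y' h t0 y0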
With this modification, the new heap profile looks much better: the memory usage is much lower. Running with +RTS -sstderr also confirms that we only spend about 2.5% of the total time in the garbage collector after the modification.
Now the Haskell version is still about 40% slower than the C one (comparing user time).
Increasing the number of iterations and using a heap-allocated array for the result storage in C lowers the difference once more, to only about 9%.
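For reference, here is a sketch of that C variant; the heap buffer is the only substantive change, and the ITERATIONS value of 1000000 is just an illustrative guess at "increased":

#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 1000000

double rk4(double f(double t, double x), double h, double tn, double yn)
{
    double k1 = f(tn, yn);
    double k2 = f(tn + h/2, yn + h/2 * k1);
    double k3 = f(tn + h/2, yn + h/2 * k2);
    double k4 = f(tn + h, yn + h * k3);

    return yn + (h/6) * (k1 + 2*k2 + 2*k3 + k4);
}

double expDot(double t, double x)
{
    return -x;
}

int main()
{
    double h = 1e-3, tn = 0.0, yn = 1.0;
    /* The result buffer lives on the heap, so a large ITERATIONS count
       no longer overflows the stack. */
    double *results = malloc(ITERATIONS * sizeof *results);
    int i;

    if (results == NULL)
        return 1;

    for (i = 0; i < ITERATIONS; i++) {
        results[i] = yn;
        yn = rk4(expDot, h, tn, yn);
        tn += h;
    }

    for (i = 0; i < ITERATIONS; i++)
        printf(i == ITERATIONS - 1 ? "%.10f\n" : "%.10f, ", results[i]);

    free(results);
    return 0;
}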
But we can still do better. Using the stream-fusion package, we can eliminate the intermediate list completely while still keeping the decoupling between stepping and output.
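A sketch of the full program with that optimization included, assuming the drop-in list API from Data.List.Stream in the stream-fusion package (the original code may have been organized differently):

{-# LANGUAGE BangPatterns #-}

import qualified Data.List.Stream as S  -- drop-in list API from stream-fusion

rk4 :: (Double -> Double -> Double) -> Double -> Double -> Double -> Double
rk4 y' h t y = y + (h/6) * (k1 + 2*k2 + 2*k3 + k4)
    where k1 = y' t y
          k2 = y' (t + h/2) (y + (h/2) * k1)
          k3 = y' (t + h/2) (y + (h/2) * k2)
          k4 = y' (t + h) (y + h * k3)

main :: IO ()
main = print (S.take 1000 (S.unfoldr step (0, 1)))
    where
      h = 1e-3
      y' _ y = -y
      -- take/unfoldr fuse, so no intermediate list is ever materialized
      step (!t, !y) = Just (y, (t + h, rk4 y' h t y))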
I compiled with -O2 as before.
With that, we're now even a bit faster than C, because we do no allocations at all. To do a similar transformation to the C program, we have to merge the two loops into one and lose the nice abstraction, and even then it's only about as fast as the Haskell version.
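For completeness, a sketch of that merged-loop C version, computing and printing each value in a single pass with no result array (again illustrative, not necessarily the exact code that was timed):

#include <stdio.h>

#define ITERATIONS 1000000

double rk4(double f(double t, double x), double h, double tn, double yn)
{
    double k1 = f(tn, yn);
    double k2 = f(tn + h/2, yn + h/2 * k1);
    double k3 = f(tn + h/2, yn + h/2 * k2);
    double k4 = f(tn + h, yn + h * k3);

    return yn + (h/6) * (k1 + 2*k2 + 2*k3 + k4);
}

double expDot(double t, double x)
{
    return -x;
}

int main()
{
    double h = 1e-3, tn = 0.0, yn = 1.0;
    int i;

    /* Stepping and printing happen in the same loop, so no buffer is
       needed at all: the moral equivalent of the fused Haskell version. */
    for (i = 0; i < ITERATIONS; i++) {
        printf(i == ITERATIONS - 1 ? "%.10f\n" : "%.10f, ", yn);
        yn = rk4(expDot, h, tn, yn);
        tn += h;
    }

    return 0;
}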