How to reliably compare runtime of Haskell and C?

1.1k views

I used the Criterion library to write benchmarks for my Haskell functions. Now I am implementing the same algorithm in C to compare its performance with the Haskell version. The question is: how can I do this reliably? Criterion does a lot of fancy stuff, like accounting for clock-call overhead and performing statistical analysis of the results. I suspect that if I simply time my C function, the numbers will not be comparable with the results reported by Criterion.

In his original post about Criterion, Bryan O'Sullivan writes: "It should even be easy to use criterion to benchmark C code and command line programs." The question is how? Takayuki Muranushi compares a C implementation of DFT with Haskell by spawning threads and calling the executable, but I fear that this adds a lot of extra overhead (creating a new thread, running the application, writing to stdout, and then reading it back), which would make the results incomparable. I also considered using the FFI, but again I feared that the additional overhead would make such a comparison unfair.

If there is no way to use Criterion to reliably benchmark C, then what approaches to C benchmarking would you recommend? I've read some questions here on SO, and it seems that there are many different functions for measuring system time, but they either only provide millisecond resolution or have a large call overhead.


1 Answer

Mikhail Glushenkov (BEST ANSWER)

FFI can be used in such a way that it doesn't add much overhead. Consider the following program (full code available here):

import Foreign
import Foreign.C.Types

foreign import ccall unsafe "mean" c_mean :: Ptr CInt -> CUInt -> IO CFloat

-- bufSize, sizeOfCInt and fillBuffer are defined in the full code linked above
main :: IO ()
main = do
  buf <- mallocBytes (bufSize * sizeOfCInt)
  fillBuffer buf 0
  m <- c_mean buf (fromIntegral bufSize)
  print $ realToFrac m

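The C side of the import is not shown here (it lives in the linked full code). A minimal definition consistent with the imported signature `Ptr CInt -> CUInt -> IO CFloat` might look like the following sketch — an assumption for illustration, not the answer's actual file:

```c
/* mean.c -- matches the Haskell import:
 *   foreign import ccall unsafe "mean" c_mean :: Ptr CInt -> CUInt -> IO CFloat
 * i.e. a C function of type: float mean(int *buf, unsigned n).
 */
float mean(int *buf, unsigned n)
{
    long long sum = 0;               /* wide accumulator avoids int overflow */
    for (unsigned i = 0; i < n; ++i)
        sum += buf[i];
    return n ? (float)sum / (float)n : 0.0f;
}
```

Compiled separately (e.g. `gcc -c -O2 mean.c`) and linked into the Haskell executable, this is the code the unsafe foreign call jumps to directly.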
The C call is compiled to the following Cmm:

s2ni_ret() { ... }
    c2qy:
        Hp = Hp + 12;
        if (Hp > I32[BaseReg + 92]) goto c2qC;
        _c2qD::I32 = I32[Sp + 4];
        (_s2m3::F32,) = foreign "ccall"
          mean((_c2qD::I32, PtrHint), (100,));

Here's the assembly:

s2ni_info:
.Lc2qy:
        addl $12,%edi
        cmpl 92(%ebx),%edi
        ja .Lc2qC
        movl 4(%ebp),%eax
        subl $4,%esp
        pushl $100
        pushl %eax
        ffree %st(0) ;ffree %st(1) ;ffree %st(2) ;ffree %st(3)
        ffree %st(4) ;ffree %st(5)
        call mean

So, if you mark your C import as unsafe and do all marshalling before measurement, your C call will be basically just an inline call instruction - the same as if you were doing all benchmarking in C. Here's what Criterion reports when I benchmark a C function that does nothing:

benchmarking c_nothing
mean: 13.99036 ns, lb 13.65144 ns, ub 14.62640 ns, ci 0.950
std dev: 2.306218 ns, lb 1.406215 ns, ub 3.541156 ns, ci 0.950
found 10 outliers among 100 samples (10.0%)
  9 (9.0%) high severe
variance introduced by outliers: 91.513%
variance is severely inflated by outliers

This is approximately 400 times smaller than the estimated clock resolution on my machine (~ 5.5 us). For comparison, here's the benchmark data for a function that computes the arithmetic mean of 100 integers:

benchmarking c_mean
mean: 184.1270 ns, lb 183.5749 ns, ub 185.0947 ns, ci 0.950
std dev: 3.651747 ns, lb 2.430552 ns, ub 5.885120 ns, ci 0.950
found 6 outliers among 100 samples (6.0%)
  5 (5.0%) high severe
variance introduced by outliers: 12.329%
variance is moderately inflated by outliers