Why Parallel.For gives only so little gain for this particular function?

185 views Asked by At

I'm having trouble understanding why my "concurrent" implementation of this http://www.codeproject.com/Tips/447938/High-performance-Csharp-byte-array-to-hex-string-t function has only ~20% performance gain.

For convenience here is the code from that site:

static readonly int[] toHexTable = new int[] {
    3145776, 3211312, 3276848, 3342384, 3407920, 3473456, 3538992, 3604528, 3670064, 3735600,
    4259888, 4325424, 4390960, 4456496, 4522032, 4587568, 3145777, 3211313, 3276849, 3342385,
    3407921, 3473457, 3538993, 3604529, 3670065, 3735601, 4259889, 4325425, 4390961, 4456497,
    4522033, 4587569, 3145778, 3211314, 3276850, 3342386, 3407922, 3473458, 3538994, 3604530,
    3670066, 3735602, 4259890, 4325426, 4390962, 4456498, 4522034, 4587570, 3145779, 3211315,
    3276851, 3342387, 3407923, 3473459, 3538995, 3604531, 3670067, 3735603, 4259891, 4325427,
    4390963, 4456499, 4522035, 4587571, 3145780, 3211316, 3276852, 3342388, 3407924, 3473460,
    3538996, 3604532, 3670068, 3735604, 4259892, 4325428, 4390964, 4456500, 4522036, 4587572,
    3145781, 3211317, 3276853, 3342389, 3407925, 3473461, 3538997, 3604533, 3670069, 3735605,
    4259893, 4325429, 4390965, 4456501, 4522037, 4587573, 3145782, 3211318, 3276854, 3342390,
    3407926, 3473462, 3538998, 3604534, 3670070, 3735606, 4259894, 4325430, 4390966, 4456502,
    4522038, 4587574, 3145783, 3211319, 3276855, 3342391, 3407927, 3473463, 3538999, 3604535,
    3670071, 3735607, 4259895, 4325431, 4390967, 4456503, 4522039, 4587575, 3145784, 3211320,
    3276856, 3342392, 3407928, 3473464, 3539000, 3604536, 3670072, 3735608, 4259896, 4325432,
    4390968, 4456504, 4522040, 4587576, 3145785, 3211321, 3276857, 3342393, 3407929, 3473465,
    3539001, 3604537, 3670073, 3735609, 4259897, 4325433, 4390969, 4456505, 4522041, 4587577,
    3145793, 3211329, 3276865, 3342401, 3407937, 3473473, 3539009, 3604545, 3670081, 3735617,
    4259905, 4325441, 4390977, 4456513, 4522049, 4587585, 3145794, 3211330, 3276866, 3342402,
    3407938, 3473474, 3539010, 3604546, 3670082, 3735618, 4259906, 4325442, 4390978, 4456514,
    4522050, 4587586, 3145795, 3211331, 3276867, 3342403, 3407939, 3473475, 3539011, 3604547,
    3670083, 3735619, 4259907, 4325443, 4390979, 4456515, 4522051, 4587587, 3145796, 3211332,
    3276868, 3342404, 3407940, 3473476, 3539012, 3604548, 3670084, 3735620, 4259908, 4325444,
    4390980, 4456516, 4522052, 4587588, 3145797, 3211333, 3276869, 3342405, 3407941, 3473477,
    3539013, 3604549, 3670085, 3735621, 4259909, 4325445, 4390981, 4456517, 4522053, 4587589,
    3145798, 3211334, 3276870, 3342406, 3407942, 3473478, 3539014, 3604550, 3670086, 3735622,
    4259910, 4325446, 4390982, 4456518, 4522054, 4587590
};

public static unsafe string ToHex1(byte[] source)
{
    fixed (int* hexRef = toHexTable)
    fixed (byte* sourceRef = source)
    {
        byte* s = sourceRef;
        int resultLen = (source.Length << 1);

        var result = new string(' ', resultLen);
        fixed (char* resultRef = result)
        {
            int* pair = (int*)resultRef;

            while (*pair != 0)
                *pair++ = hexRef[*s++];
            return result;
        }
    }
}

Here are my "improvements":

public static unsafe string ToHex1p(byte[] source)
{
    var chunks = Environment.ProcessorCount;
    var n = (int)Math.Ceiling(source.Length / (double)chunks);

    int resultLen = (source.Length << 1);

    var result = new string(' ', resultLen);

    Parallel.For(0, chunks, k =>
    {
        var l = Math.Min(source.Length, (k + 1) * n);
        fixed (char* resultRef = result) fixed (byte* sourceRef = source)
        {
            int from = n * k;
            int to = (int)resultRef + (l << 2);

            int* pair = (int*)resultRef + from;
            byte* s = sourceRef + from;
            while ((int)pair != to)
                *pair++ = toHexTable[*s++];
        }
    });

    return result;
}


Edit 1 This is how I time the functions:

var n = 0xff;
var s = new System.Diagnostics.Stopwatch();
var d = Enumerable.Repeat<byte>(0xce, (int)Math.Pow(2, 23)).ToArray();

s.Start();
for (var i = 0; i < n; ++i)
{
    Binary.ToHex1(d);
}
Console.WriteLine(s.ElapsedMilliseconds / (double)n);

s.Restart();
for (var i = 0; i < n; ++i)
{
    Binary.ToHex1p(d);
}
Console.WriteLine(s.ElapsedMilliseconds / (double)n);
2

There are 2 answers

0
theMayer On

I read in the commentary of the first answer (which, by the way, is more informative than the question and comments altogether) that the execution time for these tests is somewhere on the order of 25ms at a maximum.

There are a lot of things to say about this, but the first is "What a waste of good programmer time!"

This is a clear case where you are over-optimizing. On my first glance through the code, I thought "why in the world would anyone want to parallelism this?" You're doing bitwise operations, which are extremely rapid. There is not enough performance penalty in the first place to justify parallelism.

Now, onto your test discrepancies. The TPL is a complex library, and parallelism is an order of magnitude more complex than single-threading. As a result, there are a number of factors that come into play when you run a test, several of which (but by no means all) are mentioned in various comments. Every factor affects the results in an unpredictable way, and you can think of each factor as adding a little bit of variation. In the end, your test results don't have enough power to distinguish between the two, which is yet another indication of over-optimization.

As programmers, we need to remember as a general rule that our time is extremely expensive, especially when compared to computer time. Opportunities to achieve true, value-added performance gain are rare when programming, and typically center around very expensive processing or cases where you need to request from remote servers, etc. There are definitely times when it is needed, but this is not one of them.

8
Peter Duniho On

After playing around with your example some, I conclude that the bulk of the disparity in timing you're seeing is due to GC overhead, with initialization overhead in both scenarios being high enough to make performance differences relatively unimportant even once the GC overhead is removed from the test.

When I switch the order of the tests, the parallel one winds up faster than the non-parallel one. This is the first sign that the test isn't fair. :)

When I changed the tests so that I called GC.Collect() after each test, to ensure the GC was quiet during the subsequent one, the parallel version consistently came out ahead. But only barely so; the start-up time for each thread was in all cases over half the total duration of each parallel test.

As part of my testing, I modified the code so that it tracked the actual time spent in each thread of the For() version. Here, I found that the time spent in this code was about what I'd expect based on the non-parallel version (i.e. reasonably close to the time divided by the thread count).

(Of course, one could obtain this information using a profiler too).

Here's the output for the tests I ran with GC.Collect(). For the parallel test, this is also showing the start (relative to the overall test start time) and duration times for each thread.

Running the non-parallel version first, parallel version second:

Single-thread version: 00:00:00.6726813
Parallel version: 00:00:00.6270247
  Thread #0: start: 00:00:00.3343985, duration: 00:00:00.2925963
  Thread #1: start: 00:00:00.3345640, duration: 00:00:00.2805527

Single-thread version: 00:00:00.7027335
Parallel version: 00:00:00.5610246
  Thread #0: start: 00:00:00.3305695, duration: 00:00:00.2304486
  Thread #1: start: 00:00:00.3305857, duration: 00:00:00.2300950

Single-thread version: 00:00:00.6609645
Parallel version: 00:00:00.6143675
  Thread #0: start: 00:00:00.3391491, duration: 00:00:00.2750529
  Thread #1: start: 00:00:00.3391560, duration: 00:00:00.2705631

Single-thread version: 00:00:00.6655265
Parallel version: 00:00:00.6246624
  Thread #0: start: 00:00:00.3227595, duration: 00:00:00.2924611
  Thread #1: start: 00:00:00.3227831, duration: 00:00:00.3018066

Single-thread version: 00:00:00.6815009
Parallel version: 00:00:00.5707794
  Thread #0: start: 00:00:00.3227074, duration: 00:00:00.2480668
  Thread #1: start: 00:00:00.3227330, duration: 00:00:00.2478351

Running the parallel version first, non-parallel second:

Parallel version: 00:00:00.5807343
  Thread #0: start: 00:00:00.3397320, duration: 00:00:00.2409767
  Thread #1: start: 00:00:00.3398103, duration: 00:00:00.2408334
Single-thread version: 00:00:00.6974992

Parallel version: 00:00:00.5801044
  Thread #0: start: 00:00:00.3305571, duration: 00:00:00.2495409
  Thread #1: start: 00:00:00.3305746, duration: 00:00:00.2492993
Single-thread version: 00:00:00.7442493

Parallel version: 00:00:00.5845514
  Thread #0: start: 00:00:00.3454512, duration: 00:00:00.2352147
  Thread #1: start: 00:00:00.3454756, duration: 00:00:00.2389522
Single-thread version: 00:00:00.6542540

Parallel version: 00:00:00.5909125
  Thread #0: start: 00:00:00.3356177, duration: 00:00:00.2550365
  Thread #1: start: 00:00:00.3356250, duration: 00:00:00.2552392
Single-thread version: 00:00:00.7609139

Parallel version: 00:00:00.5777678
  Thread #0: start: 00:00:00.3440084, duration: 00:00:00.2337504
  Thread #1: start: 00:00:00.3440323, duration: 00:00:00.2329294
Single-thread version: 00:00:00.6596119

Lessons learned:

  • Performance testing is tricky, especially in the managed environment. Things like garbage collection and just-in-time compilation make it hard to compare apples-to-apples
  • The actual computational cost of converting bytes to characters is completely inconsequential compared to anything else a program might spend its time doing (such as preparing and invoking threads). This particular algorithm does not appear worth parallelizing; even though you do get a consistent improvement in speed, it's pretty marginal due to all the overhead around the actual computation.

One final note: another source of error in these kinds of tests is Intel's Hyperthreading. Or rather, if you test while using the Hyperthread-enabled CPU count, you'll get misleading results. For example, I tested this on my Intel i5-based laptop, which reports having 4 cores. But running four threads won't come close to a 4x speed-up over a non-parallel implementation, while running two threads will be close to a 2x speed-up (for the right problem). That's why even though my computer reports 4 CPUs, I tested with only 2 threads.

Here, there's so much other misleading overhead in this test that I don't think Hyperthreading makes a big difference. But it's something to watch out for.