V8 and big string comparison performance affected a lot by hashing the strings?

73 views Asked by At

Context: I was building a piece of the back end in node end and optimizing for bandwidth. Basically I was trying to not send thumbnails (stored as 10KiB strings) that the client already had and yada yada, for a brief time I resorted to comparing the thumbnail data directly since I couldn't be sure whether they had been updated but have moved to a different solution since. This is just a curiosity and bears no real relevance anymore.

Okay so I've discovered this behaviour by V8 where large string comparisons are slow by default but upon (presumably) hashing these strings, for instance by sticking them into a Set, the comparisons speed up a lot. I've confirmed this to be happening in both node as well as the chromium based electron front end.

Example:

    (()=>{
        //  making 10k character strings
        const big_strings = [];
        const palette = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        for (let i = 0; i < 100; i++) {
            let big_string = "";
            for (let i = 0; i < 10_000; i++) {
                big_string += palette[Math.floor(Math.random() * palette.length)];
            }
            big_strings.push(big_string);
        }

        //  comparison functions
        function cmp1(a, b) {
            return a === b;
        }

        function cmp2(a, b) {
            const lol = new Set();
            lol.add(a);
            lol.add(b);
            return a === b;
        }

        //  tests
        let sorted_1, sorted_2;

        console.time("Test 1");
        for (let i = 0; i < 100; i++) {
            sorted_1 = big_strings.slice();
            sorted_1.sort(cmp1);
        }
        console.timeEnd("Test 1");

        console.time("Test 1");
        for (let i = 0; i < 100; i++) {
            sorted_1 = big_strings.slice();
            sorted_1.sort(cmp1);
        }
        console.timeEnd("Test 1");

        console.time("Test 1");
        for (let i = 0; i < 100; i++) {
            sorted_1 = big_strings.slice();
            sorted_1.sort(cmp1);
        }
        console.timeEnd("Test 1");

        console.time("Test 2");
        for (let i = 0; i < 100; i++) {
            sorted_2 = big_strings.slice();
            sorted_2.sort(cmp2);
        }
        console.timeEnd("Test 2");

        console.time("Test 1");
        for (let i = 0; i < 100; i++) {
            sorted_1 = big_strings.slice();
            sorted_1.sort(cmp1);
        }
        console.timeEnd("Test 1");

        //  sanity check
        let good = 0;
        const total = Math.max(sorted_1.length, sorted_2.length);
        for (let i = 0; i < total; i++) {
            if (sorted_1[i] === sorted_2[i]) {
                good++;
            } else {
                throw("Results do not match.");
            }
        }
        console.log(`Sanity check OK ${good} / ${total}`);


    })();

In the example, there are two comparison functions, one that simply does a === b and another that first sticks both a and b into a disposable Set. I can run tests with the first comparison function as many times as I like (3 in this example) and it will always be slow. Then I run the test with the second comparison function just once and afterwards, all tests on those particular strings will be blazing fast.

What gives with such behaviour? Why won't V8 hash the strings itself in the background?

1

There are 1 answers

7
jmrk On

(V8 developer here.)

It's not the hashing (or lack thereof), it's the flattening (or lack thereof) of strings that have been built up with thousands of += 'x' single-character concatenations.

With a couple of exceptions, V8 usually handles string concatenations by creating a "cons string" (sometimes also called "rope"), which is essentially a pair of pointers at its two halves; with the += 'x' strategy you're hence creating a very long linked list of single-character strings. Some operations require (and therefore cause) flattening of such strings, others don't. Always flattening cons strings automatically (doesn't matter whether it's in the background or not) would have undesirable costs in terms of CPU and memory.

Here's a way to speed up the test massively by constructing the strings more efficiently:

(()=>{
  //  making 10k character strings
  const big_strings = [];
  const palette = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
  for (let i = 0; i < 100; i++) {
    // CHANGED: [], .push(...), .join("")
    let big_string = [];
    for (let i = 0; i < 10_000; i++) {
      big_string.push(palette[Math.floor(Math.random() * palette.length)]);
    }
    big_strings.push(big_string.join(""));
    // End of changes, everything below is unchanged.
  }

  //  comparison functions
  function cmp1(a, b) {
    return a === b;
  }

  function cmp2(a, b) {
    const lol = new Set();
    lol.add(a);
    lol.add(b);
    return a === b;
  }

  //  tests
  let sorted_1, sorted_2;

  console.time("Test 1");
  for (let i = 0; i < 100; i++) {
    sorted_1 = big_strings.slice();
    sorted_1.sort(cmp1);
  }
  console.timeEnd("Test 1");

  console.time("Test 1");
  for (let i = 0; i < 100; i++) {
    sorted_1 = big_strings.slice();
    sorted_1.sort(cmp1);
  }
  console.timeEnd("Test 1");

  console.time("Test 1");
  for (let i = 0; i < 100; i++) {
    sorted_1 = big_strings.slice();
    sorted_1.sort(cmp1);
  }
  console.timeEnd("Test 1");

  console.time("Test 2");
  for (let i = 0; i < 100; i++) {
    sorted_2 = big_strings.slice();
    sorted_2.sort(cmp2);
  }
  console.timeEnd("Test 2");

  console.time("Test 1");
  for (let i = 0; i < 100; i++) {
    sorted_1 = big_strings.slice();
    sorted_1.sort(cmp1);
  }
  console.timeEnd("Test 1");

  //  sanity check
  let good = 0;
  const total = Math.max(sorted_1.length, sorted_2.length);
  for (let i = 0; i < total; i++) {
    if (sorted_1[i] === sorted_2[i]) {
      good++;
    } else {
      throw("Results do not match.");
    }
  }
  console.log(`Sanity check OK ${good} / ${total}`);

})();

Beware of microbenchmarks, because they will mislead you! If the strings in your real application are created differently, you will (quite likely) take away incorrect conclusions from your original test.