All benchmarks were run on either Icelake or Whiskey Lake (part of the Skylake family).

Summary

I am seeing a strange phenomenon: when a loop transitions from running out of the Uop Cache to running out of the LSD (Loop Stream Detector), there is a spike in Branch Misses that can cause severe performance hits. I tested on both Icelake and Whiskey Lake by benchmarking a nested loop whose outer loop has a body large enough that the whole thing does not fit in the LSD, but whose inner loop is small enough to fit in the LSD.

Basically, once the inner loop reaches some iteration count, uop delivery seems to switch from idq.dsb_uops (Uop Cache) to lsd.uops (LSD), and at that point there is a large increase in branch-misses (without a corresponding jump in branches), causing a severe performance drop. Note: This only seems to occur for nested loops. Travis Downs' Loop Test, for example, does not show any meaningful variation in branch misses. AFAICT this has something to do with when a loop transitions from running out of the Uop Cache to running out of the LSD.

Questions

  1. What is happening when the loop transitions from running out of the Uop Cache to running out of the LSD that causes this spike in Branch Misses?

  2. Is there a way to avoid this?

Benchmark

This is the minimum reproducible example I could come up with:

Note: If the .p2align statements are removed, both loops will fit in the LSD and there will not be a transition.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))

static const uint64_t outer_N = (1UL << 24);


static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt;
    asm volatile(
        ".p2align 12\n"
        "movl   %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align   6\n"
        "1:\n"
        "movl   %k[inner_N], %k[inner_loop_cnt]\n"
        // Extra align surrounding inner loop so that the entire thing
        // doesn't execute out of LSD.
        ".p2align   10\n"
        "2:\n"
        "decl   %k[inner_loop_cnt]\n"
        "jnz    2b\n"
        ".p2align   10\n"
        "decl   %k[outer_loop_cnt]\n"
        "jnz    1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
        :);
}
int
main(int argc, char ** argv) {
    assert(argc > 1);
    uint64_t inner_N = atoi(argv[1]);
    bench(inner_N);
}

Compile: gcc -O3 -march=native -mtune=native <filename>.c -o <filename>

Run Icelake: sudo perf stat -C 0 --all-user -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>

Run Whiskey Lake: sudo perf stat -C 0 -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>

Graphs

Edit: x label is N iterations of inner loop.

Below is a graph of Branch Misses, Branches, and LSD Uops.

Generally you can see that 1) there is no corresponding jump in Branches, 2) the number of added Branch Misses stabilizes at a constant, and 3) there is a strong relationship between Branch Misses and LSD Uops.

Icelake Graph:

icl-branch

Whiskey Lake Graph:

skl-branch

Below is a graph of Branch Misses, Cycles, and LSD Uops for Icelake only, as performance is not affected nearly as much on Whiskey Lake:

icl-cycles

Analysis

Hard numbers below.

For Icelake, starting at N = 22 and finishing at N = 27, there is some fluctuation in the number of uops coming from the LSD vs. the Uop Cache, and during that time Branch Misses increases by roughly 3 orders of magnitude, from 10^4 to 10^7. During this period Cycles also increases by a factor of 2. For all N > 27, Branch Misses stays at around 1.67 x 10^7 (roughly outer_N). For N = [17, 40], branches continues to increase only linearly.

The results for Whiskey Lake look different in that 1) the data begins fluctuating at N = 35 and continues to fluctuate until N = 49, and 2) there is less of a performance impact and more fluctuation in the data. That being said, the increase in Branch Misses corresponding with the transition from uops being fed by the Uop Cache to being fed by the LSD still exists.
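The linear growth in branches can be checked against a simple count. This is a minimal sketch, assuming each outer iteration retires N inner jnz branches, one outer jnz, and two jmp instructions that the assembler emits to skip the large .p2align 10 padding. The "+ 3" is inferred from the measured numbers in the Results tables, not taken from any documentation, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same outer iteration count as the benchmark (1 << 24). */
static const uint64_t OUTER_N = (1ULL << 24);

/* Assumed branches per outer iteration:
 * inner_n inner jnz + 1 outer jnz + 2 jmp over the .p2align padding.
 * The "+ 3" is an inference from the measured data. */
static uint64_t
expected_branches(uint64_t inner_n) {
    assert(inner_n > 0);
    return OUTER_N * (inner_n + 3);
}

/* Average number of branch misses per outer-loop iteration. */
static double
misses_per_outer_iter(uint64_t branch_misses) {
    return (double)branch_misses / (double)OUTER_N;
}
```

For N = 28 this gives 16777216 * 31 = 520093696 expected branches against 520114412 measured on Icelake, and the 16642396 measured misses work out to about 0.99 per outer iteration.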

Results

Data is the mean of 25 runs.

Icelake Results:

N cycles branches branch-misses idq.ms_uops idq.dsb_uops lsd.uops
1 33893260 67129521 1590 43163 115243 83908732
2 42540891 83908928 1762 49023 142909 100690381
3 50725933 100686143 1782 47656 142506 117440256
4 67533597 117461172 1655 52538 186123 134158311
5 68022910 134238387 1711 53405 204481 150954035
6 85543126 151018722 1924 62445 141397 167633971
7 84847823 167799220 1935 60248 160146 184563523
8 101532158 184570060 1709 60064 361208 201100179
9 101864898 201347253 1773 63827 459873 217780207
10 118024033 218124499 1698 59480 177223 234834304
11 118644416 234908571 2201 62514 422977 251503052
12 134627567 251678909 1679 57262 133462 268435650
13 285607942 268456135 1770 74070 285032524 315423
14 302717754 285233352 1731 74663 302101097 15953
15 321627434 302010569 81796 77831 319192830 1520819
16 337876736 318787786 71638 77056 335904260 1265766
17 353054773 335565563 1798 79839 352434780 15879
18 369800279 352344970 1978 79863 369229396 16790
19 386921048 369119438 1972 84075 385984022 16115
20 404248461 385896655 29454 85348 402790977 510176
21 421100725 402673872 37598 83400 419537730 729397
22 519623794 419451095 4447767 91209 431865775 97827331
23 702206338 436228323 12603617 109064 427880075 327661987
24 710626194 453005538 12316933 106929 432926173 344838509
25 863214037 469782765 14641887 121776 428085132 614871430
26 761037251 486559974 13067814 113011 438093034 418124984
27 832686921 503337195 16381350 113953 421924080 556915419
28 854713119 520114412 16642396 124448 420515666 598907353
29 869873144 536891629 16572581 119280 421188631 629696780
30 889642335 553668847 16717446 120116 420086570 668628871
31 906912275 570446064 16735759 126094 419970933 702822733
32 923023862 587223281 16706519 132498 420332680 735003892
33 940308170 604000498 16744992 124365 419945191 770185745
34 957075000 620777716 16726856 133675 420215897 802779119
35 974557538 637554932 16763071 134871 419764866 838012637
36 991110971 654332149 16772560 130903 419641144 872037131
37 1008489575 671109367 16757219 138788 419900997 904638287
38 1024971256 687886583 16772585 139782 419663863 938988917
39 1041404669 704669411 16776722 137681 419617131 972896126
40 1058594326 721441018 16773492 142959 419662133 1006109192
41 1075179100 738218235 16776636 141185 419601996 1039892900
42 1092093726 754995452 16776233 142793 419611902 1073373451
43 1108706464 771773224 16776359 139500 419610885 1106976114
44 1125413652 788549886 16764637 143717 419889127 1139628280
45 1142023614 805327103 16778640 144397 419558217 1174329696
46 1158833317 822104321 16765518 148045 419889914 1206833484
47 1175665684 838881537 16778437 148347 419562885 1241397845
48 1192454164 855658755 16778865 153651 419552747 1275006511
49 1210199084 872436025 16778287 152468 419599314 1307945613
50 1226321832 889213188 16778464 155552 419572344 1341893668
51 1242886388 905990406 16778745 155401 419559249 1375589883
52 1259559053 922767623 16778809 154847 419554082 1409206082
53 1276875799 939544839 16778460 162521 419576455 1442424993
54 1293113199 956322057 16778931 154913 419550955 1476316161
55 1310449232 973099274 16778534 157364 419578102 1509485876
56 1327022109 989876491 16778794 162881 419562403 1543193559
57 1344097516 1006653708 16778906 157486 419567545 1576414302
58 1362935064 1023430928 16778959 315120 419583132 1609691339
59 1381567560 1040208143 16778564 179997 419661259 1640660745
60 1394829416 1056985359 16778779 167613 419575969 1677034188
61 1411847237 1073762626 16778071 166332 419613028 1710194702
62 1428918439 1090539795 16778409 168073 419610487 1743644637
63 1445223241 1107317011 16778486 172446 419591254 1777573503
64 1461530579 1124094228 16769606 169559 419970612 1810351736
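As a rough back-of-the-envelope check on the Icelake rows above, the extra cycles can be divided by the extra misses. The sketch below assumes that cycles grow linearly with N in a range where the loop runs from the Uop Cache without elevated misses (N = 17 to 21 here), and attributes everything above that extrapolated line to branch misses. The function name and the choice of baseline rows are my own, not part of the original measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Linearly extrapolate cycles from two baseline points (n0, c0) and
 * (n1, c1) to iteration count n, then attribute the excess of the
 * observed cycle count over that baseline to the observed misses. */
static double
extra_cycles_per_miss(uint64_t n0, uint64_t c0, uint64_t n1, uint64_t c1,
                      uint64_t n, uint64_t cycles, uint64_t misses) {
    assert(n1 > n0 && n >= n1 && misses > 0);
    double slope    = ((double)c1 - (double)c0) / (double)(n1 - n0);
    double baseline = (double)c1 + slope * (double)(n - n1);
    return ((double)cycles - baseline) / (double)misses;
}
```

With the N = 17 and N = 21 rows as the baseline and the N = 28 row as the observation, extra_cycles_per_miss(17, 353054773, 21, 421100725, 28, 854713119, 16642396) comes out at roughly 19 cycles per miss, which is consistent with the slowdown coming from about one mispredict per outer iteration.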

Whiskey Lake Results:

N cycles branches branch-misses idq.dsb_uops lsd.uops
1 8332553879 35005847 37925 1799462 6019
2 8329926329 51163346 34338 1114352 5919
3 8357233041 67925775 32270 9241935 5555
4 8379609449 85364250 35667 18215077 5712
5 8394301337 101563554 33177 26392216 2159
6 8409830612 118918934 35007 35318763 5295
7 8435794672 135162597 35592 43033739 4478
8 8445843118 152636271 37802 52154850 5629
9 8459141676 168577876 30766 59245754 1543
10 8475484632 185354280 30825 68059212 4672
11 8493529857 202489273 31703 77386249 5556
12 8509281533 218912407 32133 84390084 4399
13 8528605921 236303681 33056 93995496 2093
14 8553971099 252439989 572416 99700289 2477
15 8558526147 269148605 29912 109772044 6121
16 8576658106 286414453 29839 118504526 5850
17 8591545887 302698593 28993 126409458 4865
18 8611628234 319960954 32568 136298306 5066
19 8627289083 336312187 30094 143759724 6598
20 8644741581 353730396 49458 152217853 9275
21 8685908403 369886284 1175195 161313923 7958903
22 8694494654 387336207 354008 169541244 2553802
23 8702920906 403389097 29315 176524452 12932
24 8711458401 420211718 31924 184984842 11574
25 8729941722 437299615 32472 194553843 12002
26 8743658904 453739403 28809 202074676 13279
27 8763317458 470902005 32298 211321630 15377
28 8788189716 487432842 37105 218972477 27666
29 8796580152 504414945 36756 228334744 79954
30 8821174857 520930989 39550 235849655 140461
31 8818857058 537611096 34142 648080 79191
32 8855038758 555138781 37680 18414880 70489
33 8870680446 571194669 37541 34596108 131455
34 8888946679 588222521 33724 52553756 80009
35 9256640352 604791887 16658672 132185723 41881719
36 9189040776 621918353 12296238 257921026 235389707
37 8962737456 638241888 1086663 109613368 35222987
38 9005853511 655453884 2059624 131945369 73389550
39 9005576553 671845678 1434478 143002441 51959363
40 9284680907 688991063 12776341 349762585 347998221
41 9049931865 705399210 1778532 174597773 72566933
42 9314836359 722226758 12743442 365270833 380415682
43 9072200927 739449289 1344663 205181163 61284843
44 9346737669 755766179 12681859 383580355 409359111
45 9117099955 773167996 1801713 235583664 88985013
46 9108062783 789247474 860680 250992592 43508069
47 9129892784 806871038 984804 268229102 51249366
48 9146468279 822765997 1018387 282312588 58278399
49 9476835578 840085058 13985421 241172394 809315446
50 9495578885 856579327 14155046 241909464 847629148
51 9537115189 873483093 15057500 238735335 932663942
52 9556102594 890026435 15322279 238194482 982429654
53 9589094741 907142375 15899251 234845868 1052080437
54 9609053333 923477989 16049518 233890599 1092323040
55 9628950166 940997348 16172619 235383688 1131146866
56 9650657175 957049360 16445697 231276680 1183699383
57 9666446210 973785857 16330748 233203869 1205098118
58 9687274222 990692518 16523542 230842647 1254624242
59 9706652879 1007946602 16576268 231502185 1288374980
60 9720091630 1024044005 16547047 230966608 1321807705
61 9741079017 1041285110 16635400 230873663 1362929599
62 9761596587 1057847755 16683756 230289842 1399235989
63 9782104875 1075055403 16299138 237386812 1397167324
64 9790122724 1091147494 16650471 229928585 1463076072

Edit: 2 things worth noting:

  1. If I add padding to the inner loop so it won't fit in the uop cache I don't see this behavior until ~150 iterations.

  2. Adding an lfence along with the padding in the outer loop changes the N threshold to 31.

edit2: Benchmark which clears branch history. The condition was reversed; it should be cmove, not cmovne. With the fixed version, any iteration count sees elevated Branch Misses at the same rate as above (1.67 * 10^7). This means the LSD is not itself causing Branch Misses, but leaves open the possibility that the LSD is in some way defeating the Branch Predictor (what I believe to be the case).

static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt;
    asm volatile(
        ".p2align 12\n"
        "movl   %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align   6\n"
        "1:\n"
        "testl  $3, %k[outer_loop_cnt]\n"
        "movl   $1000, %k[inner_loop_cnt]\n"
        // Corrected: this must be cmove (an earlier version used cmovne,
        // which reversed the condition).
        "cmove    %k[inner_N], %k[inner_loop_cnt]\n"
        // Extra align surrounding inner loop so that the entire thing
        // doesn't execute out of LSD.
        ".p2align   10\n"
        "2:\n"
        "decl   %k[inner_loop_cnt]\n"
        "jnz    2b\n"
        ".p2align   10\n"
        "decl   %k[outer_loop_cnt]\n"
        "jnz    1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt)
        // cmov cannot take an immediate source, so inner_N must be a register.
        : [ inner_N ] "r"(inner_N), [ outer_N ] "i"(outer_N)
        :);
}

There are 2 answers

Peter Cordes On

This may be a coincidence; similar misprediction effects happen on Skylake (with recent microcode that disables the LSD; see footnote 1): an inner-loop count of about 22 or 23 is enough to stop its IT-TAGE predictor from learning the pattern of 21 taken, 1 not-taken for the inner-loop branch, in exactly this simple nested loop situation, last time I tested.

Choosing that iteration threshold for when to lock the loop down into the LSD might make some sense, or be a side-effect of your 1-uop loop and the LSD's "unrolling" behaviour on Haswell and later to get multiple copies of tiny loops into the IDQ before locking it down, to reduce the impact of the loop not being a multiple of the pipeline width.

Footnote 1: I'm surprised your Whiskey Lake seems to have a working LSD; I thought LSD was still disabled in all Skylake derivatives, at least including Coffee Lake, which was launched concurrently with Whiskey Lake.


My test loop was two dec/jne loops nested simply, IIRC, but your code has padding after the inner loop. (Starting with a jmp because that's what a huge .p2align does.) This puts the two loop branches at significantly different addresses. Either or both of these differences may help them avoid aliasing or some other kind of interference, because I'm seeing mostly-correct predictions for many (but not all) counts much greater than 23.

With your test code on my i7-6700k, lsd.uops is always exactly 0 of course. Compared to your Whiskey Lake data, only a few inner-loop counts produce high mispredict rates, e.g. 40, but not 50.

So there might be some effect from the LSD on your WHL CPU, making it bad for some N values where SKL is fine. (Assuming their IT-TAGE predictors are truly identical.)

e.g. with perf stat ... -r 5 ./a.out on Skylake (i7-6700k) with microcode revision 0xe2.

N branch-misses rate variance
17 59,602 0.02% of all branches ( +- 10.85% )
20 192,307 0.05% of all branches ( +- 44.60% )
21 79,853 0.02% of all branches ( +- 14.16% )
30 136,308 0.02% of all branches ( +- 18.57% )
31..32 similar to N=34 ( +- 2 or 3% )
33 22,415,089 3.71% of all branches ( +- 0.11% )
34 91,483 0.01% of all branches ( +- 2.36% )
35 (and 36..37 similar) 98,806 0.02% of all branches ( +- 2.75% )
38 33,517,630 4.87% of all branches ( +- 0.05% )
39 102,077 0.01% of all branches ( +- 1.96% )
40 33,458,267 4.64% of all branches ( +- 0.06% )
41 116,241 0.02% of all branches ( +- 6.86% )
42 22,376,562 2.96% of all branches ( +- 0.01% )
43 116,713 0.02% of all branches ( +- 5.25% )
44 174,834 0.02% of all branches ( +- 35.08% )
45 124,555 0.02% of all branches ( +- 5.36% )
46 135,838 0.02% of all branches ( +- 9.95% )

These numbers are repeatable, it's not just system noise; the spikes of high mispredict counts are very real at those specific N values. Probably some effect of the size / geometry of the IT-TAGE predictor's tables.

Other counters like idq.ms_uops and idq.dsb_uops scale mostly as expected, although idq.ms_uops is somewhat higher in the ones with more misses. (That counts uops added to the IDQ while the MS-ROM is active, perhaps counting front-end work that happens while branch recovery is cleaning up the back-end? It's a very different counter from legacy-decode mite_uops.)

With higher mispredict rates, idq.dsb_uops is quite a lot higher, I guess because the IDQ gets discarded and refilled on mispredicts. Like 1,011,000,288 for N=42, vs. 789,170,126 for N=43.

Note the high variability for N=20, around that threshold near 23, but still a tiny overall miss rate, much lower than every time the inner loop exits.

This is surprising, and different from a loop without as much padding.

Noah On

Reason

  • The spike in Branch Misses is caused by the inner loop running out of the LSD.

  • The reason the LSD causes an extra branch miss for low iteration counts is that the "stop" condition on the LSD is a branch miss.

From Intel Optimization Manual Page 86.

The loop is sent to allocation 5 µops per cycle. After 45 out of the 46 µops are sent, in the next cycle only a single µop is sent, which means that in that cycle, 4 of the allocation slots are wasted. This pattern repeats itself, until the loop is exited by a misprediction. Hardware loop unrolling minimizes the number of wasted slots during LSD.

Essentially, when low enough iteration counts run out of the Uop Cache, they are perfectly predictable. But when they run out of the LSD, the built-in stop condition for the LSD is a branch mispredict, so we see an extra branch miss for every iteration of the outer loop. I guess the takeaway is: don't let nested loops execute out of the LSD. Note that the LSD only kicks in after ~[20, 25] iterations, so an inner loop with < 20 iterations will run optimally.
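The allocation arithmetic in the quoted passage from the manual can be sketched as follows. It assumes a plain model in which the loop body is replayed one pass at a time with no hardware unrolling and nothing else limiting allocation. The width of 5 uops per cycle and the 46-uop loop are the figures from the quote; the function names are illustrative only.

```c
#include <assert.h>

/* Cycles needed to send one pass of `uops` micro-ops to allocation
 * when at most `width` micro-ops are sent per cycle. */
static unsigned
lsd_cycles_per_pass(unsigned uops, unsigned width) {
    assert(uops > 0 && width > 0);
    return (uops + width - 1) / width; /* ceiling division */
}

/* Allocation slots left empty in the last cycle of each pass. */
static unsigned
lsd_wasted_slots(unsigned uops, unsigned width) {
    return lsd_cycles_per_pass(uops, width) * width - uops;
}
```

For the 46-uop loop in the quote, lsd_cycles_per_pass(46, 5) is 10 and lsd_wasted_slots(46, 5) is 4, matching the "4 of the allocation slots are wasted" statement; a 45-uop loop would waste none.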

Benchmark

All benchmarks are run on Icelake.

The new benchmark is essentially the same as the one in the original post, but on the advice of @PeterCordes I added a fixed byte size but varying number of nops to the inner loop. The idea is that the fixed length keeps the branches aliasing the same way in the BHT (Branch History Table), while varying the number of nops sometimes defeats the LSD.

I used 124 bytes of nop padding so that the nop padding + size of decl; jcc would be 128 bytes total.

The benchmark code is as follows:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifndef INNER_NOPS
#error "INNER_NOPS must be defined"
#endif

         
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))

static const uint64_t outer_N   = (1UL << 24);
static const uint64_t bht_shift = 4;
static const uint64_t bht_mask  = (1023 << bht_shift);

#define NOP1   ".byte 0x90\n"
#define NOP2   ".byte 0x66,0x90\n"
#define NOP3   ".byte 0x0f,0x1f,0x00\n"
#define NOP4   ".byte 0x0f,0x1f,0x40,0x00\n"
#define NOP5   ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP6   ".byte 0x66,0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP7   ".byte 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00\n"
#define NOP8   ".byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP9   ".byte 0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP10  ".byte 0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP11  ".byte 0x66,0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"


static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt, tmp;
    asm volatile(
        ".p2align 12\n"
        "movl   %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align   6\n"
        "1:\n"
        "movl   %k[inner_N], %k[inner_loop_cnt]\n"
        ".p2align   10\n"
        "2:\n"
        // This is defined in "inner_nops.h" with the necessary padding.
        INNER_NOPS
        "decl   %k[inner_loop_cnt]\n"
        "jnz    2b\n"
        ".p2align   10\n"
        "decl   %k[outer_loop_cnt]\n"
        "jnz    1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt), [ tmp ] "=&r"(tmp)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N),
          [ bht_mask ] "i"(bht_mask), [ bht_shift ] "i"(bht_shift)
        :);
}
// gcc -O3 -march=native -mtune=native lsd-branchmiss.c -o lsd-branchmiss
int
main(int argc, char ** argv) {
    assert(argc > 1);
    uint64_t inner_N = atoi(argv[1]);
    bench(inner_N);
    return 0;
}

Tests

I tested nop count = [0, 39].

Note that nop count = 1 does not mean a single nop in the inner loop; it is actually the following:

#define INNER_NOPS NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP3 NOP1 

To reach the full 128 byte padding.
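A padding recipe like the one above can be sanity-checked by summing the NOP sizes. This is only a sketch: the helper name is made up, and it assumes each entry is the byte length of one of the NOP1..NOP11 macros.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the byte sizes of a padding recipe. sizes[i] is the length in
 * bytes of the i-th NOP, which must match one of NOP1..NOP11. */
static unsigned
padding_bytes(const unsigned *sizes, size_t count) {
    unsigned total = 0;
    for (size_t i = 0; i < count; ++i) {
        assert(sizes[i] >= 1 && sizes[i] <= 11);
        total += sizes[i];
    }
    return total;
}
```

The recipe shown for nop count = 1 is twelve NOP10, one NOP3 and one NOP1, i.e. 14 nop instructions totalling 12 * 10 + 3 + 1 = 124 bytes; the decl; jnz pair accounts for the remaining 4 bytes of the 128.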

Results

  • At nop count <= 32 the inner loop is able to run out of the LSD, and we consistently see elevated Branch Misses when Iterations is large enough that it does so. Note that the elevated Branch Misses number corresponds 1-1 with the number of outer loop iterations. For these numbers, outer loop iterations = 2^24.

  • At nop count > 32 the loop has too many uops for the LSD and runs out of the Uop Cache. Here we do not see consistently elevated Branch Misses until Iterations becomes too large for its BHT entry to store its entire history.

nop count > 32 (No LSD)

Once there are too many nops for the LSD, the number of Branch Misses stays relatively low, with a few consistent spikes, until Iterations = 146, where Branch Misses spike to the number of outer loop iterations (2^24 in this case) and remain constant. My guess is that this is the upper bound on the history the BHT is able to store.

Below is a graph of Branch Misses (Y) versus Iterations (X) for nop count = [33, 39]:

hi-bm

All of the lines follow the same patterns and have the same spikes. The large spikes to outer loop iterations before 146 are at Iterations = [42, 70, 79, 86, 88]. This is consistently reproducible. I am not sure what is special about these values.

The key point, however, is that for the most part, for Iterations = [20, 145], Branch Misses is relatively low, indicating that the entire inner loop is being predicted correctly.

nop count <= 32 (LSD)

This data is a bit more noisy, but all of the different nop counts follow roughly the same trend: Branch Misses initially spikes to within a factor of 2 of outer loop iterations at Iterations = [21, 25] (note this is 2-3 orders of magnitude), at the same time that lsd.uops spikes by 4-5 orders of magnitude.

There is also a trend between nop count and the iteration value at which Branch Misses stabilizes at outer loop iterations, with a Pearson Correlation of 0.81. For nop count = [0, 32] the stabilization point is in the range Iterations = [15, 34].

Below is a graph of Branch Misses (Y) versus Iterations (X) for nops = [0, 32]:

lo-bm

Generally, with some noise, all of the different nop counts follow the same trend. They also follow the same trend when compared with lsd.uops.

Below is a table with nop count and the Pearson Correlation between Branch Misses and lsd.uops and idq.dsb_uops respectively.

nop lsd.uops idq.dsb_uops
0 0.961 -0.041
1 0.955 -0.081
2 0.919 -0.122
3 0.918 -0.299
4 0.947 -0.117
5 0.934 -0.298
6 0.894 -0.329
7 0.907 -0.308
8 0.91 -0.322
9 0.915 -0.316
10 0.877 -0.342
11 0.908 -0.28
12 0.874 -0.281
13 0.875 -0.523
14 0.87 -0.513
15 0.889 -0.522
16 0.858 -0.569
17 0.89 -0.507
18 0.858 -0.537
19 0.844 -0.565
20 0.816 -0.459
21 0.862 -0.537
22 0.848 -0.556
23 0.852 -0.552
24 0.85 -0.561
25 0.828 -0.573
26 0.857 -0.559
27 0.802 -0.372
28 0.762 -0.425
29 0.721 -0.112
30 0.736 -0.047
31 0.768 -0.174
32 0.847 -0.129

This should generally indicate that there is a strong correlation between the LSD and Branch Misses, and no meaningful relationship between the Uop Cache and Branch Misses.
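For reference, a Pearson Correlation like the ones in the table can be computed as below. This is a minimal sketch, not the script that produced the table: the function names are illustrative, and the square root is done with a short Newton iteration purely so the example does not depend on libm (sqrt from math.h would be the usual choice).

```c
#include <assert.h>
#include <stddef.h>

/* Square root by Newton's method, used only to keep the sketch
 * free of a libm dependency. Returns 0 for non-positive input. */
static double
sqrt_newton(double v) {
    if (v <= 0.0) {
        return 0.0;
    }
    double g = v > 1.0 ? v : 1.0;
    for (int i = 0; i < 200; ++i) {
        g = 0.5 * (g + v / g);
    }
    return g;
}

/* Pearson correlation of x[0..n) and y[0..n).
 * Returns 0 if either series is constant. */
static double
pearson(const double *x, const double *y, size_t n) {
    assert(n > 1);
    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; ++i) {
        mx += x[i];
        my += y[i];
    }
    mx /= (double)n;
    my /= (double)n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double dx = x[i] - mx;
        double dy = y[i] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if (sxx == 0.0 || syy == 0.0) {
        return 0.0;
    }
    return sxy / sqrt_newton(sxx * syy);
}
```

Here x would be the Branch Misses series over Iterations for one nop count, and y the matching lsd.uops or idq.dsb_uops series.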

Overall

Generally I think it is clear that the inner loop executing out of the LSD is what causes the Branch Misses, until Iterations becomes too large for the BHT entry's history. For nop count = [33, 39], apart from the spikes noted above, we don't see elevated Branch Misses, but we do for nop count = [0, 32], and the only difference I can tell is the LSD.