Why doesn't using non-temporal store instructions reduce memory bandwidth usage? (Writes seem to be generating extra reads)


I want to use non-temporal store instructions to avoid the read bandwidth that write-allocate generates during memcpy. After this optimization, read and write bandwidth should be roughly the same, both equal to the amount of data actually being copied.

But in my experiment I found that the memory read bandwidth is still about 1.7x the write bandwidth.

My code is written using inline assembly, and the core logic is as follows:

asm volatile(
        "mov    %[memarea], %%rax \n"   // rax = dst arr
        "mov    %[srcarea], %%rcx \n"   // rcx = src arr
        "1: \n" // start of write loop
        "movdqa 0*16(%%rcx), %%xmm0 \n"
        "movdqa 1*16(%%rcx), %%xmm1 \n"
        "movdqa 2*16(%%rcx), %%xmm2 \n"
        "movdqa 3*16(%%rcx), %%xmm3 \n"
        "movdqa 4*16(%%rcx), %%xmm4 \n"
        "movdqa 5*16(%%rcx), %%xmm5 \n"
        "movdqa 6*16(%%rcx), %%xmm6 \n"
        "movdqa 7*16(%%rcx), %%xmm7 \n"
        "PREFETCHNTA 8*16(%%rcx) \n"
        "PREFETCHNTA 12*16(%%rcx) \n"
        "movntdq %%xmm0, 0*16(%%rax) \n"
        "movntdq %%xmm1, 1*16(%%rax) \n"
        "movntdq %%xmm2, 2*16(%%rax) \n"
        "movntdq %%xmm3, 3*16(%%rax) \n"
        "movntdq %%xmm4, 4*16(%%rax) \n"
        "movntdq %%xmm5, 5*16(%%rax) \n"
        "movntdq %%xmm6, 6*16(%%rax) \n"
        "movntdq %%xmm7, 7*16(%%rax) \n"
        "add    $8*16, %%rax \n"
        "add    $8*16, %%rcx \n"
        // test write loop condition
        "cmp    %[end], %%rax \n"       // compare to end iterator
        "jb     1b \n"
        : 
        : [memarea] "r" (dst), [srcarea] "r" (src), [end] "r" (dst+effect_size)
        : "rax", "rcx", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7", "cc", "memory");
    
    _mm_sfence();   // make the weakly-ordered NT stores globally visible before later stores
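For reference, the same loop can be written with SSE2 intrinsics instead of inline asm (a minimal sketch, not my exact harness; the function name nt_copy and the alignment/size assumptions are mine for illustration):

#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_stream_si128
#include <xmmintrin.h>   // _mm_prefetch, _mm_sfence
#include <stddef.h>

// Copy `size` bytes from src to dst using non-temporal stores.
// Assumes both pointers are 16-byte aligned and size is a multiple of 128.
static void nt_copy(void *dst, const void *src, size_t size)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t off = 0; off < size; off += 8 * 16) {
        _mm_prefetch(s + off + 8 * 16, _MM_HINT_NTA);    // prefetch ahead with the NTA hint
        _mm_prefetch(s + off + 12 * 16, _MM_HINT_NTA);
        for (int i = 0; i < 8; i++) {
            __m128i v = _mm_load_si128((const __m128i *)(s + off) + i);  // aligned SSE load
            _mm_stream_si128((__m128i *)(d + off) + i, v);               // NT store: no RFO read
        }
    }
    _mm_sfence();   // order the weakly-ordered NT stores before whatever follows
}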

In a memset test, the version based on non-temporal stores generates no read bandwidth at all, so the instructions themselves are clearly taking effect.
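That memset variant is essentially the store half of the copy loop with the loads removed (again only a sketch under the same alignment assumptions; nt_memset is a name I use here for illustration):

#include <emmintrin.h>   // SSE2 intrinsics; also pulls in _mm_sfence via <xmmintrin.h>
#include <stddef.h>

// Fill `size` bytes at dst with `byte`, using only non-temporal stores.
// Assumes dst is 16-byte aligned and size is a multiple of 16.
static void nt_memset(void *dst, int byte, size_t size)
{
    char *d = (char *)dst;
    __m128i v = _mm_set1_epi8((char)byte);           // broadcast the fill byte
    for (size_t off = 0; off < size; off += 16)
        _mm_stream_si128((__m128i *)(d + off), v);   // NT store: no read, no write-allocate
    _mm_sfence();
}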

Is there anything wrong with my usage?


update:

I conducted an experiment with the code provided by Peter Cordes:

  • CPU: Intel(R) Xeon(R) Gold 5218 CPU with 22MiB LLC

  • OS: CentOS 7 with kernel 5.14.0

  • compiler: gcc 4.8.5 and gcc 11.2.1

  • compile option: gcc nt_memcpy.c -o nt_memcpy.exe -O2 -msse2 -mavx -std=gnu99

  • tool used for bandwidth monitoring: pcm-memory

Results, with array sizes of 32 MB and 1 GB, calling the copy function in a loop (the benchmark loop is sketched after this list):

  • When copying 32MB of data, I have 4284.13 MB/s reads and 3778.02 MB/s writes. (The CPU has 22 MiB of L3 cache, so this test size is too small to be a good test on this CPU, unlike Peter's i7-6700k with 8 MiB, where it was almost large enough.)
  • When copying 1GB of data, I have 6336.48 MB/s reads and 3803.00 MB/s writes.
  • A baseline near-idle is 100 MB/s read, 75 MB/s write.
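The way I call the copy is roughly the following (a sketch only; Peter's nt_memcpy.c differs in details, and the buffer setup and repeat count here are placeholders). It calls the nt_copy sketch above, but the inline-asm version behaves the same way:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t size = 1UL << 30;                 // 1 GiB per buffer (32 MB in the other run)
    char *src, *dst;
    if (posix_memalign((void **)&src, 64, size) ||
        posix_memalign((void **)&dst, 64, size))
        return 1;                            // allocation failed
    memset(src, 1, size);                    // fault the pages in before measuring
    memset(dst, 1, size);
    for (int rep = 0; rep < 100; rep++)      // repeat while pcm-memory samples bandwidth
        nt_copy(dst, src, size);
    return 0;
}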

This result does not match the expected behavior of non-temporal stores (read and write bandwidth should be nearly equal).

It is also different from the results obtained by Peter Cordes (quoted from the comments):

I can't repro your results on an i7-6700k with DDR4-2666. Read ~= write bandwidth as monitored by intel_gpu_top to get stats from the integrated memory controllers. (About 13400 MiB/s read, 13900 MiB/s write, vs. a baseline near-idle of 1200 MiB/s read, 8 to 16 MiB/s write.)

Raising the array sizes to 1GiB, my IMC read bandwidth is just idle + write bandwidth, so no excess reads.

Where might the additional memory read bandwidth in my experiment come from?


The following are the performance counters collected with perf; no significant RFO events were observed, so the NT stores do appear to be taking effect.

# taskset -c 1 ./perf5 stat --all-user -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,idq.mite_uops,offcore_requests.demand_rfo,l2_rqsts.all_rfo -- ./nt_memcpy.exe

 Performance counter stats for './nt_memcpy.exe':

     88,537.13 msec task-clock                #    1.000 CPUs utilized
             0      context-switches          #    0.000 /sec
             0      cpu-migrations            #    0.000 /sec
       278,571      page-faults               #    3.146 K/sec
239,435,668,714      cycles                    #    2.704 GHz                      (93.56%)
55,340,543,407      instructions              #    0.23  insn per cycle           (94.51%)
53,026,607,051      uops_issued.any           #  598.919 M/sec                    (93.92%)
15,151,551,517      idq.mite_uops             #  171.132 M/sec                    (93.56%)
           168      offcore_requests.demand_rfo #    1.898 /sec                     (93.75%)
           994      l2_rqsts.all_rfo          #   11.227 /sec                     (92.61%)

  88.541352219 seconds time elapsed

  85.776548000 seconds user
   2.557275000 seconds sys

update:

I conducted an experiment with a client CPU:

  • CPU: Intel(R) Core(TM) i7-10700 CPU with 16MiB LLC
  • OS: Ubuntu with kernel 5.15.0
  • compiler: gcc 9.4.0
  • compile option: gcc nt_memcpy.c -o nt_memcpy.exe -O2 -msse2 -mavx -std=gnu99
  • tool used for bandwidth monitoring: intel-gpu-top

Results, with an array size of 1 GB, calling the copy function in a loop:

  • When copying 1GB of data, I have 14260 MiB/s reads and 14133 MiB/s writes.
  • A baseline near-idle is 48 MB/s read, 4 MB/s write.

This result is in line with expectations and also in line with the data obtained by Peter.

Is this a difference between server and client CPUs, or a difference between the tools pcm-memory and intel-gpu-top? I am unable to use intel-gpu-top on the server (because it has no graphics device), and cannot use pcm-memory on the client CPU (architecture not supported), so I cannot cross-check the two tools against each other.
