I want to use non-temporal store instructions to eliminate the read bandwidth generated by write-allocate (RFO) during memcpy. With regular stores, every destination cache line is read into the cache before it is written, so a copy reads roughly twice as much data as it writes; with non-temporal stores, the read and write bandwidth should both be equal to the actual amount of data copied.
However, in my experiment the measured memory read bandwidth is still about 1.7x the write bandwidth.
My code is written using inline assembly, and the core logic is as follows:
asm volatile(
"mov %[memarea], %%rax \n" // rax = dst arr
"mov %[srcarea], %%rcx \n" // rcx = src arr
"1: \n" // start of write loop
"movdqa 0*16(%%rcx), %%xmm0 \n"
"movdqa 1*16(%%rcx), %%xmm1 \n"
"movdqa 2*16(%%rcx), %%xmm2 \n"
"movdqa 3*16(%%rcx), %%xmm3 \n"
"movdqa 4*16(%%rcx), %%xmm4 \n"
"movdqa 5*16(%%rcx), %%xmm5 \n"
"movdqa 6*16(%%rcx), %%xmm6 \n"
"movdqa 7*16(%%rcx), %%xmm7 \n"
"PREFETCHNTA 8*16(%%rcx) \n"
"PREFETCHNTA 12*16(%%rcx) \n"
"movntdq %%xmm0, 0*16(%%rax) \n"
"movntdq %%xmm1, 1*16(%%rax) \n"
"movntdq %%xmm2, 2*16(%%rax) \n"
"movntdq %%xmm3, 3*16(%%rax) \n"
"movntdq %%xmm4, 4*16(%%rax) \n"
"movntdq %%xmm5, 5*16(%%rax) \n"
"movntdq %%xmm6, 6*16(%%rax) \n"
"movntdq %%xmm7, 7*16(%%rax) \n"
"add $8*16, %%rax \n"
"add $8*16, %%rcx \n"
// test write loop condition
"cmp %[end], %%rax \n" // compare to end iterator
"jb 1b \n"
:
: [memarea] "r" (dst), [srcarea] "r" (src), [end] "r" (dst+effect_size)
: "rax", "rcx", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7", "cc", "memory");
_mm_sfence();
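For readability, here is the same loop expressed with SSE2 intrinsics. This is just a sketch equivalent to the inline asm above, not the code I measured; the function name nt_copy_128 is made up, and it assumes 16-byte-aligned pointers and a size that is a multiple of 128 bytes:

#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_stream_si128
#include <xmmintrin.h>   // _mm_prefetch, _mm_sfence
#include <stddef.h>

// Equivalent of the inline-asm loop above: 128 bytes per iteration,
// aligned loads from src, non-temporal (streaming) stores to dst.
// Assumes src/dst are 16-byte aligned and size is a multiple of 128.
static void nt_copy_128(void *dst, const void *src, size_t size)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    char *end = d + size;
    while (d < end) {
        __m128i x0 = _mm_load_si128((const __m128i *)(s + 0 * 16));
        __m128i x1 = _mm_load_si128((const __m128i *)(s + 1 * 16));
        __m128i x2 = _mm_load_si128((const __m128i *)(s + 2 * 16));
        __m128i x3 = _mm_load_si128((const __m128i *)(s + 3 * 16));
        __m128i x4 = _mm_load_si128((const __m128i *)(s + 4 * 16));
        __m128i x5 = _mm_load_si128((const __m128i *)(s + 5 * 16));
        __m128i x6 = _mm_load_si128((const __m128i *)(s + 6 * 16));
        __m128i x7 = _mm_load_si128((const __m128i *)(s + 7 * 16));
        // prefetching a little past the end of src on the last iterations is harmless (hint only)
        _mm_prefetch(s + 8 * 16, _MM_HINT_NTA);
        _mm_prefetch(s + 12 * 16, _MM_HINT_NTA);
        _mm_stream_si128((__m128i *)(d + 0 * 16), x0);
        _mm_stream_si128((__m128i *)(d + 1 * 16), x1);
        _mm_stream_si128((__m128i *)(d + 2 * 16), x2);
        _mm_stream_si128((__m128i *)(d + 3 * 16), x3);
        _mm_stream_si128((__m128i *)(d + 4 * 16), x4);
        _mm_stream_si128((__m128i *)(d + 5 * 16), x5);
        _mm_stream_si128((__m128i *)(d + 6 * 16), x6);
        _mm_stream_si128((__m128i *)(d + 7 * 16), x7);
        d += 8 * 16;
        s += 8 * 16;
    }
    _mm_sfence();   // make the NT stores globally visible before returning
}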
In the memset test, the version based on non-temporal instructions generates no read bandwidth, so the instructions themselves do take effect.
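That memset test is essentially a loop of NT stores of a zeroed register; a minimal sketch of the idea (not the exact test code, the function name is mine, and it assumes dst is 16-byte aligned and size is a multiple of 16):

#include <emmintrin.h>   // SSE2: _mm_setzero_si128, _mm_stream_si128
#include <xmmintrin.h>   // _mm_sfence
#include <stddef.h>

// NT-store memset check: only streaming stores are issued, so the memory
// controller should see (almost) no read traffic, because NT stores bypass
// the cache and avoid the read-for-ownership of a normal store miss.
static void nt_memset_zero(void *dst, size_t size)
{
    __m128i zero = _mm_setzero_si128();
    char *d = (char *)dst;
    for (size_t i = 0; i < size; i += 16)
        _mm_stream_si128((__m128i *)(d + i), zero);
    _mm_sfence();   // make the streaming stores globally visible
}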
Is there anything wrong with my usage?
Update:
I ran the code provided by Peter Cordes with the following setup:
- CPU: Intel(R) Xeon(R) Gold 5218 CPU with 22 MiB LLC
- OS: CentOS 7 with kernel 5.14.0
- compiler: gcc 4.8.5 and gcc 11.2.1
- compile option:
gcc nt_memcpy.c -o nt_memcpy.exe -O2 -msse2 -mavx -std=gnu99
- tool used for bandwidth monitoring:
pcm-memory
Results, with array sizes of 32MB and 1GB, calling the copy function in a loop:
- When copying 32MB of data, I have 4284.13 MB/s reads and 3778.02 MB/s writes. (The CPU has 22 MiB of L3 cache, so this test size is too small to be a good test on this CPU, unlike Peter's i7-6700k with 8 MiB, where it was almost large enough.)
- When copying 1GB of data, I have 6336.48 MB/s reads and 3803.00 MB/s writes
- A baseline near-idle is 100 MB/s read, 75 MB/s write
This does not match the expected behavior of non-temporal stores (read and write bandwidth should be close); instead, reads are about 6336/3803 ≈ 1.67x writes, consistent with the ~1.7x ratio mentioned above.
It also differs from the results Peter Cordes obtained (quoted from his comments):
I can't repro your results on an i7-6700k with DDR4-2666. Read ~= write bandwidth as monitored by intel_gpu_top to get stats from the integrated memory controllers. (About 13400 MiB/s read, 13900 MiB/s write, vs. a baseline near-idle of 1200 MiB/s read, 8 to 16 MiB/s write.)
Raising the array sizes to 1GiB, my IMC read bandwidth is just idle + write bandwidth, so no excess reads.
Where might the additional memory read bandwidth in my experiment be coming from?
The following are the performance counters collected by perf. No significant RFO events were observed: a regular-store copy would trigger roughly one RFO per 64-byte destination line, i.e. about 16.8 million per GiB copied, so counts in the hundreds are negligible, and the NT stores do seem to be taking effect.
# taskset -c 1 ./perf5 stat --all-user -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,idq.mite_uops,offcore_requests.demand_rfo,l2_rqsts.all_rfo -- ./nt_memcpy.exe
Performance counter stats for './nt_memcpy.exe':
88,537.13 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
278,571 page-faults # 3.146 K/sec
239,435,668,714 cycles # 2.704 GHz (93.56%)
55,340,543,407 instructions # 0.23 insn per cycle (94.51%)
53,026,607,051 uops_issued.any # 598.919 M/sec (93.92%)
15,151,551,517 idq.mite_uops # 171.132 M/sec (93.56%)
168 offcore_requests.demand_rfo # 1.898 /sec (93.75%)
994 l2_rqsts.all_rfo # 11.227 /sec (92.61%)
88.541352219 seconds time elapsed
85.776548000 seconds user
2.557275000 seconds sys
Update:
I repeated the experiment on a client CPU:
- CPU: Intel(R) Core(TM) i7-10700 CPU with 16MiB LLC
- OS: Ubuntu with kernel 5.15.0
- compiler: gcc 9.4.0
- compile option:
gcc nt_memcpy.c -o nt_memcpy.exe -O2 -msse2 -mavx -std=gnu99
- tool used for bandwidth monitoring:
intel_gpu_top
Results, with an array size of 1GB, calling the copy function in a loop:
- When copying 1GB of data, I have 14260 MiB/s reads and 14133 MiB/s writes
- A baseline near-idle is 48 MB/s read, 4 MB/s write
This result is in line with expectations, and also with the data Peter obtained.
Is the discrepancy due to the difference between server and client CPUs, or to the difference between the monitoring tools, pcm-memory vs. intel_gpu_top? I am unable to use intel_gpu_top on the server (because I do not have a graphics card), and I cannot use pcm-memory on the client CPU (architecture not supported), so I cannot cross-check one tool against the other.
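One cross-check I have not run yet (just an idea; the event names may vary by CPU generation and kernel, and can be listed under /sys/devices/uncore_imc_*/events/): on the server, the IMC counters that pcm-memory reads are also exposed to perf as uncore PMUs, so something like the command below should report DRAM read/write traffic independently of pcm-memory:
sudo perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ -- sleep 10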