Details of the test program
I tested this in user mode using the example from the Intel SDM. The details of the test are as follows:
- Provide a global lock in System V shared memory so that multiple processes can use it
- Use the assembly example from the Intel SDM as the main code for grabbing the lock:
int get_lock()
{
int locked_val = 1;
__asm__(
"Spin_Lock: \n"
"cmpq $0, (%[global_lock_addr]) \n"
"je Get_Lock \n"
FILL_INST "\n" //FILL_INST is "nop" or "pause"
"jmp Spin_Lock \n"
"Get_Lock: \n"
"mov $1, %[locked_val] \n"
"xchg %[locked_val], (%[global_lock_addr]) \n"
"cmp $0, %[locked_val]\n"
"jne Spin_Lock \n"
"Get_Lock_Success:"
::
[global_lock_addr] "r" (global_lock),
[locked_val] "r" (locked_val)
:
);
return 0;
}
- The lock release is implemented as follows:
int release_lock()
{
int unlock_val = 0;
__asm("xchg %[unlock_val], (%[global_lock_addr])"::
[global_lock_addr] "r" (global_lock),
[unlock_val] "r" (unlock_val)
:);
return 0; // the function is declared int, so return a value
}
- Each process exits after acquiring and releasing the lock LOOP_TIME times:
int main()
{
...
printf("exec lock \n");
for (i = 0; i < LOOP_TIME; i++) {
get_lock();
release_lock();
}
...
}
- Two executable programs can be compiled using this Makefile:
  - spinlock_pause: the FILL_INST macro is defined as the "pause" string
  - spinlock_nopause: the FILL_INST macro is defined as the "nop" string
- The exec.sh script handles compilation and running, provided that the CUR_EXEC_NUM environment variable is defined. This variable indicates how many processes will be started.
- The perf command is used to collect the machine_clears.memory_ordering and inst_retired.any events while the program executes
- The test results are saved in the ./log directory
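For readers who want to reproduce the first step, the shared lock word could be set up roughly as follows. This is only a sketch under my own assumptions, not the code from the test program: the helper names attach_global_lock and detach_global_lock are hypothetical, and the segment holds a single int. On Linux a newly created System V segment is zero-filled, so the lock starts out free.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/* Hypothetical helper: create (or open) a System V segment that holds one
 * lock word and map it into this process. Returns NULL on failure.
 * Note: if shmat() fails, a segment created here is not cleaned up. */
static volatile int *attach_global_lock(key_t key, int *shmid_out)
{
    int shmid = shmget(key, sizeof(int), IPC_CREAT | 0600);
    if (shmid == -1)
        return NULL;

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *)-1)
        return NULL;

    *shmid_out = shmid;
    return (volatile int *)addr;
}

/* Hypothetical helper: unmap the lock word and, if requested, mark the
 * segment for removal. Returns 0 on success, -1 on failure. */
static int detach_global_lock(volatile int *lock, int shmid, int remove_segment)
{
    if (shmdt((const void *)lock) == -1)
        return -1;
    if (remove_segment && shmctl(shmid, IPC_RMID, NULL) == -1)
        return -1;
    return 0;
}
```

Every process that should contend for the lock would call attach_global_lock with the same key; only the last one to finish should pass a non-zero remove_segment.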
Test results
Test environment
- x86 E5-2666 v3
- Cores: 20, threads per core: 2
- export CUR_EXEC_NUM=40 (one process per hardware thread)
Results
- spinlock_pause
get lock time 5000000
Performance counter stats for '././spinlock_pause':
5,228,707 machine_clears.memory_ordering:u
5,018,001,922 inst_retired.any:u
78.053822086 seconds time elapsed
76.887470000 seconds user
0.022657000 seconds sys
- spinlock_nopause
get lock time 5000000
Performance counter stats for '././spinlock_nopause':
74,524,989 machine_clears.memory_ordering:u
21,212,346,839 inst_retired.any:u
73.076739387 seconds time elapsed
72.129267000 seconds user
0.010899000 seconds sys
From the above results, it can be seen that the pause instruction
reduces the machine_clears.memory_ordering event count
(I don't know whether this event is the same thing as a memory order violation).
However, because pause takes more cycles than nop, the number of
instructions executed (inst_retired.any) also drops, so I don't think
this test shows that pause avoids machine_clears.memory_ordering.
On the other hand, the total execution time of the program using
the pause instruction is NOT LESS than that of the program using
the nop instruction.
Based on these results, I would like to ask whether there are any defects in my test, or how the results can be explained.
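One way to separate the two effects is to normalize the counters before comparing them. The sketch below does that arithmetic. It assumes that each perf run above covers a single process and that "get lock time 5000000" means 5,000,000 acquisitions in that process; the function names are hypothetical.

```c
#include <assert.h>

/* Events per 1000 retired instructions, so that runs retiring very
 * different numbers of instructions can be compared. */
static double events_per_thousand_instructions(unsigned long long events,
                                               unsigned long long instructions)
{
    assert(instructions != 0);
    return 1000.0 * (double)events / (double)instructions;
}

/* Any counter divided by the number of lock acquisitions
 * (assumed to be the "get lock time" value printed by the program). */
static double count_per_acquisition(unsigned long long count,
                                    unsigned long long acquisitions)
{
    assert(acquisitions != 0);
    return (double)count / (double)acquisitions;
}
```

Fed with the numbers above, this gives roughly 1.0 machine clears per 1000 instructions for spinlock_pause and roughly 3.5 for spinlock_nopause, or about 1 versus about 15 clears per acquisition. This only rescales the measurements; it does not by itself say what causes the clears.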
Thank you for your comment. From it I learned a lot about the details of inline assembly (there are still some parts that I haven't understood and will keep studying), and I modified the previous code as follows:
- "r" (global_lock) input operand changed to "+m" (*global_lock)
- %rax to do this, and indicate it in Clobbers
Sorry, I didn't understand the meaning of this part. Is it necessary to first execute the xchg instruction before executing a pure load instruction such as mov? For example:
But this seems meaningless. I am not sure whether an atomic RMW instruction like xchg can cause a memory order violation, but the subsequent pure-load instructions (mov) will no longer cause one, because no other CPU stores to that memory before the lock is released.
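To make the ordering in question concrete, here is a sketch written with C11 atomics instead of inline assembly: it attempts the exchange first and falls back to a read-only spin only when the exchange fails. This is not the code from the comment or from my test program. The names spin_lock_t, cpu_relax, spin_lock_acquire, spin_lock_try_acquire and spin_lock_release are hypothetical, and cpu_relax expands to pause only when compiling for x86.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: GCC or Clang. On x86 this emits the pause instruction;
 * on other targets it is a no-op so the sketch still compiles. */
#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void)0)
#endif

typedef struct {
    atomic_int locked; /* 0 = free, 1 = held */
} spin_lock_t;

/* Single attempt: returns 1 if the lock was taken, 0 if it was already held. */
static int spin_lock_try_acquire(spin_lock_t *lock)
{
    assert(lock != NULL);
    return atomic_exchange_explicit(&lock->locked, 1,
                                    memory_order_acquire) == 0;
}

static void spin_lock_acquire(spin_lock_t *lock)
{
    /* Try the atomic exchange first (xchg on x86). */
    while (!spin_lock_try_acquire(lock)) {
        /* The exchange failed: spin with pure loads until the lock
         * looks free, then go back and retry the exchange. */
        while (atomic_load_explicit(&lock->locked,
                                    memory_order_relaxed) != 0)
            cpu_relax();
    }
}

static void spin_lock_release(spin_lock_t *lock)
{
    assert(lock != NULL);
    /* A release store is sufficient to unlock; on x86 this is a plain mov. */
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

In the uncontended case this takes the lock with one exchange and no preceding load, while the contended path still waits with loads only.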
I have tested the following methods for obtaining locks:
The test results are as follows:
The difference in machine_clears.memory_ordering is not significant across the different versions above, and neither is the difference in inst_retired.any.
Can this be explained by the cycle count of pause varying between scenarios?
I also think so. Later, I will do some tests for this. But I still want to know how much improvement can be achieved by using the pause instruction to avoid memory order violations (without considering the hyper-threading optimization), or what kind of code can be written to obtain clear-cut data.
Thank you again for Peter's answer.