Why don't adox and adcx play well together on Ryzen?


I have spent quite a lot of time hand-optimizing low-level integer arithmetic, with some success. For instance, on Skylake my subroutine for 6x6 multiplication takes 66 ticks, compared to 82 ticks for mpn_mul_basecase(6,6). My code is published on GitHub.

I am currently working on 8x8 multiplication for AMD Ryzen, using a Ryzen 7 3800X for benchmarking. I try hard to avoid latency stalls. I've studied Agner Fog's "Instruction tables" and also Torbjörn Granlund's "Instruction latencies ...". Neither suggests major problems with adox/adcx on Ryzen; there should be no big difference between Ryzen and Skylake where adox/adcx are concerned. I've benchmarked an 8x1 multiplication subroutine using mulx and one of adcq, adox or adcx; all three variants run fast on both Skylake and Ryzen (18-19 ticks).
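For reference, here is a minimal C sketch of what an 8x1 routine computes. It is not the assembly being benchmarked: the name `mul_n_by_1` is hypothetical, and `unsigned __int128` is a GCC/Clang extension that stands in for the 64x64 -> 128-bit product mulx produces. The single `carry` variable plays the role of the one carry chain used by the adcq-only, adox-only or adcx-only variants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not the benchmarked assembly:
   rp[0..n] = ap[0..n-1] * b, least significant limb first.
   rp must have room for n + 1 limbs and must not overlap ap. */
void mul_n_by_1(uint64_t *rp, const uint64_t *ap, size_t n, uint64_t b)
{
    assert(n > 0);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Full 128-bit product plus the incoming carry; this cannot
           overflow 128 bits because (2^64-1)^2 + (2^64-1) < 2^128. */
        unsigned __int128 p = (unsigned __int128)ap[i] * b + carry;
        rp[i] = (uint64_t)p;          /* low limb is stored */
        carry = (uint64_t)(p >> 64);  /* high limb feeds the next step */
    }
    rp[n] = carry;                    /* top limb of the result */
}
```

Calling `mul_n_by_1(r, a, 8, b)` with an 8-limb `a` and a 9-limb `r` gives the 8x1 case.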

However, when I attempt to mix adox and adcx, my code runs awfully slowly on Ryzen. For instance, my 8x2 multiplication subroutine takes 34 ticks on a Skylake i7-6700 and 293 ticks on the Ryzen 7 3800X (more than 8 times slower).
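To show why mixing the two instructions is attractive: adcx reads and writes only CF, adox reads and writes only OF, and mulx leaves the flags untouched, so two carry chains can be interleaved without saving or restoring flags. The portable C sketch below models that idea for an n x 2 product. It is an illustration under assumptions, not the actual subroutine: `mul_n_by_2` and `add_with_carry` are hypothetical names, `cf` and `of` are ordinary variables standing in for the two flags, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns x + y + *carry modulo 2^64 and stores
   the carry-out (0 or 1) back into *carry. */
uint64_t add_with_carry(uint64_t x, uint64_t y, unsigned *carry)
{
    unsigned __int128 s = (unsigned __int128)x + y + *carry;
    *carry = (unsigned)(s >> 64);
    return (uint64_t)s;
}

/* Hypothetical reference routine, not the benchmarked assembly:
   rp[0..n+1] = ap[0..n-1] * (bp[0] + bp[1] * 2^64), least significant
   limb first. rp needs n + 2 limbs and must not overlap ap or bp. */
void mul_n_by_2(uint64_t *rp, const uint64_t *ap, size_t n, const uint64_t *bp)
{
    assert(n > 0);

    /* Row 0: rp[0..n] = ap * bp[0], using a single carry chain. */
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)ap[i] * bp[0] + carry;
        rp[i] = (uint64_t)p;
        carry = (uint64_t)(p >> 64);
    }
    rp[n] = carry;
    rp[n + 1] = 0;

    /* Row 1: add ap * bp[1], shifted up by one limb, with two
       independent carry chains. */
    unsigned cf = 0; /* models CF: chain for the low product halves  */
    unsigned of = 0; /* models OF: chain for the high product halves */
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)ap[i] * bp[1];
        rp[i + 1] = add_with_carry(rp[i + 1], (uint64_t)p, &cf);
        rp[i + 2] = add_with_carry(rp[i + 2], (uint64_t)(p >> 64), &of);
    }
    rp[n + 1] += cf;  /* fold the remaining low-half carry into the top limb */
    assert(of == 0);  /* the full product fits in n + 2 limbs */
}
```

The two `add_with_carry` calls inside the loop do not depend on each other's carry, which is the property an interleaved adcx/adox sequence exploits.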

Any suggestions as to why the mulx/adox/adcx code runs more than 8 times slower on Ryzen?


There is 1 answer

9
Денис Крыськов

Getting rid of heavy xmm/ymm usage solved the problem.

The modified subroutine costs only 42 ticks.

It looks like Ryzen has no problems with adox/adcx. Ryzen evidently has problems with vmovdqu (memory to register) and/or vpextrq and/or vperm2i128.

The question was silly.

@NateEldredge Your hint was helpful. Thank you.