I have spent quite a lot of time hand-optimizing low-level integer arithmetic, with some success. For instance, my 6x6 multiplication subroutine takes 66 ticks on Skylake, compared to 82 ticks for mpn_mul_basecase(6,6). My code is published on GitHub.
I am currently working on 8x8 multiplication for AMD Ryzen, using a Ryzen 7 3800X for benchmarking. I try hard to avoid latencies. I've studied Agner Fog's "Instruction tables" and also Torbjörn Granlund's "Instruction latencies ...". Nothing there suggests major problems with adox/adcx on Ryzen; there should be no big difference between Ryzen and Skylake concerning adox/adcx. I've benchmarked an 8x1 multiplication subroutine using mulx and one of adcq, adox or adcx; all three variants of the subroutine run fast on both Skylake and Ryzen (18-19 ticks).
However, when I mix adox and adcx, my code runs awfully slowly on Ryzen. For instance, my 8x2 multiplication subroutine takes 34 ticks on a Skylake i7-6700 and 293 ticks on the Ryzen 7 3800X (more than 8 times slower).
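For readers unfamiliar with the pattern, here is a minimal portable C sketch of the arithmetic that an 8x2 multiplication with two carry chains performs. It is not my published assembly: the names mul_8x1, addmul_8x1_two_chains and mul_8x2 are made up for illustration, the cf/of variables only emulate the two flags that adcx and adox use, and unsigned __int128 (a GCC/Clang extension) stands in for mulx.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC/Clang extension, stands in for mulx */

/* Return the low 64 bits of a + b + *carry and store the carry-out (0 or 1). */
static uint64_t add_carry(uint64_t a, uint64_t b, unsigned *carry) {
    u128 s = (u128)a + b + *carry;
    *carry = (unsigned)(s >> 64);
    return (uint64_t)s;
}

/* r[0..8] = a[0..7] * b, using a single carry chain. */
static void mul_8x1(uint64_t r[9], const uint64_t a[8], uint64_t b) {
    uint64_t carry = 0;
    for (int i = 0; i < 8; i++) {
        u128 p = (u128)a[i] * b + carry;  /* cannot overflow 128 bits */
        r[i] = (uint64_t)p;
        carry = (uint64_t)(p >> 64);
    }
    r[8] = carry;
}

/* r[0..8] += a[0..7] * b, using two independent carry chains.
   cf emulates the carry flag (adcx), of emulates the overflow flag (adox).
   The caller must guarantee that the result fits in 9 limbs. */
static void addmul_8x1_two_chains(uint64_t r[9], const uint64_t a[8], uint64_t b) {
    unsigned cf = 0, of = 0;
    for (int i = 0; i < 8; i++) {
        u128 p = (u128)a[i] * b;                                    /* mulx */
        r[i]     = add_carry(r[i],     (uint64_t)p,         &cf);   /* adcx: low half  */
        r[i + 1] = add_carry(r[i + 1], (uint64_t)(p >> 64), &of);   /* adox: high half */
    }
    r[8] = add_carry(r[8], 0, &cf);  /* fold the last low-half carry into the top limb */
    assert(cf == 0 && of == 0);      /* holds when the result fits in 9 limbs */
}

/* r[0..9] = a[0..7] * b[0..1]: one plain row, then one two-chain row. */
void mul_8x2(uint64_t r[10], const uint64_t a[8], const uint64_t b[2]) {
    mul_8x1(r, a, b[0]);
    r[9] = 0;
    addmul_8x1_two_chains(r + 1, a, b[1]);
}
```

The point of the two chains is that the additions of the low halves and of the high halves do not depend on each other's carry, so in assembly the adcx and adox instructions can be interleaved freely.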
Any suggestions as to why the mulx/adox/adcx code runs 8 times slower on Ryzen?
Getting rid of heavy xmm/ymm usage solved the problem. The modified subroutine costs only 42 ticks.
It looks like Ryzen has no problems with adox/adcx. Instead, Ryzen obviously has problems with vmovdqu (memory to register) and/or vpextrq and/or vperm2i128.
The question was silly.
@NateEldredge Your hint was helpful. Thank you.