Why don't adox and adcx play well together on Ryzen?

Question

Asked by Денис Крыськов on 28 November 2020 at 12:48 · 297 views

I have spent quite a lot of time hand-optimizing low-level integer arithmetic, with some success. For instance, my subroutine for 6x6 multiplication takes 66 ticks, compared to 82 ticks for mpn_mul_basecase(6,6), on Skylake. My code is published on GitHub.

I am currently working on 8x8 multiplication for AMD Ryzen, using a Ryzen 7 3800X for benchmarking. I try hard to avoid latencies. I've studied Agner Fog's "Instruction tables" and also Torbjörn Granlund's "Instruction latencies ...". Nothing there suggests major problems with adox/adcx on Ryzen; there should be no big difference between Ryzen and Skylake where adox/adcx are concerned. I've benchmarked an 8x1 multiplication subroutine that uses mulx and one of adcq, adox or adcx; all three variants of the subroutine run fast on both Skylake and Ryzen (18-19 ticks).
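For readers unfamiliar with the terminology, here is a minimal portable C sketch of what an "8x1 multiplication" computes: an 8-limb number times a single 64-bit limb, giving 9 limbs. This is not the asker's assembly. The name `mul_nx1_ref` is hypothetical, and `unsigned __int128` is a GCC/Clang extension assumed to be available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not the asker's code:
   r[0..n] = a[0..n-1] * b, with 64-bit limbs stored least significant first. */
void mul_nx1_ref(uint64_t *r, const uint64_t *a, size_t n, uint64_t b)
{
    assert(n > 0);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Full 64x64 -> 128-bit product (the job mulx does in assembly),
           plus the carry limb from the previous position. Cannot overflow
           128 bits: (2^64-1)^2 + (2^64-1) < 2^128. */
        unsigned __int128 p = (unsigned __int128)a[i] * b + carry;
        r[i] = (uint64_t)p;            /* low half stays at position i   */
        carry = (uint64_t)(p >> 64);   /* high half moves to position i+1 */
    }
    r[n] = carry;
}
```

Calling `mul_nx1_ref(r, a, 8, b)` with an 8-limb `a` and a 10-limb or 9-limb `r` gives the 8x1 case discussed above. A hand-written version would unroll the loop and keep the carry in the flags register instead of a variable.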

However, when I mix adox and adcx, my code runs awfully slowly on Ryzen. For instance, my 8x2 multiplication subroutine takes 34 ticks on a Skylake i7-6700 and 293 ticks on the Ryzen 7 3800X (more than 8 times slower).
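As background on why the two instructions get mixed at all: adcx reads and writes only the carry flag (CF), adox only the overflow flag (OF), and mulx leaves the flags untouched, so one loop can run two carry chains at once. The sketch below is a portable C model of that idea for an 8x2 product, with the plain variables `cf` and `of` standing in for the two flags. It is an illustration under those assumptions, not the asker's subroutine; the name `mul_8x2_two_chains` is hypothetical and `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not the asker's code:
   r[0..9] = a[0..7] * b[0..1], 64-bit limbs, least significant first. */
void mul_8x2_two_chains(uint64_t r[10], const uint64_t a[8], const uint64_t b[2])
{
    /* Row 0: r[0..8] = a * b[0], a single carry chain is enough. */
    uint64_t carry = 0;
    for (int i = 0; i < 8; i++) {
        unsigned __int128 p = (unsigned __int128)a[i] * b[0] + carry;
        r[i] = (uint64_t)p;
        carry = (uint64_t)(p >> 64);
    }
    r[8] = carry;
    r[9] = 0;

    /* Row 1: r[1..9] += a * b[1], with two independent carry bits. */
    unsigned cf = 0, of = 0;
    for (int i = 0; i < 8; i++) {
        unsigned __int128 p = (unsigned __int128)a[i] * b[1]; /* mulx */
        uint64_t lo = (uint64_t)p, hi = (uint64_t)(p >> 64);

        /* adcx-like step: low half into position i+1, carry kept in cf. */
        unsigned __int128 s = (unsigned __int128)r[i + 1] + lo + cf;
        r[i + 1] = (uint64_t)s;
        cf = (unsigned)(s >> 64);

        /* adox-like step: high half into position i+2, carry kept in of. */
        unsigned __int128 t = (unsigned __int128)r[i + 2] + hi + of;
        r[i + 2] = (uint64_t)t;
        of = (unsigned)(t >> 64);
    }
    assert(of == 0);  /* the 10-limb result cannot overflow */
    r[9] += cf;       /* fold the last low-half carry into the top limb */
}
```

In real assembly the point of the two chains is that neither addition has to wait for the other's flag, so the mulx results can be consumed as soon as they are ready.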

Any suggestions as to why the mulx/adox/adcx code runs 8 times slower on Ryzen?

1 answer

**Денис Крыськов** · Answer 1 · 2020-11-30T12:10:12+00:00

Getting rid of heavy xmm/ymm usage solved the problem.

The modified subroutine costs only 42 ticks.

It looks like Ryzen has no problem with adox/adcx. Ryzen obviously has problems with vmovdqu (memory to register) and/or vpextrq and/or vperm2i128.

The question was silly.

@NateEldredge Your hint was helpful. Thank you.
