Compare two 16-byte values for equality using up to SSE 4.2?

1k views Asked by At

I have a struct like this:

struct {
    uint32_t a;
    uint16_t b;
    uint16_t c;
    uint16_t d;
    uint8_t  e;
} s;

and I would like to compare two of the above structs for equality, in the fastest way possible. I looked at the Intel Intrinsics Guide but couldn't find a compare for integers, the options available were mainly doubles and single-floating point vector-inputs.

Could somebody please advise the best approach? I can add a union to my struct to make processing easier.

I am limited (for now) to using SSE4.2, but any AVX answers would be welcome too if they are significantly faster. I am using GCC 4.8.2

2

There are 2 answers

2
zx485 On

A simple solution would be to just subtract the two structs byte wise after masking so you get an all-zero-value only if all packed bytes are identical. This code is in MASM format, but you surely can adapt that to gcc AT&T syntax or intrinsicals:

.data
  mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
  pand  xmm1, xmmword ptr [mask11byte]
  pand  xmm2, xmmword ptr [mask11byte]
  psubb xmm1, xmm2
  ptest xmm1, xmm1   ; SSE 4.1
  setz al     ; AL=TRUE for equal

Addition: Because the size of the struct is 11 byte, 256bit/32byte-AVX(x) would make no sense.

0
Peter Cordes On

What @zx485 should have written is:

.data
  mask11byte db 0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0ffh,0,0,0,0,0
.code
  pxor xmm1, xmm2  ; equiv to psubb, but runs on all 3 vector execution ports
  ptest xmm1, xmmword ptr [mask11byte]   ; SSE 4.1
  setz al     ; AL=TRUE for equal

As long as nothing bad happens (floating point exceptions), you don't need to mask off your operands before computation, even if they hold garbage. And since PTEST does a bitwise AND as part of its operation, you don't need a separate PAND at all.

For a while, I thought I had a version that could use less space and fewer uops, but I ended up needing an extra instruction because there's no pcmpneq (so I needed a logical not). So it's smaller, the same number of uops, but significantly worse latency.

.code
  PCMPEQB xmm1, xmm2  ; bytes of xmm1 = 0xFF on equal
  PMOVMSKB eax, xmm1  ; ax = high bit of each byte of xmm1
  NOT eax
  TEST eax, 0x7FF  ; zero flag set if all the low 11 bits are zero
  SETZ al    ; 17 bytes

; Or one fewer insn with BMI1's ANDN.  One fewer uop if test can't macro-fuse
  ANDN eax, eax, [mask11bits]   ; only test the low 11 bits.
;  ANDN version takes 20 bytes, plus 2B of data
.data
  mask11bits dw 07ffh

test can macro-fuse with jcc, so if you're using this as a jump condition instead of actually doing setz, you come out ahead on size. (since you don't need the 16B mask constant.)

ptest takes 2 uops, so the ptest version is 4 uops total (including the jcc or other instruction). The pmovmskb version is also 4 uops with a test/jcc macro-fused branch, but 5 with cmovcc / setcc. (4 with andn, with any of setcc / cmovcc / jcc since it can't macro-fuse`.)

(Agner Fog's table says ptest takes 1 fused-domain uop on Sandybridge, 2 on all other Intel CPUs that support it. I'm not sure I believe that, though.)

Latency on Haswell (important if the branch doesn't predict well):

  • pxor: 1 + ptest: 2 = 3 cycles
  • pcmpeqb: 1 + pmovmskb: 3 + not: 1 + test: 1 = 6 cycles
  • pcmpeqb: 1 + pmovmskb: 3 + andn: 1 = 5 cycles (but not macro-fused, so possibly 1 more cycle of latency?)

So the ptest version has significantly shorter latency: jcc can execute sooner, to detect branch mispredicts sooner.

Agner Fog's tests show ptest has latency = 3 on Nehalem, 1 on SnB/IvB, 2 on Haswell.