I found many copies of the same implementation of the Karatsuba algorithm for ARM NEON on the internet, including some science papers, eg http://conradoplg.cryptoland.net/files/2010/12/mocrysen13.pdf
It looks like all these copies have the same glitch. While being said that it does arithmetic multiply, the result is still polynomial. If changing the vmull.p8
to vmull.u8
it produces a correct result for small numbers with one lowest word. But big numbers, that exceed lowest word, are not correct.
Implemention looks good:
.macro mul64_p8 rq rl rh ad bd k16 k32 k48 t0q t0l t0h t1q t1l t1h t2q t2l t2h t3q t3l t3h
@A1
vext.8 \t0l, \ad, \ad, $1
@F = A1*B
vmull.p8 \t0q, \t0l, \bd
@B1
vext.8 \rl, \bd, \bd, $1
@E = A*B1 (7)
vmull.p8 \rq, \ad, \rl
@A2
vext.8 \t1l, \ad, \ad, $2
@H = A2*B
vmull.p8 \t1q, \t1l, \bd
@B2
vext.8 \t3l, \bd, \bd, $2
@G = A*B2
vmull.p8 \t3q, \ad, \t3l
@A3
vext.8 \t2l, \ad, \ad, $3
@J = A3*B
vmull.p8 \t2q, \t2l, \bd
@L = E + F
veor \t0q, \t0q, \rq
@B3
vext.8 \rl, \bd, \bd, $3
@I = A*B3
vmull.p8 \rq, \ad, \rl
@M = G + H
veor \t1q, \t1q, \t3q
@B4
vext.8 \t3l, \bd, \bd, $4
@K = A*B4
vmull.p8 \t3q, \ad, \t3l
@t0 = (L) (P0 + P1) << 8
veor \t0l, \t0l, \t0h
vand \t0h, \t0h, \k48
@t1 = (M) (P2 + P3) << 16
veor \t1l, \t1l, \t1h
vand \t1h, \t1h, \k32
@N = I + J
veor \t2q, \t2q, \rq
veor \t0l, \t0l, \t0h
veor \t1l, \t1l, \t1h
@t2 = (N) (P4 + P5) << 24
veor \t2l, \t2l, \t2h
vand \t2h, \t2h, \k16
@t3 = (K) (P6 + P7) << 32
veor \t3l, \t3l, \t3h
vmov.i64 \t3h, $0
vext.8 \t0q, \t0q, \t0q, $15
veor \t2l, \t2l, \t2h
vext.8 \t1q, \t1q, \t1q, $14
vmull.p8 \rq, \ad, \bd
vext.8 \t2q, \t2q, \t2q, $13
vext.8 \t3q, \t3q, \t3q, $12
veor \t0q, \t0q, \t1q
veor \t2q, \t2q, \t3q
veor \rq, \rq, \t0q
veor \rq, \rq, \t2q
.endm
But why is it not correct as stated? Basically I need an efficient 64x64=128 arithmetic operation for arm-v7a, and looks like NEON is the only choice here.
It feels like losing a carry flag somewhere among these operations, because results I observe are quite close to correct ones.