Karatsuba algorithm in ARM NEON implementation seems to have an error

259 views Asked by At

I found many copies of the same implementation of the Karatsuba algorithm for ARM NEON on the internet, including some science papers, eg http://conradoplg.cryptoland.net/files/2010/12/mocrysen13.pdf

It looks like all these copies have the same glitch. While being said that it does arithmetic multiply, the result is still polynomial. If changing the vmull.p8 to vmull.u8 it produces a correct result for small numbers with one lowest word. But big numbers, that exceed lowest word, are not correct. Implemention looks good:

.macro mul64_p8 rq rl rh ad bd k16 k32 k48 t0q t0l t0h t1q t1l t1h t2q t2l t2h t3q t3l t3h
    @A1
    vext.8  \t0l, \ad, \ad, $1
    @F = A1*B
    vmull.p8 \t0q, \t0l, \bd
    @B1
    vext.8  \rl, \bd, \bd, $1
    @E = A*B1 (7)
    vmull.p8 \rq, \ad, \rl
        @A2
        vext.8  \t1l, \ad, \ad, $2
        @H = A2*B
        vmull.p8 \t1q, \t1l, \bd
        @B2
        vext.8  \t3l, \bd, \bd, $2
        @G = A*B2
        vmull.p8 \t3q, \ad, \t3l
            @A3
            vext.8  \t2l, \ad, \ad, $3
            @J = A3*B
            vmull.p8 \t2q, \t2l, \bd
    @L = E + F
    veor    \t0q, \t0q, \rq
            @B3
            vext.8  \rl, \bd, \bd, $3
            @I = A*B3
            vmull.p8 \rq, \ad, \rl
        @M = G + H
        veor    \t1q, \t1q, \t3q
                @B4
                vext.8  \t3l, \bd, \bd, $4
                @K = A*B4
                vmull.p8 \t3q, \ad, \t3l
    @t0 = (L) (P0 + P1) << 8
    veor    \t0l, \t0l, \t0h
    vand    \t0h, \t0h, \k48
        @t1 = (M) (P2 + P3) << 16
        veor    \t1l, \t1l, \t1h
        vand    \t1h, \t1h, \k32
            @N = I + J
            veor    \t2q, \t2q, \rq
    veor    \t0l, \t0l, \t0h
        veor    \t1l, \t1l, \t1h
            @t2 = (N) (P4 + P5) << 24
            veor    \t2l, \t2l, \t2h
            vand    \t2h, \t2h, \k16
                @t3 = (K) (P6 + P7) << 32
                veor    \t3l, \t3l, \t3h
                vmov.i64 \t3h, $0
    vext.8  \t0q, \t0q, \t0q, $15
            veor    \t2l, \t2l, \t2h
        vext.8  \t1q, \t1q, \t1q, $14
                vmull.p8 \rq, \ad, \bd
            vext.8  \t2q, \t2q, \t2q, $13
                vext.8  \t3q, \t3q, \t3q, $12
    veor    \t0q, \t0q, \t1q
    veor    \t2q, \t2q, \t3q
    veor    \rq, \rq, \t0q
    veor    \rq, \rq, \t2q
.endm

But why is it not correct as stated? Basically I need an efficient 64x64=128 arithmetic operation for arm-v7a, and looks like NEON is the only choice here.

It feels like losing a carry flag somewhere among these operations, because results I observe are quite close to correct ones.

0

There are 0 answers