Efficient UTF-8 character-length decoding for a non-zero character in a 32-bit register


I'm storing a UTF-8 character in eax and later on, in processing, I need to know how many bytes make up the character.

I've narrowed it down to an approach that minimizes shifts and masks, and wanted to know: am I missing some neat trick somewhere?

Option 1: Brute Force

    mov     r11, 4      ;   Maximum bytes
    bt      eax, 31     ;   Test MSB of byte 4
    jc      .exit
    dec     r11         ;   Let's try 3
    bt      eax, 23     ;   Test MSB of byte 3
    jc      .exit
    dec     r11         ;   Let's try 2
    bt      eax, 15     ;   Test MSB of byte 2
    jc      .exit
    dec     r11         ;   It's straight-up ASCII (1 byte)
.exit:
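
For reference, this assumes the same register layout Ped7g's answer below uses: the first code unit sits in the highest occupied byte of eax. A few illustrative values (the layout is taken from that answer; the specific characters are my examples):

    ; 'A' (41)          => eax = 0x00000041 ; all tested bits clear -> 1 byte
    ; 'É' (c3 89)       => eax = 0x0000c389 ; bit 15 set            -> 2 bytes
    ; '€' (e2 82 ac)    => eax = 0x00e282ac ; bit 23 set            -> 3 bytes
    ; '𝄞' (f0 9d 84 9e) => eax = 0xf09d849e ; bit 31 set            -> 4 bytes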

Note:

  1. I had the accumulation in the eax register wrong as pointed out by, well, everyone.
  2. Both Margaret and Ped7g provided solutions and I learnt even more than expected.

There are 2 answers

Margaret Bloom (best answer)

If you can assume a correct encoding of the character, you can simply check where the highest zero is in the first code-unit (thanks to the auto-synch property of UTF-8).

The caveat is that for code points of one code unit the highest zero is bit 7, while for code points of n code units (n ≥ 2) the highest zero is bit 7 - n (note the "discontinuity" at n = 1).

Assuming the first code-unit is in al.

    not     al             ;Transform the highest 0 into the highest 1
    movzx   eax, al        ;Widen: bsr has no 8-bit form
    bsr     eax, eax       ;Find the index (from bit 0) of the highest 1
    xor     eax, 7         ;Perform 7 - index
                           ;This gives 0 for single code unit code points
    mov     ecx, 1
    cmovz   eax, ecx       ;Change 0 back to 1 (cmov has no 8-bit form either)

Note that bsr is not defined for an input of 0, but that can only happen for an invalid leading code-unit (of value 11111111b).

You can detect the invalid 0xff code-unit with a jz <error handler> after the bsr instruction.
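
Putting the snippet and the error check together, a minimal callable sketch (the routine name and the zero-return convention for errors are mine, not from the answer):

    GetUtf8LenFromLead:             ; leading code unit in al
        not     al                  ; highest 0 becomes highest 1
        movzx   eax, al             ; widen for bsr
        bsr     eax, eax            ; ZF=1 only for the invalid 0xff lead
        jz      .invalid
        xor     eax, 7              ; 7 - index; 0 for single-unit code points
        mov     ecx, 1
        cmovz   eax, ecx            ; map 0 back to a length of 1
        ret
    .invalid:
        xor     eax, eax            ; report 0 for an invalid leading byte
        ret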

Thanks to @CodyGray for pointing out a bug on the original version.
Thanks to @PeterCordes for pointing out the XOR trick to compute 7 - index.

Ped7g

If you insist on the reversed order of bytes (for whatever weird reason), you can still simply scan for the first bit set to 1, divide by 8, and add 1 to get the number of bytes.

GetReversedShiftedUtf8BytesCount:
    ; eax = UTF8 code in reversed byte order, first code unit in the top non-zero byte
    ; 'É' (c3 89) => eax = 0x0000c389
    bsr ecx,eax
    cmovz ecx,eax   ; needed only for eax = 0
      ; ^ if eax is never 0 on input, this "cmovz" can be removed
    shr ecx,3
    inc ecx
    ret

As you put the first byte of the char into the most significant position, bsr will produce bit number 15, 23, or 31 for multi-byte chars; for 7-bit ASCII it will produce anything from 0 to 6. The division by 8 maps all of them correctly either way; it doesn't care.
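
A few worked values of that mapping (my illustration, reusing the 'É' example from the code above):

    ; 'A' => eax = 0x00000041, bsr = 6,  6>>3 = 0, +1 = 1 byte
    ; 'É' => eax = 0x0000c389, bsr = 15, 15>>3 = 1, +1 = 2 bytes
    ; '€' => eax = 0x00e282ac, bsr = 23, 23>>3 = 2, +1 = 3 bytes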

This routine should actually work with valid UTF-8 in normal byte order too, because continuation bytes always have their top bit set.

For an invalid UTF-8 code ending with zero byte(s) it will return the wrong number of bytes (the trailing zero bytes are not counted).


There's of course, as always, also a LUT solution possible:

    movzx  ecx,al
    shr    ecx,3
    movzx  ecx,byte [utf8lengthLUT + ecx]  ; +rcx for 64b
    ; ecx = number of bytes or 0 for invalid leading byte value
    ...

utf8lengthLUT:                     ; 32B look-up table for upper 5b of 1st byte
    db     1, 1, 1, 1, 1, 1, 1, 1  ; 00000 - 00111 ; single byte
    db     1, 1, 1, 1, 1, 1, 1, 1  ; 01000 - 01111 ; single byte
    db     0, 0, 0, 0, 0, 0, 0, 0  ; 10000 - 10111 ; not valid leading byte
    db     2, 2, 2, 2              ; 11000 - 11011 ; two bytes code point
    db     3, 3                    ; 11100 - 11101 ; three bytes code point
    db     4                       ; 11110         ; four bytes code point
    db     0                       ; 11111         ; not valid leading byte

I didn't debug it, just ran it through nasm as a syntax check. Nor did I profile it, of course. :) Looking at how short that bsr variant is, I doubt this will be seriously faster even on CPUs where bsr hurts.

But this one treats invalid UTF-8 leading bytes differently: instead of detecting the highest non-zero byte and returning its number + 1 (which is insensitive to the leading byte's content), it decodes the leading byte properly and returns 0 when the leading bits are wrong. A correct leading byte with an incorrect 2nd+ byte (like c3 00) will still return 2, though, while the first variant returns 1 in such a case.

(It's possible to go with only a 16B LUT, if you don't care about detecting the invalid 11111 leading byte and are willing to treat it as a 4-byte code point.)
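
That 16B variant might look like this (my sketch, not part of the original answer; indexing by the top 4 bits merges the 11110 and 11111 rows):

    movzx  ecx,al
    shr    ecx,4
    movzx  ecx,byte [utf8lengthLUT16 + ecx]  ; +rcx for 64b
    ; ecx = number of bytes or 0 for a continuation byte

utf8lengthLUT16:                   ; 16B look-up table for upper 4b of 1st byte
    db     1, 1, 1, 1, 1, 1, 1, 1  ; 0000 - 0111 ; single byte
    db     0, 0, 0, 0              ; 1000 - 1011 ; not valid leading byte
    db     2, 2                    ; 1100 - 1101 ; two bytes code point
    db     3                       ; 1110        ; three bytes code point
    db     4                       ; 1111        ; four bytes (or invalid 11111)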


BTW, there are some open-source i18n libraries doing all those things: validating UTF-8 input, fixing invalid sequences, counting characters, etc. Some of them have been around for more than a decade... and they are still getting bug reports and fixes. Which is a subtle hint at how difficult it is to write this stuff correctly (without exposing the app to some input-data exploit). :)

(Plus, consider how many (fixing) edits these two answers have received... :) )

And one more off-topic piece of advice: if you ever write something in PHP that is supposed to process UTF-8 input data (not just from an untrusted source, but even from a trusted one), especially if that input comes from a GET/POST request... just don't do it on your own. Never. Get a framework for that. :)