I'm storing a UTF-8 character in eax
and later on, in processing, I need to know how many bytes make up the character.
I've narrowed down going about this that minimizes shifts and masks and wanted to know am I missing some neat trick somewhere?
Option 1: Brute Force
mov r11, 4 ; Maximum bytes
bt eax, 31 ; Test 4th MSB
jc .exit
dec r11 ; Lets try 3
bt eax, 23 ; Test 3rd MSB
jc .exit
dec r11 ; Lets try 2
bt eax, 15 ; Test 2nd MSB
jc .exit
dec r11 ; It's straight up ascii (1 byte)
.exit:
Note:
- I had the accumulation in the
eax
register wrong as pointed out by, well, everyone. - Both Margaret and Ped7g provided solutions and I learnt even more than expected.
If you can assume a correct encoding of the character, you can simply check where the highest zero is in the first code-unit (thanks to the auto-synch property of UTF-8).
The culprit being that for code-points of one code-unit the highest zero is bit 7. For the code-points of n code-units the highest bit is 7 - n (note the "discontinuity").
Assuming the first code-unit is in
al
.Note that
bsr
is not defined for an input of 0, but that can only happen for an invalid leading code-unit (of value 11111111b).You can detect the invalid 0xff code-unit with a
jz <error handler>
after thebsr
instruction.Thanks to @CodyGray for pointing out a bug on the original version.
Thanks to @PeterCorders for pointing out the XOR trick to do 7 - AL.