A simple piece of code using uint32_t can be optimised better than the same code using uint8_t. As far as I know, this is because uint8_t is usually a typedef for unsigned char, and the character types have exemptions from the strict aliasing rules. Consider:
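#include <cstdint>

using T = uint32_t; // change to uint8_t to see the difference

// Global pointers: their own storage is visible to the compiler as objects
T *a;
T *b;
T *c;

void mult(int num)
{
    for (int count = 0; count < num; count++)
    {
        a[count] = b[count] * c[count];
    }
}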
https://godbolt.org/z/sW1xnTrhc
At -O1 this has an inner loop of:
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov r8d, dword ptr [rcx + 4*rdi]
imul r8d, dword ptr [rax + 4*rdi]
mov dword ptr [rdx + 4*rdi], r8d
inc rdi
cmp rsi, rdi
jne .LBB0_2
Note that in this case it simply loads one value, does a multiply, stores the result, and loops. This is good. However, if I use uint8_t (https://godbolt.org/z/doM4o6ena) I get this inner loop from clang:
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov rsi, qword ptr [rip + b] # see here
mov rax, qword ptr [rip + c] # see here
movzx eax, byte ptr [rax + rdx]
mul byte ptr [rsi + rdx]
mov rsi, qword ptr [rip + a] # see here
mov byte ptr [rsi + rdx], al
inc rdx
cmp rcx, rdx
jne .LBB0_2
Note that this inner loop reloads the values of a, b and c on every single iteration. As I understand it, this is because the storage of the pointers a, b and c themselves may alias with what they point to, so a store through a could change any of the pointers, and the compiler must treat each iteration separately and reload them. This gets even worse at higher optimisation levels: using uint16_t or uint32_t with -O3, the compiler does all sorts of SIMD/XMM wizardry, but the uint8_t/char loop remains stubbornly simple and unoptimised.
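To make the aliasing concern concrete, here is a minimal sketch of the case the compiler has to allow for. It assumes uint8_t is a typedef for unsigned char (true on mainstream platforms, and checked by the static_assert). The names buf1, buf2 and store_changes_pointer are hypothetical and only there for illustration; they are not part of my real code.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Assumption: uint8_t is a typedef for unsigned char, as on mainstream platforms.
static_assert(std::is_same<std::uint8_t, unsigned char>::value,
              "this sketch assumes uint8_t is unsigned char");

// Hypothetical buffers and global pointer, named for illustration only.
std::uint8_t buf1[4];
std::uint8_t buf2[4];
std::uint8_t *a = buf1;

// Overwrites the pointer object `a` using nothing but uint8_t stores.
// Returns true if `a` went from pointing at buf1 to pointing at buf2.
bool store_changes_pointer()
{
    a = buf1; // start from a known state
    std::uint8_t *before = a;
    std::uint8_t *target = buf2;

    // Character types may alias any object, so both pointer objects can be
    // viewed as plain arrays of bytes.
    std::uint8_t *dst = reinterpret_cast<std::uint8_t *>(&a);
    const std::uint8_t *src = reinterpret_cast<const std::uint8_t *>(&target);

    for (std::size_t i = 0; i < sizeof a; ++i)
    {
        dst[i] = src[i]; // a uint8_t store that modifies `a` itself
    }

    return before == buf1 && a == buf2;
}
```

Since a byte store like dst[i] = src[i] can legally change a, the compiler cannot keep a, b or c in registers across the store in the uint8_t loop. With uint32_t, a store through a T* is not allowed to modify a pointer object, so the pointers can be loaded once outside the loop.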
Note that I am not asking for ways around this using restrict or by avoiding global variables. Nor am I asking for ways to optimise this specific example.
What I am asking is whether there is a simple 8-bit arithmetic type I can use that can't fall into this trap.