I need to implement the function:
promote(
unsigned char* __restrict__ dest,
unsigned char* __restrict__ src,
size_t src_element_size,
size_t dest_element_size,
size_t n
);
which takes an array of n (unsigned) integers, each of which is represented by src_element_size consecutive bytes, and writes as output an array of n integers, each represented by another number of bytes, dest_element_size. Let's assume dest_element_size > src_element_size, and that both arrays are aligned appropriately for their own type.
What would be a fast/the fastest way to perform this conversion?
Notes:
- This is not a memcpy(), nor memset() + memcpy(). For example, if the source element size is 1 and the target element size is 2, and the source bytes are 123, 123, 123, then the destination bytes are 123, 0, 123, 0, 123, 0. You could think of it as a "strided memcpy()" I suppose.
- The endianness is either the machine endianness or little-endianness, whichever you wish to assume.
- src_element_size and dest_element_size are typically small and power-of-two (1, 2, 4, 8), but I'd rather get answers which are relevant to somewhat larger sizes (say < 200), and also for non-power-of-two sizes (3, 6 etc.).
- The function itself may not be templated, but a suggestion for templating the implementation would be valid. The reason is that this will sit in a compiled library, and code using it will not be recompiled with the template definition/header with _Generic.
- No multi-threading / GPUs / Xeon Phis / or other exotic hardware.
- src_element_size is less-than-or-equal-to dest_element_size.
If you can't use C _Generic or C++ templates, then you have to use a memset zero + memcpy. The result will be "left-aligned", so this will only work for little endian.
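A minimal sketch of that memset zero + memcpy approach, applied once per element, could look like the following. The name promote_le and the const qualifier on src are assumptions made for illustration; the logic relies on little-endian layout, as noted above.

```c
#include <stddef.h>
#include <string.h>

/* Little-endian only: clear each destination element, then copy the source
   bytes into its first (least significant) bytes.
   promote_le is a hypothetical name, not part of the question's API. */
void promote_le(unsigned char* restrict dest,
                const unsigned char* restrict src,
                size_t src_element_size,
                size_t dest_element_size,
                size_t n)
{
    for (size_t i = 0; i < n; i++) {
        memset(dest, 0, dest_element_size);   /* zero the whole element  */
        memcpy(dest, src, src_element_size);  /* fill the low-order part */
        dest += dest_element_size;
        src  += src_element_size;
    }
}
```

With the sizes known only at run time, the compiler cannot specialize the inner calls, so this is best seen as a correct baseline rather than the fastest option.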
An endianness-independent version will be a bit more intricate. You would have to replace memcpy with a loop which bit-shifts the individual bytes into place.
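One way to sketch such a version for arbitrary byte sizes is shown below. Note that it swaps the bit-shift loop for plain byte placement: it probes the host byte order at run time and puts the zero padding before or after the value accordingly. The names promote_any_endian and host_is_little_endian are hypothetical, and the sketch assumes the data is in the machine's own byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char*)&probe == 1;
}

/* Zero-extends n elements from src_element_size to dest_element_size bytes,
   assuming src_element_size <= dest_element_size and host byte order. */
void promote_any_endian(unsigned char* restrict dest,
                        const unsigned char* restrict src,
                        size_t src_element_size,
                        size_t dest_element_size,
                        size_t n)
{
    const size_t pad = dest_element_size - src_element_size;
    const int little = host_is_little_endian();
    /* Little endian: value bytes first, padding last. Big endian: reversed. */
    const size_t value_offset = little ? 0 : pad;
    const size_t zero_offset  = little ? src_element_size : 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < src_element_size; j++) {
            dest[value_offset + j] = src[j];
        }
        for (size_t j = 0; j < pad; j++) {
            dest[zero_offset + j] = 0;
        }
        dest += dest_element_size;
        src  += src_element_size;
    }
}
```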
Or possibly something like this:
As someone pointed out in a comment, you should declare both character pointers as restrict to tell the compiler that it can assume they aren't pointing at the same location, for a bit of micro-optimization.
A solution with _Generic might look something like (not tested):
Better yet, you can adjust this for even better type safety:
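As a minimal sketch of the _Generic idea (an illustration, not the answer's original code), assuming C11 and only the fixed-width types uint8_t, uint16_t and uint32_t: one typed loop is generated per size pair, and a macro picks the right one from the pointer types. The names DEFINE_PROMOTE, PROMOTE and promote_u8_u16 etc. are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Generates one typed conversion loop; plain assignment zero-extends
   and is independent of byte order. */
#define DEFINE_PROMOTE(name, dest_t, src_t)                       \
    static void name(dest_t* restrict dest,                       \
                     const src_t* restrict src, size_t n)         \
    {                                                             \
        for (size_t i = 0; i < n; i++) {                          \
            dest[i] = src[i];                                     \
        }                                                         \
    }

DEFINE_PROMOTE(promote_u8_u16,  uint16_t, uint8_t)
DEFINE_PROMOTE(promote_u8_u32,  uint32_t, uint8_t)
DEFINE_PROMOTE(promote_u16_u32, uint32_t, uint16_t)

/* Dispatches on both pointer types. Every inner selection needs a default,
   because unselected branches are still checked; an unsupported pair ends
   up "calling" (void)0, which fails to compile. Only non-const source
   pointers are listed here to keep the sketch short. */
#define PROMOTE(dest, src, n)                                     \
    _Generic((dest),                                              \
        uint16_t*: _Generic((src),                                \
            uint8_t*:  promote_u8_u16,                            \
            default:   (void)0),                                  \
        uint32_t*: _Generic((src),                                \
            uint8_t*:  promote_u8_u32,                            \
            uint16_t*: promote_u16_u32,                           \
            default:   (void)0)                                   \
    )((dest), (src), (n))
```

Since the library function itself takes run-time sizes, a non-generic promote() could switch on the size pair and forward to these typed loops, falling back to a byte-wise loop for sizes such as 3 or 6.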