I find myself tuning a piece of code where memory is copied using `memcpy` and the third parameter (the size) is known at compile time. The consumer of the function calling `memcpy` does something similar to this:
```cpp
template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy(dstMemory, srcMemory, S);
}
```
Now, I would have expected the compiler to be smart enough to realise that for `foo<4>()` it can replace the `memcpy` in the function with a 32-bit integer assignment. However, I am surprised to find a >2x speedup from doing this:
```cpp
template <size_t size>
inline void memcpy_fixed(void* dst, const void* src) {
    memcpy(dst, src, size);
}

template <>
inline void memcpy_fixed<4>(void* dst, const void* src) {
    *((uint32_t*)dst) = *((uint32_t*)src);
}
```
and rewriting `foo` to:
```cpp
template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy_fixed<S>(dstMemory, srcMemory);
}
```
Both tests are on clang (OS X) with `-O3`. I really would have expected the `memcpy` intrinsic to be smarter about the case where the size is known at compile time.
My compiler flags are:

```
-gline-tables-only -O3 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
```
Am I asking too much of the C++ compiler, or is there some compiler flag I am missing?
If both source and destination buffers are provided as function parameters, then clang++ 3.5.0 uses `memcpy` only when `S` is big, but it uses a `movl` instruction when `S = 4`. However, in your code the source and destination addresses are not parameters of the function, and this seems to prevent the compiler from making this aggressive optimization.