memcpy where size is known at compile time

I find myself tuning a piece of code where memory is copied using memcpy and the third parameter (size) is known at compile time.

The consumer of the function calling memcpy does something similar to this:

template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy(dstMemory, srcMemory, S);
}

Now, I would have expected that the memcpy intrinsic was smart enough to realise that this:

foo<4>()

... can replace the memcpy in the function with a 32-bit integer assignment. However, to my surprise, I see a >2x speedup when I do this instead:

template<size_t size>
inline void memcpy_fixed(void* dst, const void* src) {
    memcpy(dst, src, size);
}


template<>
inline void memcpy_fixed<4>(void* dst, const void* src) {
    *((uint32_t*)dst) = *((const uint32_t*)src);
}

And rewriting foo to:

template <size_t S>
void foo() {
    void* dstMemory = whateverA;
    void* srcMemory = whateverB;
    memcpy_fixed<S>(dstMemory, srcMemory);
}

Both tests are on clang (OS X) with -O3. I really would have expected the memcpy intrinsic to be smarter about the case where the size is known at compile time.

My compiler flags are:

-gline-tables-only -O3 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer

Am I asking too much of the C++ compiler, or is there some compiler flag I am missing?

There are 2 answers

0
dlask On

If both source and destination buffers are provided as function parameters:

template <size_t S>
void foo(char* dst, const char* src) {
    memcpy(dst, src, S);
}

then clang++ 3.5.0 calls memcpy only when S is large; when S = 4 it uses the movl instruction instead.

However, your source and destination addresses are not parameters of this function and this seems to prevent the compiler from making this aggressive optimization.
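If you want to check this on your own toolchain, here is a minimal sketch of the parameter-passing variant. The names copy_params and check_copy_params are hypothetical (they are not from the question), and the sizes 4 and 4096 are arbitrary picks for "small" and "big". The explicit instantiations are only there so that both versions end up in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical name: same shape as foo above, with the buffers
// passed in as parameters instead of being obtained inside the function.
template <std::size_t S>
void copy_params(char* dst, const char* src) {
    std::memcpy(dst, src, S);
}

// Explicit instantiations so that both sizes appear in the assembly output.
template void copy_params<4>(char*, const char*);
template void copy_params<4096>(char*, const char*);

// Small self-check: the copy must behave the same whichever
// instructions the compiler decides to emit.
inline void check_copy_params() {
    const char src[4] = {'a', 'b', 'c', 'd'};
    char dst[4] = {};
    copy_params<4>(dst, src);
    assert(std::memcmp(dst, src, 4) == 0);
}
```

Compiling this with clang++ -O3 -S and comparing the two instantiations should show whether your compiler inlines the small copy. The exact instructions will depend on the compiler version and target, so treat what I saw on 3.5.0 as one data point.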

6
Non-maskable Interrupt On

memcpy is not the same as *((uint32_t*)dst) = *((uint32_t*)src).

memcpy can deal with unaligned memory, whereas dereferencing a uint32_t* assumes that the address is suitably aligned for a uint32_t.
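To illustrate the difference, here is a rough sketch of a 4-byte copy that stays valid for unaligned addresses by going through a local uint32_t. The helper names load_u32, store_u32 and copy4_unaligned are made up for this example. Optimizing compilers commonly turn each fixed-size memcpy here into a single load or store, but that is something to confirm in the assembly for your own target rather than take for granted.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads 4 bytes from any address, aligned or not.
inline std::uint32_t load_u32(const void* p) {
    assert(p != nullptr);
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);  // fixed size, known at compile time
    return v;
}

// Hypothetical helper: writes 4 bytes to any address, aligned or not.
inline void store_u32(void* p, std::uint32_t v) {
    assert(p != nullptr);
    std::memcpy(p, &v, sizeof v);
}

// 4-byte copy that never dereferences a possibly misaligned uint32_t*.
inline void copy4_unaligned(void* dst, const void* src) {
    store_u32(dst, load_u32(src));
}
```

Unlike the cast-and-assign version, this does not require the buffers to be 4-byte aligned.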

By the way, most modern compilers do replace a memcpy of known size with suitable inline code. For small sizes they usually emit something like rep movsb, which may not be the fastest option but is good enough in most cases.

If you find that your particular case gains a 2x speedup and you think you need it, you are free to get your hands dirty (with clear comments).
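As a sketch of what "with clear comments" could look like, here is the specialization from the question with its assumptions written down and checked in debug builds. The alignment asserts are my addition, not something the question's code had, and the whole thing assumes the buffers really do hold uint32_t objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Generic case: leave it to the compiler's memcpy handling.
template <std::size_t N>
inline void memcpy_fixed(void* dst, const void* src) {
    std::memcpy(dst, src, N);
}

// Hand-tuned 4-byte case.
// ASSUMPTIONS (the caller must guarantee these):
//   * dst and src are both aligned for uint32_t
//   * the memory may legitimately be accessed as uint32_t
//     (otherwise this breaks the strict aliasing rules)
// The asserts below catch misalignment in debug builds only.
template <>
inline void memcpy_fixed<4>(void* dst, const void* src) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % alignof(std::uint32_t) == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % alignof(std::uint32_t) == 0);
    *static_cast<std::uint32_t*>(dst) = *static_cast<const std::uint32_t*>(src);
}
```

A call such as memcpy_fixed<4>(&b, &a) with two uint32_t variables satisfies both assumptions. Passing arbitrary char pointers does not, and that is exactly the case the comment is there to warn about.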