Are compilers generally able to condense multiple contiguous memory copies into one operation?


Something like the following:

struct Vec2
{
    int x, y;
};

struct Bounds
{
    int left, top, right, bottom;
};

int main()
{
    Vec2 topLeft = {5, 5};
    Vec2 bottomRight = {10, 10};
    Bounds bounds;
    // Here is the copy operation
    // Note the fields aren't assigned in contiguous memory order; harder for the compiler?
    bounds.left = topLeft.x;
    bounds.bottom = bottomRight.y;
    bounds.top = topLeft.y;
    bounds.right = bottomRight.x;
}

Those four assignments could be done like so:

memcpy(&bounds, &topLeft, sizeof(Vec2));
memcpy(&bounds.right, &bottomRight, sizeof(Vec2));
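One thing worth noting: the two-memcpy version quietly assumes that Bounds is laid out as two back-to-back Vec2-shaped halves with no padding. A minimal sketch of how that assumption could be checked at compile time (the static_asserts are my illustration, not part of the original code; memcpy itself needs <cstring>):

#include <cstddef>      // offsetof
#include <cstring>      // memcpy
#include <type_traits>  // std::is_trivially_copyable

struct Vec2   { int x, y; };
struct Bounds { int left, top, right, bottom; };

// The memcpy trick relies on {left, top} and {right, bottom} each mirroring a Vec2.
static_assert(std::is_trivially_copyable<Bounds>::value, "memcpy requires trivially copyable types");
static_assert(offsetof(Bounds, top)   == offsetof(Vec2, y), "first half must mirror Vec2");
static_assert(offsetof(Bounds, right) == sizeof(Vec2),      "second half must start right after the first");
static_assert(sizeof(Bounds)          == 2 * sizeof(Vec2),  "no trailing padding expected");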

I'm wondering two things:

  1. Are compilers usually able to optimise in this way? (See the sketch after this list.)
  2. Are four separate int copies equivalent in cost to two int-pair copies, given that copying memory is O(n)?
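For what it's worth, one way to check the first question directly is to put the four assignments into a function whose result the optimizer can't discard, and inspect the generated code with optimizations enabled (for example on Compiler Explorer). The function below is only an illustrative harness, not code from the question:

struct Vec2   { int x, y; };
struct Bounds { int left, top, right, bottom; };

// Passing the inputs by reference and returning the result keeps the stores
// observable, so the optimizer can't delete them the way it can with the
// unused locals in main().
Bounds makeBounds(const Vec2& topLeft, const Vec2& bottomRight)
{
    Bounds bounds;
    bounds.left   = topLeft.x;
    bounds.bottom = bottomRight.y;
    bounds.top    = topLeft.y;
    bounds.right  = bottomRight.x;
    return bounds;
}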

I got the following disassembly results for the four copies:

bounds.left = topLeft.x;
00007FF642291034  mov         dword ptr [bounds],5  
bounds.bottom = bottomRight.y;
00007FF64229103C  mov         dword ptr [rsp+2Ch],0Ah  
bounds.top = topLeft.y;
00007FF642291044  mov         dword ptr [rsp+24h],5  
bounds.right = bottomRight.x;
00007FF64229104C  mov         dword ptr [rsp+28h],0Ah  

And confusingly, the two memcpys compile to different instruction sequences; I don't understand why the first differs from the second:

memcpy(&bounds, &topLeft, sizeof(Vec2));
00007FF64229105E  mov         rbx,qword ptr [topLeft]   // This is only one instruction
memcpy(&bounds.right, &bottomRight, sizeof(Vec2));
00007FF642291063  mov         rdi,qword ptr [bottomRight]  // Compared to 6?  
00007FF642291068  mov         qword ptr [bounds],rbx  
00007FF64229106D  mov         qword ptr [rsp+28h],rdi  
00007FF642291072  jmp         main+7Eh (07FF64229107Eh)  
00007FF642291074  mov         rdi,qword ptr [rsp+28h]  
00007FF642291079  mov         rbx,qword ptr [bounds]  

1 Answer

Answered by MSalters

Any modern compiler supporting threads has to consider instruction dependencies and reordering. With that machinery in place, it will quickly discover that there are no dependencies among the assignments you have, which means they can be reordered into linear memory order and then combined.
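To illustrate that point (my sketch, not MSalters' code): because each assignment writes a distinct Bounds field and reads a distinct source value, the compiler may treat the scrambled version as if it had been written in ascending address order, and adjacent 4-byte stores are then candidates for fusion into wider ones:

struct Vec2   { int x, y; };
struct Bounds { int left, top, right, bottom; };

Bounds makeBoundsOrdered(const Vec2& topLeft, const Vec2& bottomRight)
{
    Bounds bounds;
    // Same stores as in the question, just written in ascending address order.
    bounds.left   = topLeft.x;      // offset 0
    bounds.top    = topLeft.y;      // offset 4  -> offsets 0..7 may merge into one 8-byte store
    bounds.right  = bottomRight.x;  // offset 8
    bounds.bottom = bottomRight.y;  // offset 12 -> offsets 8..15 may merge into another
    return bounds;
}

Whether a particular compiler actually performs the fusion is a separate question; the point is only that no dependency forbids it.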

Not that it likely matters; the CPU cache will just load the whole cache line on first access and flush the whole cache line at some later point. It's these memory operations that take time, not the CPU instructions themselves.