I'm copying elements from one array to another in C++. I found the rep movs instruction in x86 that seems to copy an array at ESI to an array at EDI of size ECX. However, neither the for nor while loops I tried compiled to a rep movs instruction in VS 2008 (on an Intel Xeon x64 processor). How can I write code that will get compiled to this instruction?
What C++ code compiles down to the x86 REP instruction?
4k views Asked by securelsh AtThere are 6 answers
 On
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                Honestly, you shouldn't. REP is sort of an obsolete holdover in the instruction set, and actually pretty slow since it has to call a microcoded subroutine inside the CPU, which has a ROM lookup latency and is nonpipelined as well.
In almost every implementation, you will find that the memcpy() compiler intrinsic both is easier to use and runs faster.
 On
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                Under MSVC there are the __movsxxx & __stosxxx intrinsics that will generate a REP prefixed instruction. 
there is also a 'hack' to force intrinsic memset aka REP STOS under vc9+, as the intrinsic no longer exits, due to the sse2 branching in the crt. this is better that __stosxxx due to the fact the compiler can optimize it for constants and order it correctly.
#define memset(mem,fill,size) memset((DWORD*)mem,((fill) << 24|(fill) << 16|(fill) << 8|(fill)),size)
__forceinline void memset(DWORD* pStart, unsigned long dwFill, size_t nSize)
{
    //credits to Nepharius for finding this
    DWORD* pLast = pStart + (nSize >> 2);
    while(pStart < pLast)
        *pStart++ = dwFill;
    if((nSize &= 3) == 0)
        return;
    if(nSize == 3)
    {
        (((WORD*)pStart))[0]   = WORD(dwFill);
        (((BYTE*)pStart))[2]   = BYTE(dwFill);
    }
    else if(nSize == 2)
        (((WORD*)pStart))[0]   = WORD(dwFill);
    else
        (((BYTE*)pStart))[0]   = BYTE(dwFill);
}
of course REP isn't always the best thing to use, imo your way better off using memcpy, it'll branch to either sse2 or REPS MOV based on your system (under msvc), unless you feeling like writing custom assembly for 'hot' areas...
 On
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                REP and friends was nice once upon a time, when the x86 CPU was a single-pipeline industrial CISC-processor.
But that has changed. Nowadays when the processor encounters any instruction, the first it does is translating it into an easier format (VLIW-like micro-ops) and schedules it for future execution (this is part of out-of-order-execution, part of scheduling between different logical CPU cores, it can be used to simplifying write-after-write-sequences into single-writes, et.c.). This machinery works well for instructions that translates into a few VLIW-like opcodes, but not machine-code that translates into loops. Loop-translated machine code will probably cause the execution pipeline to stall.
Rather than spending hundreds of thousands of transistors into building CPU-circuitry for handling looping portions of the micro-ops in the execution pipeline, they just handle it in some sort of crappy legacy-mode that stutterly stalls the pipeline, and ask modern programmers to write your own damn loops!
Therefore it is seldom used when machines write code. If you encounter REP in a binary executable, its probably a human assembly-muppet who didn't know better, or a cracker that really needed the few bytes it saved to use it instead of an actual loop, that wrote it.
(However. Take everything I just wrote with a grain of salt. Maybe this is not true anymore. I am not 100% up to date with the internals of x86 CPUs anymore, I got into other hobbies..)
 On
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                I use the rep* prefix variants with cmps*, movs*, scas* and stos* instruction variants to generate inline code which minimizes the code size, avoids unnecessary calls/jumps and thereby keeps down the work done by the caches. The alternative is to set up parameters and call a memset or memcpy somewhere else which may overall be faster if I want to copy a hundred bytes or more but if it's just a matter of 10-20 bytes using rep is faster (or at least was the last time I measured).
Since my compiler allows specification and use of inline assembly functions and includes their register usage/modification in the optimization activities it is possible for me to use them when the circumstances are right.
 On
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                On a historic note - not having any insight into the manufacturer's strategies - there was a time when the "rep movs*" (etc) instructions were very slow. I think it was around the time of the Pentium/Pentium MMX. A colleague of mine (who had more insight than I) said that the manufacturers had decreased the chip area (<=> fewer transistors/more microcode) allocated to the rep handling and used it to make other, more used instructions faster.
In the fifteen years or so since rep has become relatively speaking faster again which would suggest more transistors/less microcode.
If you need exactly that instruction - use built-in assembler and write that instruction manually. You can't rely on the compiler to produce any specific machine code - even if it emits it in one compilation it can decide to emit some other equivalent during next compilation.