I'm investigating how rep string instructions work. Regarding the description of the instructions the rep movsl for example has the following mnemonic.
rep movsl 5+4*(E)CX Move (E)CX dwords from [(E)SI] to ES:[(E)DI]
Where ES is extra segment register that should contain some offset to the beginning of extra segment in the memory.
The pseudo code of the operation look like below
while (CX != 0) {
*(ES*16 + DI) = *(DS*16 + SI);
SI++;
DI++;
CX--;
}
But it seems in flat memory model it is not true that rep string operates with extra segment.
For example I've created a test that creates 2 threads which copy an array to TLS (thread local storage) array using rep movs. Logically that should not work as TLS data is kept in GS segment, not in ES. But works. At least a see the correct result running the test. Intel compiler produces the following piece of code for coping.
movl %gs:0, %eax #27.18
movl $1028, %ecx #27.18
movl 32(%esp), %esi #27.18
lea TLS_data1@NTPOFF(%eax), %edi #27.18
movl %ecx, %eax #27.18
shrl $2, %ecx #27.18
rep #27.18
movsl #27.18
movl %eax, %ecx #27.18
andl $3, %ecx #27.18
rep #27.18
movsb #27.18
Here %edi points to TLS array and rep movs store there. In case rep mov uses ES offset implicitly (that I doubt), then such code should not produce the correct result.
Do I miss something here?
There is the test I created:
#define _MULTI_THREADED
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define NUMTHREADS 2
#define N 257
typedef struct {
int data1[N];
} threadparm_t;
__thread threadparm_t TLS_data1;
void foo();
void *theThread(void *parm)
{
int rc;
threadparm_t *gData;
pthread_t self = pthread_self();
printf("Thread %u: Entered\n", self);
gData = (threadparm_t *)parm;
TLS_data1 = *gData;
foo();
return NULL;
}
void foo() {
int i;
pthread_t self = pthread_self();
printf("\nThread %u: foo()\n", self/1000000);
for (i=0; i<N; i++) {
printf("%d ", TLS_data1.data1[i]);
}
printf("\n\n");
}
int main(int argc, char **argv)
{
pthread_t thread[NUMTHREADS];
int rc=0;
int i,j;
threadparm_t gData[NUMTHREADS];
printf("Enter Testcase - %s\n", argv[0]);
printf("Create/start threads\n");
for (i=0; i < NUMTHREADS; i++) {
/* Create per-thread TLS data and pass it to the thread */
for (j=0; j < N; j++) {
gData[i].data1[j] = i+1;
}
rc = pthread_create(&thread[i], NULL, theThread, &gData[i]);
}
printf("Wait for the threads to complete, and release their resources\n");
for (i=0; i < NUMTHREADS; i++) {
rc = pthread_join(thread[i], NULL);
}
printf("Main completed\n");
return 0;
}
What you are missing is that these ops:
are loading the zero-based address of the thread-local
TLS_data1
into%edi
, which will work fine with the zero-base ES segment.