I wrote the following test class in Java to reproduce the performance penalty introduced by "false sharing".
Basically, you can tweak the size of the array from 4 to a much larger value (e.g. 10000) to turn the false-sharing phenomenon on or off. Specifically, when size = 4 the four threads update values that almost certainly sit within the same cache line, causing far more frequent cache-coherency misses. In theory, the program should run much faster with size = 10000 than with size = 4.
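To make the layout concrete, here is a small sketch that computes which cache line each thread's array slot falls in, assuming 64-byte lines and 4-byte ints and ignoring the array object header (both assumptions, since the real line size and header layout are platform- and JVM-dependent):

```java
public class LineLayout {
    public static void main(String[] args) {
        int cacheLine = 64; // assumed cache line size in bytes
        for (int size : new int[]{4, 10000}) {
            int interval = size / 4;
            System.out.print("size=" + size + " lines:");
            for (int t = 0; t < 4; t++) {
                int index = 5000 + interval * t;
                // which 64-byte line this element's byte offset falls in
                // (ignoring the array object header)
                System.out.print(" " + (index * 4 / cacheLine));
            }
            System.out.println();
        }
        // prints "size=4 lines: 312 312 312 312"      -> one shared line
        // and    "size=10000 lines: 312 468 625 781"  -> four distinct lines
    }
}
```

With size = 4 all four slots land on the same line, so every write by one core invalidates the line in the other cores' caches; with size = 10000 the slots are 10000 bytes apart and never interfere.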
I ran the same test on two different machines multiple times:
Machine A: Lenovo X230 laptop with an Intel® Core™ i5-3210M processor (2 cores, 4 threads), Windows 7 64-bit
size = 4 => 5.5 seconds
size = 10000 => 5.4 seconds
Machine B: Dell OptiPlex 780 with an Intel® Core™2 Duo E8400 processor (2 cores), Windows XP 32-bit
size = 4 => 14.5 seconds
size = 10000 => 7.2 seconds
I later ran the tests on a few other machines, and quite clearly false sharing only becomes noticeable on certain machines; I couldn't figure out the decisive factor that makes the difference.
Can anyone kindly take a look at this problem and explain why the false sharing introduced in this test class is only noticeable on certain machines?
public class FalseSharing {

    interface Oper {
        int eval(int value);
    }

    // Try tweaking the size: 4 packs the four hot slots into one cache line,
    // 10000 spreads them thousands of bytes apart.
    static int size = 4;

    // Try tweaking the operation applied on each iteration.
    static Oper op = new Oper() {
        @Override
        public int eval(int value) {
            return value + 2;
        }
    };

    static int[] array = new int[10000 + size];
    static final int interval = size / 4;

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Each thread hammers its own slot; the slots are `interval` ints apart.
        Thread t1 = new Thread(new Runnable() {
            @Override
            public void run() {
                System.out.println("Array index:" + 5000);
                for (int j = 0; j < 30; j++) {
                    for (int i = 0; i < 1000000000; i++) {
                        array[5000] = op.eval(array[5000]);
                    }
                }
            }
        });
        Thread t2 = new Thread(new Runnable() {
            @Override
            public void run() {
                System.out.println("Array index:" + (5000 + interval));
                for (int j = 0; j < 30; j++) {
                    for (int i = 0; i < 1000000000; i++) {
                        array[5000 + interval] = op.eval(array[5000 + interval]);
                    }
                }
            }
        });
        Thread t3 = new Thread(new Runnable() {
            @Override
            public void run() {
                System.out.println("Array index:" + (5000 + interval * 2));
                for (int j = 0; j < 30; j++) {
                    for (int i = 0; i < 1000000000; i++) {
                        array[5000 + interval * 2] = op.eval(array[5000 + interval * 2]);
                    }
                }
            }
        });
        Thread t4 = new Thread(new Runnable() {
            @Override
            public void run() {
                System.out.println("Array index:" + (5000 + interval * 3));
                for (int j = 0; j < 30; j++) {
                    for (int i = 0; i < 1000000000; i++) {
                        array[5000 + interval * 3] = op.eval(array[5000 + interval * 3]);
                    }
                }
            }
        });
        t1.start();
        t2.start();
        t3.start();
        t4.start();
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        System.out.println("Finished! " + (System.currentTimeMillis() - start));
    }
}
Your code is probably fine. Here is a simpler version, with results:
Results:
So, to answer: it confirms the 64-byte cache line theory, at least on my Core i5 laptop.
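For readers who want to reproduce this, here is a minimal sketch of what such a simpler benchmark can look like (my own illustration, not the answer's original code; the thread count, iteration count, and the choice of `offset = 16` longs, i.e. 128 bytes, as the "far apart" case are assumptions based on 64-byte lines):

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {
    static final int ITERS = 20_000_000;

    // Two threads each increment their own counter, `offset` longs apart.
    // AtomicLongArray is used so the JIT cannot hoist the writes out of the loop.
    // Returns the total number of increments (deterministic); prints the elapsed time.
    static long run(int offset) throws InterruptedException {
        final AtomicLongArray slots = new AtomicLongArray(64);
        Runnable w0 = () -> { for (int i = 0; i < ITERS; i++) slots.incrementAndGet(0); };
        Runnable w1 = () -> { for (int i = 0; i < ITERS; i++) slots.incrementAndGet(offset); };
        long start = System.nanoTime();
        Thread a = new Thread(w0), b = new Thread(w1);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("offset=" + offset + " took "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
        return slots.get(0) + slots.get(offset);
    }

    public static void main(String[] args) throws InterruptedException {
        run(1);  // adjacent longs: almost certainly the same 64-byte cache line
        run(16); // 128 bytes apart: guaranteed to be on different 64-byte lines
    }
}
```

On machines where false sharing bites, the `offset=1` run should be noticeably slower than the `offset=16` run; the absolute numbers will of course vary by CPU.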