I've decided to measure incrementation with different locking strategies and using JMH for this purpose. I'm using JMH for checking throughput and average time as well as simple custom test for checking correctness. There are six strategies:
- Atomic count
- ReadWrite locking count
- Synchronizing with volatile
- Synchronizing block without volatile
- sun.misc.Unsafe.compareAndSwap
- sun.misc.Unsafe.getAndAdd
- Unsynchronizing count
Benchmark code:
@State(Scope.Benchmark)
@BenchmarkMode({Mode.Throughput, Mode.AverageTime})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class UnsafeCounter_Benchmark {
public Counter unsync, syncNoV, syncV, lock, atomic, unsafe, unsafeGA;
@Setup(Level.Iteration)
public void prepare() {
unsync = new UnsyncCounter();
syncNoV = new SyncNoVolatileCounter();
syncV = new SyncVolatileCounter();
lock = new LockCounter();
atomic = new AtomicCounter();
unsafe = new UnsafeCASCounter();
unsafeGA = new UnsafeGACounter();
}
@Benchmark
public void unsyncCount() {
unsyncCounter();
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void unsyncCounter() {
unsync.increment();
}
@Benchmark
public void syncNoVCount() {
syncNoVCounter();
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void syncNoVCounter() {
syncNoV.increment();
}
@Benchmark
public void syncVCount() {
syncVCounter();
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void syncVCounter() {
syncV.increment();
}
@Benchmark
public void lockCount() {
lockCounter();
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void lockCounter() {
lock.increment();
}
@Benchmark
public void atomicCount() {
atomicCounter();
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void atomicCounter() {
atomic.increment();
}
@Benchmark
public void unsafeCount() {
unsafeCounter();
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void unsafeCounter() {
unsafe.increment();
}
@Benchmark
public void unsafeGACount() {
unsafeGACounter();
}
@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public void unsafeGACounter() {
unsafeGA.increment();
}
public static void main(String[] args) throws RunnerException {
Options baseOpts = new OptionsBuilder()
.include(UnsafeCounter_Benchmark.class.getSimpleName())
.threads(100)
.jvmArgs("-ea")
.build();
new Runner(baseOpts).run();
}
}
And results of bench:
JDK 8u20
Benchmark Mode Samples Score Error Units
o.k.u.u.UnsafeCounter_Benchmark.atomicCount thrpt 5 42.178 ± 17.643 ops/us
o.k.u.u.UnsafeCounter_Benchmark.lockCount thrpt 5 24.044 ± 2.264 ops/us
o.k.u.u.UnsafeCounter_Benchmark.syncNoVCount thrpt 5 22.849 ± 1.344 ops/us
o.k.u.u.UnsafeCounter_Benchmark.syncVCount thrpt 5 20.235 ± 2.027 ops/us
o.k.u.u.UnsafeCounter_Benchmark.unsafeCount thrpt 5 12.460 ± 1.326 ops/us
o.k.u.u.UnsafeCounter_Benchmark.unsafeGACount thrpt 5 39.106 ± 2.966 ops/us
o.k.u.u.UnsafeCounter_Benchmark.unsyncCount thrpt 5 93.076 ± 9.674 ops/us
o.k.u.u.UnsafeCounter_Benchmark.atomicCount avgt 5 2.604 ± 0.133 us/op
o.k.u.u.UnsafeCounter_Benchmark.lockCount avgt 5 4.161 ± 0.546 us/op
o.k.u.u.UnsafeCounter_Benchmark.syncNoVCount avgt 5 4.440 ± 0.523 us/op
o.k.u.u.UnsafeCounter_Benchmark.syncVCount avgt 5 5.073 ± 0.439 us/op
o.k.u.u.UnsafeCounter_Benchmark.unsafeCount avgt 5 9.088 ± 5.964 us/op
o.k.u.u.UnsafeCounter_Benchmark.unsafeGACount avgt 5 2.611 ± 0.164 us/op
o.k.u.u.UnsafeCounter_Benchmark.unsyncCount avgt 5 1.047 ± 0.050 us/op
The most of measurement as I expect, except UnsafeCounter_Benchmark.unsafeCount
which is used sun.misc.Unsafe.compareAndSwapLong
with while
loop. It the the slowest locking.
public void increment() {
long before = counter;
while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
before = counter;
}
}
I suggest that low performance is because of while loop and JMH makes higher contention, but when I've checked correctness by Executors
I get figures as I expect:
Counter result: UnsyncCounter 97538676
Time passed in ms:259
Counter result: AtomicCounter 100000000
Time passed in ms:1805
Counter result: LockCounter 100000000
Time passed in ms:3904
Counter result: SyncNoVolatileCounter 100000000
Time passed in ms:14227
Counter result: SyncVolatileCounter 100000000
Time passed in ms:19224
Counter result: UnsafeCASCounter 100000000
Time passed in ms:8077
Counter result: UnsafeGACounter 100000000
Time passed in ms:2549
Correctness test code:
public class UnsafeCounter_Test {
static class CounterClient implements Runnable {
private Counter c;
private int num;
public CounterClient(Counter c, int num) {
this.c = c;
this.num = num;
}
@Override
public void run() {
for (int i = 0; i < num; i++) {
c.increment();
}
}
}
public static void makeTest(Counter counter) throws InterruptedException {
int NUM_OF_THREADS = 1000;
int NUM_OF_INCREMENTS = 100000;
ExecutorService service = Executors.newFixedThreadPool(NUM_OF_THREADS);
long before = System.currentTimeMillis();
for (int i = 0; i < NUM_OF_THREADS; i++) {
service.submit(new CounterClient(counter, NUM_OF_INCREMENTS));
}
service.shutdown();
service.awaitTermination(1, TimeUnit.MINUTES);
long after = System.currentTimeMillis();
System.out.println("Counter result: " + counter.getClass().getSimpleName() + " " + counter.getCounter());
System.out.println("Time passed in ms:" + (after - before));
}
public static void main(String[] args) throws InterruptedException {
makeTest(new UnsyncCounter());
makeTest(new AtomicCounter());
makeTest(new LockCounter());
makeTest(new SyncNoVolatileCounter());
makeTest(new SyncVolatileCounter());
makeTest(new UnsafeCASCounter());
makeTest(new UnsafeGACounter());
}
}
I know that it is very awful test, but in this case Unsafe CAS two times faster than Sync variants and everything goes as expected. Could somebody clarify described behavior? For more information please see GitHub repo: Bench, Unsafe CAS counter
Thinking out loud: it is remarkable how often people do 90% of the tedious work, and leave the 10% (where the fun begins) for someone else! All right, I'm taking all the fun!
Let me repeat the experiment first on my i7-4790K, 8u40 EA:
Truly, something seems fishy about
unsafeCount
test. Really, you have to presume all data is fishy before you validated it. For nanobenchmarks, you have to validate the generated code to see if you actually measure something you want to measure. In JMH, it is very quickly doable with-prof perfasm
. In fact, if you look at the hottest region ofunsafeCount
there, you will notice a few funny things:Translation: a)
offset
field gets re-read on each iteration -- because CAS memory effects imply volatile read, and therefore the field needs to be pessimistically re-read; b) the hilarious part is thatunsafe
field is also being re-read for a typecheck -- for the same reason.This is why high-performance code should look like this:
If you do that, the
unsafeCount
performance boosts right up:...which is fairly close to synchronized tests now, given the error bounds. If you look at the
-prof perfasm
now, this is anunsafeCount
loop:This loop is very tight, and it seems nothing can make it go faster. We spend most of the time loading the "updated" value and actually CAS-ing it. But we contend a lot! To figure out if contention is the leading cause, let's add backoffs:
...running:
Voila. We do more work in the loop, but it saves us from contending a lot. I tried to explain this before in "Nanotrusting the Nanotime", it might be good to go back there and read up more on benchmarking methodology, especially when heavy-weight operations are measured. This highlights the pitfall in the entire experiment, not only with
unsafeCount
.Exercise for the OP and interested readers: explain why
unsafeGACount
andatomicCount
perform much faster than other tests. You have the tools now.P.S. Running N threads on machine having C (C < N) threads is silly: you might think you have "contention" with N threads, but instead you are running and "contending" C threads only. It is especially amusing when people do 1000 threads on 4 core machine...
P.P.S. Time check: 10 minutes to do the profiling and additional experiments, 20 minutes to write it up. And how much time you wasted replicating the result by hand? ;)