I've had this question for quite a while now, trying to read lots of resources and understand what is going on, but I've still failed to get a good understanding of why things are the way they are.
Simply put, I'm trying to test how CAS would perform vs. synchronized in contended and uncontended environments. I've put up this JMH test:
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SandBox {

    Object lock = new Object();

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(SandBox.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    // per-thread state: each thread works on its own values
    @State(Scope.Thread)
    public static class Holder {

        private long number;
        private AtomicLong atomicLong;

        @Setup
        public void setUp() {
            number = ThreadLocalRandom.current().nextLong();
            atomicLong = new AtomicLong(number);
        }
    }

    @Fork(1)
    @Benchmark
    public long sync(Holder holder) {
        long n = holder.number;
        synchronized (lock) {
            n = n * 123;
        }
        return n;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong cas(Holder holder) {
        AtomicLong al = holder.atomicLong;
        al.updateAndGet(x -> x * 123);
        return al;
    }

    // shared state: all benchmark threads contend on the same lock/value
    private Object anotherLock = new Object();

    private long anotherNumber = ThreadLocalRandom.current().nextLong();
    private AtomicLong anotherAl = new AtomicLong(anotherNumber);

    @Fork(1)
    @Benchmark
    public long syncShared() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }
        return anotherNumber;
    }

    @Fork(1)
    @Benchmark
    public AtomicLong casShared() {
        anotherAl.updateAndGet(x -> x * 123);
        return anotherAl;
    }

    @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
    @Benchmark
    public long syncSharedNonBiased() {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }
        return anotherNumber;
    }
}
And the results:
Benchmark Mode Cnt Score Error Units
spinLockVsSynchronized.SandBox.cas avgt 5 212.922 ± 18.011 ns/op
spinLockVsSynchronized.SandBox.casShared avgt 5 4106.764 ± 1233.108 ns/op
spinLockVsSynchronized.SandBox.sync avgt 5 2869.664 ± 231.482 ns/op
spinLockVsSynchronized.SandBox.syncShared avgt 5 2414.177 ± 85.022 ns/op
spinLockVsSynchronized.SandBox.syncSharedNonBiased avgt 5 2696.102 ± 279.734 ns/op
In the non-shared case CAS is by far faster, which I would expect. But in the shared case, things are the other way around, and this I can't understand. I don't think this is related to biased locking, as that would only happen after a thread holds the lock for 5 seconds (AFAIK), and this does not happen here; the test is just proof of that.
I honestly hope it's just my tests that are wrong, and someone with JMH expertise will come along and point me to the wrong set-up here.
The main misconception is the assumption that you are comparing “CAS vs. synchronized”. Given how sophisticated JVMs implement synchronized, you are comparing the performance of a CAS-based algorithm using AtomicLong with the performance of the CAS-based algorithm used to implement synchronized.

Similar to Lock, the internal information for an object monitor basically consists of an int status telling whether it has been owned and how often it is nested, a reference to the current owner thread, and a queue of threads waiting to be able to acquire it. The expensive aspect is the waiting queue. Putting a thread into the queue, removing it from thread scheduling, and eventually waking it up when the current owner releases the monitor are operations that can take a significant time.
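As a rough sketch (the ToyMonitor type and its field names are made up here; real HotSpot monitors are considerably more involved), that state could look like:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual only, not the actual HotSpot layout: a status word, the owner,
// and the queue of threads blocked on the monitor.
final class ToyMonitor {
    final AtomicInteger status = new AtomicInteger(); // 0 = unowned, otherwise the nesting count
    volatile Thread owner;                            // thread currently holding the monitor
    final Queue<Thread> waiters = new ConcurrentLinkedQueue<>(); // the expensive part
}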
However, in the uncontended case, the waiting queue is, of course, not involved. Acquiring the monitor consists of a single CAS to change the status from “unowned” (usually zero) to “owned, acquired once” (guess the typical value). If successful, the thread can proceed with the critical action, followed by a release which implies just writing the “unowned” state with the necessary memory visibility and waking up another blocked thread, if there is one.
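In terms of the hypothetical ToyMonitor above, the uncontended fast path might look like this sketch (these would be methods of ToyMonitor; LockSupport is java.util.concurrent.locks.LockSupport):

// Uncontended fast path: a single CAS from "unowned" (0) to "owned, acquired once" (1).
void acquire() {
    Thread current = Thread.currentThread();
    if (status.compareAndSet(0, 1)) {  // the one CAS that matters when uncontended
        owner = current;
        return;
    }
    if (owner == current) {            // reentrant acquisition: just bump the nesting count
        status.incrementAndGet();
        return;
    }
    acquireSlow();                     // contended: spin, then enqueue (sketched below)
}

void release() {
    if (status.get() > 1) {            // still nested: only decrement
        status.decrementAndGet();
        return;
    }
    owner = null;
    status.set(0);                     // publish "unowned"; the volatile write gives the visibility
    Thread next = waiters.peek();
    if (next != null) LockSupport.unpark(next); // wake a blocked thread, if any
}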
Since the wait queue is the significantly more expensive thing, implementations usually try to avoid it even in the contended case by performing some amount of spinning, making several repeated CAS attempts before falling back to enqueuing the thread. If the critical action of the owner is as simple as a single multiplication, chances are high that the monitor will be released during the spinning phase already. Note that synchronized is “unfair”, allowing a spinning thread to proceed immediately, even if there are already enqueued threads waiting far longer.
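A sketch of that slow path, continuing the ToyMonitor illustration (the spin budget and the details are made up; real implementations are adaptive):

// Contended slow path: bounded spinning first, the expensive queue only as a fallback.
void acquireSlow() {
    Thread current = Thread.currentThread();
    for (int spins = 0; spins < 64; spins++) {  // arbitrary spin budget
        if (status.compareAndSet(0, 1)) {       // owner may release while we spin
            owner = current;
            return;                             // "unfair": we may overtake queued waiters
        }
        Thread.onSpinWait();
    }
    waiters.add(current);                       // now pay the expensive route
    while (!status.compareAndSet(0, 1)) {
        LockSupport.park();                     // descheduled until release() unparks us
    }
    waiters.remove(current);
    owner = current;
}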
If you compare the fundamental operations performed by synchronized(lock) { n = n * 123; } when no queuing is involved and by al.updateAndGet(x -> x * 123), you’ll notice that they are roughly on par. The main difference is that the AtomicLong approach will repeat the multiplication on contention, while for the synchronized approach, there is a risk of being put into the queue if no progress has been made during spinning.
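For reference, updateAndGet is documented to behave like a CAS retry loop; the function may be re-applied when the CAS loses a race, which is why contention means repeated multiplications:

// Roughly equivalent to al.updateAndGet(x -> x * 123): on a failed CAS the
// value is re-read and the multiplication is performed again.
static long updateAndGetEquivalent(AtomicLong al) {
    long prev, next;
    do {
        prev = al.get();
        next = prev * 123;
    } while (!al.compareAndSet(prev, next));
    return next;
}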
But synchronized allows lock coarsening for code repeatedly synchronizing on the same object, which might be relevant for a benchmark loop calling the syncShared method. Unless there’s also a way to fuse multiple CAS updates of an AtomicLong, this can give synchronized a dramatic advantage. (See also this article covering several aspects discussed above.)
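To illustrate what coarsening does (ops is a made-up parameter standing in for the benchmark’s inner iteration count; anotherLock and anotherNumber are the fields from the benchmark above):

// The measurement loop effectively executes repeated blocks on the same lock...
void beforeCoarsening(int ops) {
    for (int i = 0; i < ops; i++) {
        synchronized (anotherLock) {
            anotherNumber = anotherNumber * 123;
        }
    }
}

// ...which the JIT may legally transform into a single acquisition, paying
// the monitor cost once instead of once per iteration.
void afterCoarsening(int ops) {
    synchronized (anotherLock) {
        for (int i = 0; i < ops; i++) {
            anotherNumber = anotherNumber * 123;
        }
    }
}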
Note that due to the “unfair” nature of synchronized, creating far more threads than CPU cores doesn’t have to be a problem. In the best case, “number of threads minus number of cores” threads end up on the queue, never waking up, while the remaining threads succeed in the spinning phase, one thread on each core. But likewise, threads not running on a CPU core can’t reduce the performance of the AtomicLong update, as they can neither invalidate the current value for other threads nor make a failed CAS attempt.
In either case, when CASing on the member variable of an unshared object or when performing synchronized on an unshared object, the JVM may detect the local nature of the operation and elide most of the associated costs. But this may depend on several subtle environmental aspects.
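For instance, if escape analysis can prove that a lock object never escapes the creating thread, the monitor operations may be elided entirely. A sketch (whether this actually kicks in depends on the JIT and on escape analysis succeeding):

// The lock never escapes this method, so no other thread can ever contend
// for it; the JVM may remove the monitor operations altogether.
long localOnly() {
    Object localLock = new Object();   // provably thread-local
    long n = ThreadLocalRandom.current().nextLong();
    synchronized (localLock) {         // candidate for lock elision
        n = n * 123;
    }
    return n;
}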
The bottom line is that there is no easy decision between atomic updates and synchronized blocks. Things get far more interesting with more expensive operations, which may raise the likelihood of threads getting enqueued in the contended case of synchronized, and which in turn may make it acceptable that the operation has to be repeated in the contended case of an atomic update.