Do getOpaque / memory_order_relaxed / READ_ONCE influence the processor, or just the compiler during memory hoisting?


I've had some discussion with multiple people on this issue, and there are some points that make the usage of memory-ordering fences in load situations somewhat confusing.

The first bullet point seems to be:

  • a) The level of processor reordering is limited by the amount of method de-virtualization done by the compiler.

    • a.1) The processor can reorder instructions WITHIN a virtual method, BUT NOT OUT of the virtual method (unless de-virtualized).

    • a.2) The processor CAN reorder virtual method calls relative to OTHER virtual method calls... AS LONG AS these methods DO NOT depend on each other (no reference interdependence).

  • b) Optimizations such as parallelization ARE APPLIED to specific mathematical operations performed on the ALU, and NEVER to the programmer's "macro" logic code... even if they could be.

  • c) The processor would NEVER do something outlandish like "memory hoisting"; these events are ALWAYS a compiler-side responsibility. (see "d")

  • d) Register promotion (the processor's bare-bones "memory hoisting" mechanic) can only occur on primitives and requires explicit instructions.

    • d.1) My assumption is that libraries such as Java's Math would make use of such optimizations, but outside of this, compilers will never make a processor promote a register UNLESS NO OTHER register is involved in the concurrent mathematical operation.

On Java's compiler...

  • e) Java's compiler and (JIT) implementations NEVER parallelize COMPLEX interdependencies, even if they could.

    • e.1) Only on simple loops where no interdependencies are involved... like the Arrays library's pure-functional Array/Collection construction.
  • f) Activities such as memory hoisting are done at this level, NOT at the processor level.

      • Explanation of f: just think about it. If there is NO specific instruction at the processor level to promote (or prevent the promotion of) registers for memory other than primitives, then latency prevention can only be achieved at the compiler level via hoisting... WHILE actions like double-checking DO INVOLVE the processor, since the processor has the means to reorder and simplify the code... damaging the double-check in the process.
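To make the double-checking point concrete, here is the classic Java double-checked-locking idiom (class and field names here are my own illustration, not from the question): it is the volatile on the field, rather than any hoisting prevention, that keeps the processor/JIT from publishing a half-constructed object.

```java
// Classic volatile-based double-checked locking in Java.
// Without volatile, the processor (and the JIT) could reorder the field
// initialization with the publication store, breaking the second check.
class Lazy {
    private static volatile Lazy instance;
    final int value;

    private Lazy() { this.value = 42; }

    static Lazy get() {
        Lazy local = instance;           // first check, no lock (volatile read)
        if (local == null) {
            synchronized (Lazy.class) {
                local = instance;        // second check, under the lock
                if (local == null) {
                    instance = local = new Lazy(); // volatile write publishes safely
                }
            }
        }
        return local;
    }
}
```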

Now let's examine the argument:

"Ok, it works now, but... what about weakly memory ordered processors?"

It seems to me that volatile AND getAcquire loads are "levers" meant to handle COMPILER + processor reordering BEHAVIOR... volatile (seq_cst) and acquire DO HAVE influence over preceding and subsequent loads and stores....

BUT getOpaque/memory_order_relaxed is JUST a compiler handle... one that does NOTHING at the processor level... depending on the case, be it hoisting prevention... or simplification prevention.

Since ALL it prevents is hoisting, the processor has no idea a getOpaque was even used... (?), depending on the case.

The biggest clue supporting my assumption is the name the Linux kernel gives this fence: READ_ONCE(), as if to tell the compiler "this will not be read more than once, please do NOT hoist-optimize".
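For reference, a minimal self-contained sketch (class and field names are my own) of what an opaque read looks like through Java's VarHandle API: the getOpaque load has to be re-performed on each loop iteration (it cannot be hoisted out or coalesced away), yet it imposes no ordering on surrounding memory operations.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// A spin-wait that reads a flag with getOpaque: the load is guaranteed to be
// re-issued every iteration, so the loop eventually observes the writer's store.
public class OpaqueSpinDemo {
    static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findStaticVarHandle(OpaqueSpinDemo.class, "flag", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    static int flag = 0; // plain field, accessed through the VarHandle

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            FLAG.setOpaque(1); // opaque store: visible "eventually", no ordering
        });
        writer.start();
        // Opaque read: reloaded each iteration, but no acquire semantics.
        while ((int) FLAG.getOpaque() == 0) {
            Thread.onSpinWait();
        }
        try { writer.join(); } catch (InterruptedException ignored) {}
        System.out.println("observed flag = " + flag);
    }
}
```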


So, de-virtualization is a behavior that "almost never happens" according to some answers on this site, and even if it DOES, it will only occur in situations where the scope is "contextually acknowledged"... aka private/protected methods (inheritance by extension...)

This exempts: public methods (relative to an outer scope), lambdas, and static methods (not so sure about this one, tbh).

In the next code I will try to demonstrate how a specific syntax will prevent load hoisting by taking advantage of method virtualization. Note: for double-checking, getOpaque/relaxed is still required.

IF... for the code:

   public class PlainClass {
      int plain = 0;

      final Runnable readRunnable = () -> {
         int localPlain = this.plain; // One important factor of WHY this will always
                                      // work properly is HOW the load is the first
                                      // thing to happen in the stack.
                                      // Even if the load is done within the 'while'
                                      // of a CAS, the load is never hoisted, but...
                                      // "Ok, it works now, but... what about weakly memory ordered processors?"
         print(localPlain);
      };

      final MyExecutor exec;

      PlainClass(Executor executor) {
         long delay = 20; // millis
         this.exec = new MyExecutor(executor, delay, readRunnable); // built in the constructor, please pardon my laziness...
      }

      public void read() {
         exec.execute(); // will execute the inner final runnable after the delay.
      }

      public void setPlain(int value) {
         this.plain = value;
      }

      public static void main(String[] args) {
         Executor ex = Executors...
         PlainClass pc = new PlainClass(ex);
         print(pc.plain); // will read 0;
         pc.read();       // will read 3 (after the delay);
         pc.setPlain(3);
      }
   }

The source of the pc.plain read... even if coming from the same register, the compiler will be unable to hoist the load (even if executed inside a loop), simply because the load is hidden inside the de-reference of a virtual method call.

The runnable instance is decoupled from the executor, so that it will not be devirtualized/inlined when the MyExecutor class is compiled.

The read is then performed through 2 levels of indirection, and the load will not be hoisted when main performs the 2 sequential prints.


Using getOpaque to prevent this... in this case would only incur the overhead of the introspection + cast (I mean, even volatile would be preferable to getOpaque, tbh).

And, because memory hoisting is a pure compiler behavior, the argument of "Ok, it works now, but... what about weakly memory ordered processors?" has no impact, since the compilers will remain the same even if the processor changes...

So, I think, the real question would be: what makes memory_order_relaxed/READ_ONCE/getOpaque so special in the eyes of the processor?

Alternatively, if getOpaque is widely used to prevent simplification/hoisting (during for-loop inlining) or on recurrent loads, then it would make sense for the behavior to be implemented as a pair of reordering atomic fences within the jump loop, rather than as a "targeted" load/store reordering rule on a specific register... unless this is what is actually happening under the hood with relaxed:

//pardon my pseudo code


   T local;

   // ^^^ ...keep reordering above this point... ^^^

   atomic_fence(); // Nothing above can go below, and vice versa.
   local = fieldVal;
   atomic_fence(); // Nothing below can go above, and vice versa.

   // vvv ...keep reordering below this point... vvv

   return local;
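As a point of comparison for the pseudocode above, Java actually exposes standalone fences of this kind on VarHandle. Here is a sketch (class and method names are mine) where an acquire-flavored load is composed out of an opaque (relaxed) load followed by VarHandle.acquireFence(), while a bare opaque load carries no fence at all:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch: building an acquire-like load from an opaque load plus an explicit
// fence, using the standalone fence methods Java exposes on VarHandle.
public class FenceSketch {
    static final VarHandle X;
    static {
        try {
            X = MethodHandles.lookup()
                 .findStaticVarHandle(FenceSketch.class, "x", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    static int x = 7;

    // Roughly equivalent in effect to X.getAcquire():
    static int loadAcquireViaFence() {
        int v = (int) X.getOpaque(); // relaxed load: coherent, but unordered
        VarHandle.acquireFence();    // later loads/stores may not move above this
        return v;
    }

    public static void main(String[] args) {
        System.out.println(loadAcquireViaFence());
    }
}
```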

So what I believe is that the hoisting behavior is NOT related to the reordering one. Hence memory_order_relaxed and analogous instructions are doing 2 jobs:

  • ONE) to communicate to the compiler to prevent compiler hoisting + reordering.
  • TWO) to prevent reordering (hence simplification) at the processor level.

Will my code work to perform double-checking? Absolutely NOT.

Will it work as a concurrent proactive load of a given register? Absolutely... but... what about weakly memory ordered processors?
