ARM LL/SC exclusive access by register width or cache line width?


I'm working on the next release of my lock-free data structure library, using LL/SC on ARM.

For my use-case of LL/SC, I need to use it with a single STR between the LDREX and STREX. (Rather than using it to emulate CAS.)

Now, I've written the code and it works. What concerns me, however, is the possibility that it may not always work. I've read that on PowerPC, if you access the same cache line as the LL/SC target, you break the LL/SC.

So I'm thinking if my STR target is on the same cache line as my LL/SC target, then pow, I'm dead.

Now, the LL/SC target and STR targets are always in different malloc()s so the chance of them being directly in the same cache line is probably small (and I can guarantee this by padding the LL/SC target so it begins on a cache line boundary and fills that cache line).

But there could be false sharing, if the STR target is in just the right (wrong!) place in memory.

The LDREX/STREX documentation describes exclusive access in terms of "the physical address". This seems to imply register width granularity, not cache line width granularity.

And that's my question - is LDREX/STREX sensitive to other memory accesses at register width granularity or at cache line width granularity?


There are 2 answers

llongi On BEST ANSWER

ARM uses Exclusive Monitors to implement exclusive access to memory via load-linked/store-conditional. [1] has all the details; the important part here, I'd say, is the following:

Exclusives Reservation Granule

When an exclusive monitor tags an address, the minimum region that can be tagged for exclusive access is called the Exclusives Reservation Granule (ERG). The ERG is implementation defined, in the range 8-2048 bytes, in multiples of two bytes. Portable code must not assume anything about ERG size.
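To make the worst case in that quote concrete, here is a minimal sketch of an LL/SC target that is aligned to, and padded out to, the largest granule the quoted text allows. The names `ERG_MAX_BYTES`, `llsc_target` and `llsc_target_alloc` are made up for illustration, and the sketch assumes a C11 library that provides `aligned_alloc`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 2048 bytes is the largest ERG the quoted text allows, so
 * aligning and padding to it covers any granule size in that range. */
#define ERG_MAX_BYTES 2048u

/* Hypothetical LL/SC target: one word, alone in its 2048-byte region. */
typedef struct {
    volatile uint32_t word;
    unsigned char pad[ERG_MAX_BYTES - sizeof(uint32_t)];
} llsc_target;

/* Allocate a target that starts on a 2048-byte boundary and fills the
 * whole region. Returns NULL on failure; release it with free(). */
static llsc_target *llsc_target_alloc(void)
{
    llsc_target *t = aligned_alloc(ERG_MAX_BYTES, sizeof(llsc_target));
    if (t != NULL)
        t->word = 0;
    return t;
}
```

This only keeps other allocations out of the target's granule. It does not by itself guarantee that a store elsewhere leaves the reservation intact on a given implementation.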

So as I see it, you're out of luck on granularity. Most real implementations will probably keep the ERG small, but as far as I can tell the basic ARM architecture doesn't guarantee it; maybe someone with more experience will prove me wrong. :) Also, pretty much every LL/SC implementation out there is some form of weak LL/SC, so you can almost never be sure whether a store between the LL and the SC will kill the SC always, most of the time, or never. It's so architecture and implementation dependent that I personally stick to using LL/SC to implement CAS in a tight loop, use that as usual, and am done with it.
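For reference, the "CAS in a tight loop" approach can be sketched with C11 atomics. This is an illustration, not code from any particular library: `fetch_add_cas` is a hypothetical helper, and the statement that the weak compare-exchange is typically lowered to an LDREX/STREX pair on ARMv6 and later is about common compiler behaviour, not something the C standard guarantees.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add delta to *p with a CAS retry loop; returns the old value.
 * Nothing else is stored between the load and the conditional store, so
 * the loop does not depend on the reservation granule size. */
static uint32_t fetch_add_cas(_Atomic uint32_t *p, uint32_t delta)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + delta,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* On failure, old now holds the current value; just retry. */
    }
    return old;
}
```

The weak variant is allowed to fail spuriously, which matches the weak LL/SC behaviour described above; the loop absorbs those failures.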

[1] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0008a/CJAGCFAF.html

old_timer On

Note that LDREX/STREX do not do what many folks think they do. They are for multiprocessor systems; uniprocessor systems should consider using swap. ARM docs are usually very good, but in this particular case they have a huge gap. Linux has been using these instructions improperly, and that has been noted by companies with uniprocessor ARM cores (Linux has MANY ARM-related bugs due to folks adding code without the proper research; every version that comes out has to be repaired). If you have the L1 cache on in a uniprocessor system you are okay, because the cache supports that access type. If the access hits the AXI bus, though, the AMBA/AXI spec tells hardware engineers that uniprocessor systems don't need to support that transaction type. Unfortunately the ARM ARM/TRM tells software engineers to stop using swap and start using LDREX/STREX. That is inconsistent: one side is told don't do this, the other side is told do this, and nothing good comes of it.

This is not an answer to your question, just general information about those instructions to try to educate folks on their proper use and the risks involved. (Yes, been there, done that, got burned by the use of those instructions, had to patch Linux (on top of other Linux patches).)

EDIT....more detail

In the ARM ARM:

Historically, support for shared memory synchronization has been with the read-locked-write operations
that swap register contents with memory; the SWP and SWPB instructions described in...

...

ARMv6 provides a new mechanism to support more comprehensive non-blocking shared-memory synchronization primitives
that scale for multiple-processor system designs.

...

The swap and swap byte instructions are deprecated in ARMv6. It is recommended that all software
migrates to using the new synchronization primitives.

...

Uniprocessor systems are only required to support the non-shared memory model, allowing them to support
synchronization primitives with the minimum amount of hardware overhead.

...

Multi-processor systems are required to implement an address monitor for each processor.


STREX:

<Rd> Specifies the destination register for the returned status value. The value returned is:
0  if the operation updates memory
1 if the operation fails to update memory.


MemoryAccess(B-bit, E-bit)
if ConditionPassed(cond) then
  processor_id = ExecutingProcessor()
  physical_address = TLB(Rn)
  if IsExclusiveLocal(physical_address, processor_id, 4) then
    if Shared(Rn) == 1 then
      if IsExclusiveGlobal(physical_address, processor_id, 4) then
        Memory[Rn,4] = Rm
        Rd = 0
        ClearExclusiveByAddress(physical_address,processor_id,4)
      else
        Rd = 1
    else
      Memory[Rn,4] = Rm
      Rd = 0
  else
    Rd = 1
  ClearExclusiveLocal(processor_id)
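
The decision logic in that pseudocode can be restated as a small C function. This is only a toy model: the three flags stand in for IsExclusiveLocal, Shared and IsExclusiveGlobal and are plain inputs here, and `strex_status` is a made-up name, not an ARM-defined function.

```c
#include <stdbool.h>

/* Toy model of the quoted STREX pseudocode. Returns the status value
 * that would be written to Rd: 0 if memory is updated, 1 if not. */
static int strex_status(bool local_exclusive, bool shared, bool global_exclusive)
{
    if (!local_exclusive)
        return 1;   /* local monitor does not hold the address: no store */
    if (shared && !global_exclusive)
        return 1;   /* shared memory and the global monitor lost it */
    return 0;       /* store performed */
}
```

Note that for non-shared memory the global monitor is never consulted, which is the path a uniprocessor system is expected to take.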

AMBA/AXI spec

The ARLOCK[1:0] or AWLOCK[1:0] signal selects exclusive access, and the RRESP[1:0] 
or BRESP[1:0] signal (see Table 7-1 on page 7-2) indicates the success or failure 
of the exclusive access.

...

If the master attempts an exclusive read from a slave that does not support exclusive
accesses, the slave returns the OKAY response instead of the EXOKAY response. The
master can treat this as an error condition indicating that the exclusive access is not
supported. It is recommended that the master not perform the write portion of this
exclusive operation.

...

b00 OKAY
b01 EXOKAY

...

ARLOCK/AWLOCK

b00 normal access
b01 exclusive access

So on the software side, the ARM ARM is telling us to use LDREX/STREX instead of swap, in part because it scales to multiprocessor, shared-memory systems. BUT it also tells us that uniprocessor systems are not required to support shared memory synchronization. So even on the software side there is a clue that you should think twice about it...

We know from the description of STREX that if it returned rd = 0 the store was exclusive and it worked; if rd = 1 it was not exclusive (or it failed for other reasons). LDREX and STREX are done in pairs: the shared memory system logic looks for the pair at the same address, and the hardware verifies that there was no other access to that address between the two. Who are you worried about getting in between the two? 1) yourself, if you get interrupted or swapped out at exactly that point (you would have to be damn lucky), 2) another processor using that memory. What Linux does, from what I remember, is go into a tight infinite loop:

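while(1)
{
  ldrex                /* load the value and mark the address exclusive */
  strex                /* try the store; rd = 0 on success, 1 on failure */
  if(rd==0) break;     /* leave the loop only once the store went through */
}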

Now on a uniprocessor system, both the ARM ARM and the AMBA/AXI spec suggest that the hardware does not need to support shared access, because that is simpler (why would you add that complexity?).

Here is what you don't see as a programmer. ARLOCK or AWLOCK is set for the ldrex and strex; if you are implementing shared access in hardware, you care about these bits. If you are implementing shared access, you return EXOKAY to the strex if there were no accesses between the two. EXOKAY is b01, which in the strex pseudocode is the exclusive-global case, and rd = 0. If the hardware returns OKAY, b00, the access was not exclusive and rd = 1 for the strex. And the AMBA/AXI spec says it is okay to return OKAY for exclusive accesses if you don't support a shared system. So on a uniprocessor that has not implemented exclusive accesses, strex can and/or will ALWAYS get OKAY, never EXOKAY, which means the strex will NEVER get rd = 0 and Linux hangs in the infinite loop.
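
The hang can be illustrated with a host-side simulation. Everything here is a stand-in: `slave_exclusive_write` models a bus slave that either does or does not support exclusive accesses, `strex_rd` maps the response to rd as described above, and the retry loop is bounded only so that the sketch terminates.

```c
/* Response encodings from the quoted AMBA/AXI text. */
enum axi_resp { AXI_OKAY = 0, AXI_EXOKAY = 1 };

/* Hypothetical slave: answers EXOKAY only if it supports exclusives. */
static enum axi_resp slave_exclusive_write(int supports_exclusive)
{
    return supports_exclusive ? AXI_EXOKAY : AXI_OKAY;
}

/* rd as described in the text: 0 for EXOKAY, 1 for OKAY. */
static int strex_rd(enum axi_resp resp)
{
    return resp == AXI_EXOKAY ? 0 : 1;
}

/* Bounded version of the ldrex/strex retry loop. Returns the number of
 * attempts used, or -1 if rd never became 0 within max_attempts. */
static int retry_loop(int supports_exclusive, int max_attempts)
{
    for (int i = 1; i <= max_attempts; i++) {
        if (strex_rd(slave_exclusive_write(supports_exclusive)) == 0)
            return i;
    }
    return -1;
}
```

With a slave that never answers EXOKAY, `retry_loop` exhausts its attempts; the unbounded loop would spin forever.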

The true Linux bug here is that the code we saw at the time said: if (ARMv6 or newer) then use LDREX/STREX, else use SWP. To fix the bug: if (ARMv6 or newer and multiprocessor) then use LDREX/STREX, else use SWP.

This applies to anyone else who wants to use LDREX/STREX for any other reason, which is what caught my eye in this ticket.
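
As a sketch, the corrected selection rule looks like this in C. The enum and the function `choose_primitive` are invented for illustration and are not taken from the Linux sources.

```c
#include <stdbool.h>

typedef enum { USE_SWP, USE_LDREX_STREX } sync_primitive;

/* Corrected rule from the text: LDREX/STREX only on ARMv6 or newer AND
 * multiprocessor; everything else falls back to SWP. */
static sync_primitive choose_primitive(int arm_arch_version, bool multiprocessor)
{
    if (arm_arch_version >= 6 && multiprocessor)
        return USE_LDREX_STREX;
    return USE_SWP;
}
```

The buggy rule corresponds to dropping the `multiprocessor` term from the condition.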

Now you ask, what does the cache have to do with it? The L1 cache is inside the processor core, so a hit doesn't go out on the AXI/AMBA bus. The L1 returns EXOKAY for a strex, and/or it fully implements sharing. So if the L1 cache is on, you will get an EXOKAY (first time or eventually, I am not sure).

Now you ask, what if there is a cache miss? Well, first off, if the L1 cache is off, the access hits the L2 cache boundary without the cacheable bits on, so the L2 cache passes it on as is and it goes out as exclusive. With the L1 cache on and a hit, it returns EXOKAY as above (eventually or always, I don't know). If the L1 misses, it does a cache line fill, which is a cacheable, NOT LOCKED read. That read either hits or misses in L2; if L2 misses, it goes out to the vendor-specific logic, which in this case returns OKAY, but that is okay because it was not LOCKED anyway; it was a normal access. Once the L2 and L1 are filled, the L1 performs the original transfer and returns EXOKAY.

Now here is the kicker. First, it is a waste and a risk to implement this in hardware, so I would expect uniprocessor ARMv6 and newer systems to not return EXOKAY; you have to test that on a case-by-case basis. Second, it is a PITA to get Linux running with the cache off; that took a bit of work, in fact. So you are not likely to see this in Linux normally. But the problem is there, people have seen it, and any time you use those instructions yourself you should be careful to use them properly. It should be painfully simple to test a system with bare-metal code to see if it is going to hang; the code should take a few seconds/minutes to write. It likely takes longer to put the system in a state where you can try that code (interrupt the bootloader, jump in with JTAG, etc.).
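
A minimal sketch of such a test, written as GCC-style inline assembly with a bounded retry count so it reports failure instead of hanging. The helper names are hypothetical, the preprocessor guard assumes a 32-bit ARM target that actually implements LDREX/STREX, and the non-ARM branch exists only so the file still compiles on a development host. To reproduce the situation described above, it would need to be run with the caches off.

```c
#include <stdint.h>

#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 6) && \
    !defined(__ARM_ARCH_6M__)
/* One LDREX/STREX pair on *p, storing back the value just loaded.
 * Returns the STREX status: 0 = store done, 1 = store refused. */
static uint32_t exclusive_pair(volatile uint32_t *p)
{
    uint32_t value, status;
    __asm__ volatile(
        "ldrex %0, [%2]\n\t"
        "strex %1, %0, [%2]\n\t"
        : "=&r"(value), "=&r"(status)
        : "r"(p)
        : "memory");
    return status;
}
#else
/* Host fallback (assumption): pretend the pair always succeeds. */
static uint32_t exclusive_pair(volatile uint32_t *p)
{
    *p = *p;
    return 0;
}
#endif

/* Returns the number of attempts needed for STREX to succeed, or -1 if
 * it never did, which is the case that would hang an unbounded loop. */
static int probe_exclusive(volatile uint32_t *p, int max_attempts)
{
    for (int i = 1; i <= max_attempts; i++) {
        if (exclusive_pair(p) == 0)
            return i;
    }
    return -1;
}
```

A result of -1 from something like `probe_exclusive(&word, 1000)` would suggest the memory system is not answering exclusive accesses for that address.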