Why do processors read only from aligned addresses?


So I was reading a few articles (and a few questions on StackOverflow) about memory alignment, and I understood why a struct like this:

struct A
{
  char c;
  int i;
};

will have padding. It is also clear that fetches from unaligned memory will be slower if the processor can only read from aligned offsets.
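For example, this quick check (a minimal C++ sketch; the exact layout is implementation-defined, but with a typical 4-byte int) shows the padding:

#include <cstddef>  // offsetof
#include <cstdio>

struct A
{
  char c;
  int i;
};

int main()
{
    // With a typical 4-byte int, the compiler inserts 3 bytes of padding
    // after `c` so that `i` starts at a 4-byte-aligned offset.
    std::printf("sizeof(A)      = %zu\n", sizeof(A));      // usually 8
    std::printf("offsetof(A, i) = %zu\n", offsetof(A, i)); // usually 4
    return 0;
}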

But why can the processor only read from aligned memory? Why can't it just read data from any address? You know, from Random-Access Memory...


There are 2 answers

Answer by user3344003:

It depends on the processor. Some processors do not allow unaligned accesses at all. Others can do them, but it tends to be slower.

In your example, if the fields are packed and unaligned access is permitted, reading i usually takes two fetches from memory.

Some processors take the performance hit of multiple accesses to retrieve unaligned data. Others simply do not allow it.
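To make that concrete, here is a minimal sketch (GCC/Clang's non-standard packed attribute is assumed; MSVC would use #pragma pack instead) where i ends up at offset 1, so reading it is an unaligned access:

#include <cstddef>
#include <cstdio>

// __attribute__((packed)) is a GCC/Clang extension, used here only to
// force `i` to an unaligned offset for illustration.
struct __attribute__((packed)) PackedA
{
  char c;
  int  i;  // starts at offset 1, i.e. not 4-byte aligned
};

int main()
{
    PackedA a{'x', 42};
    std::printf("offsetof(PackedA, i) = %zu\n", offsetof(PackedA, i)); // 1
    std::printf("sizeof(PackedA)      = %zu\n", sizeof(PackedA));      // 5
    // Reading a.i compiles to an unaligned load; on x86 it still works,
    // but it may need extra fetches if it straddles an alignment boundary.
    std::printf("a.i = %d\n", a.i);
    return 0;
}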

Answer by Angelicos Phosphoros:

Processors are optimized for the most common use case: correctly aligned data. That is why aligned reads are preferred.

But that doesn't answer your question: what makes an aligned read more efficient?

The answer is that modern processors don't like reading from memory. They read from their caches, which are orders of magnitude faster.

Those caches are organized into cache lines. A typical cache line is 64 bytes and is always aligned to its size (you can look up exact values here). When you read an aligned value whose size is less than or equal to the cache line, it is guaranteed to lie within a single cache line. For an unaligned value, it is possible that part of the value is in the cache while the other part is accessible only from RAM.
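If you want to see what line size your toolchain assumes, here is a minimal C++17 sketch (std::hardware_destructive_interference_size is optional and some standard libraries don't provide it):

#include <cstdio>
#include <new>

int main()
{
#ifdef __cpp_lib_hardware_interference_size
    // The cache-line size the standard library assumes for this target
    // (commonly 64 bytes on x86_64).
    std::printf("assumed cache line: %zu bytes\n",
                std::hardware_destructive_interference_size);
#else
    std::printf("hardware_destructive_interference_size not available\n");
#endif
    return 0;
}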

UPDATE: More detailed explanation.

Consider this instruction running on a modern x86_64 processor with 64-byte cache lines:

mov rax, qword ptr [rdi]

It reads a 64-bit number from the address in rdi into the register rax.

If rdi is properly aligned (meaning it is divisible by 8), the value is guaranteed to be in a single cache line. See the diagram visualizing all 8 possible locations:

  CACHE LINE
+--------------------------------------------------------------------------+
|  0 <- offset       8                 16                24                |
| +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+  |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |  |
| +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+  |
|                                                                          |
|  32                40                48                56                |
| +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+  |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |  |
| +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+  |
|                                                                          |
+--------------------------------------------------------------------------+

So in the case of a cache miss, it is limited to a single miss, because when the cache line is fetched, it brings in all 8 bytes of our value at once.

With a misaligned value, we can have a situation like this:

+--------------------+------------------+
|Cache line 1        | Cache line 2     |
|            +-+-+-+---+-+-+-+          |
|            | | | | | | | | |          |
|            +-+-+-+---+-+-+-+          |
|                    |                  |
+--------------------+------------------+

If our value crosses the boundary between cache lines, the CPU needs to load twice as much data from memory (two whole cache lines).
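The arithmetic behind the two diagrams can be sketched like this (the 64-byte line size is an assumption): an access stays in one cache line only if its first and last bytes map to the same line.

#include <cstdint>
#include <cstdio>

constexpr std::size_t kCacheLine = 64; // assumed line size

// True if a `size`-byte access at `addr` fits in a single cache line.
bool single_cache_line(std::uintptr_t addr, std::size_t size)
{
    return (addr / kCacheLine) == ((addr + size - 1) / kCacheLine);
}

int main()
{
    // An 8-byte load at offset 56 covers bytes 56..63 of one line.
    std::printf("%d\n", (int)single_cache_line(0x1000 + 56, 8)); // 1
    // The same load shifted to offset 60 covers bytes 60..67: two lines.
    std::printf("%d\n", (int)single_cache_line(0x1000 + 60, 8)); // 0
    return 0;
}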

The same thing applies to page faults, though I expect the chance of a particular read causing two TLB misses is quite low.

Also, writes to a misaligned value can be even worse, because of the CPU's cache-invalidation protocols: such a write invalidates both cache lines. It is worse still if those cache lines are frequently accessed from multiple threads. In fact, many efficient synchronization mechanisms that use atomic integers align them to the cache-line size to avoid unnecessary cache invalidations.

See, for example, crossbeam's CachePadded.
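CachePadded is Rust, but the same idea can be sketched in C++ with alignas (the 64-byte figure is an assumption; C++17's std::hardware_destructive_interference_size can supply it where available). This is only a sketch of the idea, not crossbeam's implementation:

#include <atomic>
#include <cstdint>
#include <thread>

// Give each atomic its own cache line (64 bytes assumed) so writes from
// different threads do not invalidate each other's line (no false sharing).
struct alignas(64) PaddedCounter
{
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == 64, "padded to one assumed cache line");

int main()
{
    PaddedCounter counters[2];
    std::thread a([&] { for (int i = 0; i < 1000000; ++i) counters[0].value++; });
    std::thread b([&] { for (int i = 0; i < 1000000; ++i) counters[1].value++; });
    a.join();
    b.join();
    return 0;
}

Each PaddedCounter occupies a whole line, so the two threads never write to the same cache line even though the counters sit next to each other in the array.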