I am about to write a class representing a double-ended queue, much like std::deque, but able to store any trivially destructible type, and without indexing support. Iteration and pop operations only work if the caller knows the types that were stored. Most of the time it will be used as queue- or stack-like storage that other objects hold references into, as references are guaranteed to stay valid even if elements are pushed or popped at either end in between. Memory should be allocated and freed in large blocks, and the system's page size seems like a perfect fit for the default block size.
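For reference, this is a minimal sketch of the block-chaining idea I have in mind (all names are hypothetical, and only push_back is shown to keep it short): elements are placement-constructed into fixed-size blocks that are linked together and never reallocated, so references stay stable.

```cpp
#include <cstddef>
#include <new>
#include <type_traits>

class BlockDeque {
    static constexpr std::size_t kBlockSize = 4096; // default: a typical page size

    struct Block {
        Block* next = nullptr;            // previously filled block
        std::size_t used = 0;             // bytes used in data[]
        alignas(std::max_align_t) unsigned char data[kBlockSize];
    };

    Block* tail_ = nullptr;               // block currently being filled

public:
    template <class T>
    T* push_back(const T& value) {
        static_assert(std::is_trivially_destructible_v<T>,
                      "only trivially destructible types are supported");
        // Round the write position up to T's alignment.
        std::size_t offset = tail_
            ? (tail_->used + alignof(T) - 1) / alignof(T) * alignof(T)
            : 0;
        if (!tail_ || offset + sizeof(T) > kBlockSize) {
            Block* b = new Block;         // fresh block; old ones are never moved
            b->next = tail_;
            tail_ = b;
            offset = 0;
        }
        T* p = new (tail_->data + offset) T(value); // placement-construct in place
        tail_->used = offset + sizeof(T);
        return p;                         // stays valid for the deque's lifetime
    }

    ~BlockDeque() {
        while (tail_) {
            Block* n = tail_->next;
            delete tail_;                 // no element destructors needed
            tail_ = n;
        }
    }
};
```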
I assume that page-aligned allocation of the page-sized memory chunks would make sense in order to reduce cache misses when iterating over the elements of the container. Is this true?
I'm afraid that using std::aligned_alloc() with page alignment and page size will lead to the heap's metadata (which, for example, free() later needs in order to know the size of the allocation) being stored in front of the allocated memory, wasting nearly one page per allocated page. Or is there some optimization, or a different heap API, for page-aligned allocations? For example, I could imagine an API that lets the client specify where to store the required metadata, or one that returns both a pointer to the memory and a pointer to the metadata on allocation.
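For context, the call I am worried about looks like this. std::aligned_alloc (C++17) requires the size to be a multiple of the alignment, which a page-sized, page-aligned request satisfies by definition; 4096 is an assumed page size here:

```cpp
#include <cstdint>
#include <cstdlib>

// Request one page-aligned, page-sized chunk from the C heap.
// std::aligned_alloc requires size to be a multiple of alignment,
// which page-sized blocks satisfy trivially.
inline void* alloc_page(std::size_t page = 4096) {  // 4096: assumed page size
    return std::aligned_alloc(page, page);
}
```

Note that MSVC does not provide std::aligned_alloc; on Windows the closest equivalent is _aligned_malloc / _aligned_free.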
On the other hand, using the system's native APIs (Windows / POSIX / FreeRTOS) to allocate pages would require a separate memory pool, i.e. a second heap optimized for page-aligned allocations. But having two heaps that know nothing about each other could waste memory too, as both will keep their own pool of preallocated pages. Or do most standard library implementations return pages to the operating system as soon as clients no longer use them?
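Under POSIX, going straight to the OS would look roughly like this (a sketch with made-up helper names; on Windows the equivalent calls would be VirtualAlloc / VirtualFree):

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// POSIX sketch: allocate `n` whole pages directly from the OS,
// bypassing the C heap and its metadata entirely.
inline void* map_pages(std::size_t n) {
    std::size_t bytes = n * static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

inline void unmap_pages(void* p, std::size_t n) {
    munmap(p, n * static_cast<std::size_t>(sysconf(_SC_PAGESIZE)));
}
```

mmap'd memory is page-aligned by construction and carries no in-band bookkeeping; the kernel tracks the mapping itself, which is exactly why it needs its own pooling layer on top.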
For cache hit/miss optimization, page-aligned blocks will not help you much. At that level, you should look at cache-line alignment instead; on x86 CPUs a cache line is 64 bytes.
Page alignment may help you optimize the amount of RAM allocated from the operating system. That depends on the size of the blocks your allocator claims from new or malloc at a time. Losses in that area are usually fairly small, or nil. If you use OS calls, the wasted memory per allocation is the unused tail of the last page: waste = (page_size - block_size mod page_size) mod page_size.
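As a quick check, that waste can be computed like this (4096 is assumed as the page size):

```cpp
#include <cstddef>

// Unused tail of the last page when `block` bytes are requested from an OS
// that hands out whole pages of `page` bytes each.
constexpr std::size_t page_waste(std::size_t block, std::size_t page) {
    return (page - block % page) % page;
}

static_assert(page_waste(4096, 4096) == 0);    // page-sized blocks waste nothing
static_assert(page_waste(5000, 4096) == 3192); // 5000 bytes occupy two pages
```

With your default block size equal to the page size, this waste is exactly zero, which is the best case.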
For best cache optimization, align individual elements, including the list-related pointers, on a cache-line boundary: again, 64 bytes on x86. This is where you will gain the most performance. Beware! It is also where most of the memory waste usually comes from.
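In C++ that per-element alignment can be expressed with alignas (a sketch; Node and its layout are made up for illustration):

```cpp
// Pad each node (element plus link pointers) to one 64-byte cache line,
// so touching one element never drags a neighbour's line into the cache.
struct alignas(64) Node {
    Node* next;
    Node* prev;
    long  payload;   // 24 bytes of content on a 64-bit target, padded up to 64
};

static_assert(sizeof(Node) == 64, "one node per cache line");
static_assert(alignof(Node) == 64, "nodes start on a line boundary");
```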
The waste of memory in that area is the padding added to each element: every element is rounded up to the next multiple of the cache-line size, so for example a 24-byte node aligned to 64 bytes wastes 40 bytes.