Usually, lists are implemented either as linked lists, which are slow to traverse, or as array lists, which are slow when inserting elements.
I was wondering if it would be possible to use the processor's MMU to implement lists more efficiently, by remapping instead of copying memory whenever an element is inserted or deleted. This would mean that both indexing and insertion/deletion anywhere in the array would run in O(1) time, better than any other list implementation.
My questions are:
- Are programs actually able to control their own virtual memory, or would changes to the OS need to be made?
- Is there a limit to the number of page table entries per process? Does memory access get slower with more entries?
- Is changing page table entries so slow that this would be more efficient only for very large lists?
- Are there any existing implementations of this type of list? If yes, what stops people from using them more?
First some specific answers to your questions:

- Yes (to your first question): programs can control their own virtual memory without any changes to the OS, via `mmap` on UNIX-like OSes and similar APIs on Windows. Linux in particular has recently added several methods to allow advanced manipulation of user-visible buffers from the kernel without copying — but one of the interesting ones is no longer for this world (at least performance-wise).
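To illustrate the point, here is a minimal sketch of my own (assuming Linux or another POSIX system; names are mine) of an ordinary unprivileged process managing its own virtual address space: it reserves a large range up front with `mmap` and `PROT_NONE`, then commits pages on demand with `mprotect`, with no OS modifications required.

```cpp
// Sketch: an unprivileged process managing its own address space.
// Assumes Linux/POSIX; compile with g++ and run as a normal user.
#include <sys/mman.h>
#include <cstdio>

int main() {
    const size_t kReserve = 1ull << 30;  // claim 1 GiB of address space

    // PROT_NONE reserves the addresses without usable backing memory:
    // any access to this range faults until we change the protection.
    void* base = mmap(nullptr, kReserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // "Commit" the first 16 pages: a pure page-table manipulation.
    if (mprotect(base, 16 * 4096, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }
    static_cast<char*>(base)[0] = 42;  // now a legal access
    printf("reserved %zu bytes at %p, first pages writable\n", kReserve, base);
    return munmap(base, kReserve);
}
```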
## Discussion

The core problem with using MMU tricks is that (a) you can only "zero copy" whole pages, which pretty much means at a 4K granularity or larger, along with similarly restrictive alignment, and (b) even if `mmap`-type calls are fast, so are efficient memory copy routines!

Let's look first at (a). If I understand you correctly, you want to accelerate insertion into something like `std::vector` by using MMU tricks to shift the elements that need to be moved when an insert happens. The problem is that you can only shift by 0, 4096, 8192, etc. bytes on typical systems! So if you insert one 4-byte `int` into a `vector<int>`, how does that help? You could perhaps "crack apart" the underlying storage of the `vector` into two parts at the insertion point and track that, with the hope of merging them again at some point (e.g., if you insert 4096 bytes' worth of stuff) - but you end up with a different data structure, with different properties, and the MMU tricks aren't really fundamental here anyway.
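To make concrete just how different that data structure becomes, here is a hypothetical toy sketch (the `CrackedVector` name and layout are mine, not an existing library): indexing now costs a branch and a size lookup on every access, and only the crack point itself enjoys cheap insertion.

```cpp
// Hypothetical "cracked" storage: after one insert, the contents are
// tracked as two spans rather than physically shifting the tail.
#include <cstddef>
#include <vector>

struct CrackedVector {
    std::vector<int> head;  // elements before the crack
    std::vector<int> tail;  // elements at and after the crack

    // Cheap insertion, but only at the crack point itself; inserting
    // anywhere else would mean re-cracking or merging first.
    void insert_at_crack(int value) { head.push_back(value); }

    // Indexing now needs a branch and a size lookup on every access.
    int& operator[](std::size_t i) {
        return i < head.size() ? head[i] : tail[i - head.size()];
    }
    std::size_t size() const { return head.size() + tail.size(); }
};
```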
That brings us to (b). Take for granted that on my machine a page can be remapped in ~120 ns (via `mmap`). That seems fast (it's not bad when you consider it involves taking various kernel locks, messing with the page tables, adding a VMA, etc.) — but copying memory is also very fast. On this same box, I can copy in-memory (i.e., to/from RAM, not hitting in any cache level) at about 12 GB/s, while copies that hit in L1 or L2 proceed at perhaps 80-100 GB/s. So copying a 4K page takes somewhere between 41 ns (cached) and 340 ns (uncached, to RAM). So messing with page tables isn't a clear win even where it is possible, especially in the cached case (and the cached case is probably the dominant one, averaging over most workloads).
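To get a feel for both numbers on your own machine, here is a rough micro-benchmark sketch of mine (Linux-only, since it uses `mremap`, whereas the figure above was measured via `mmap`; treat the output as ballpark, not a rigorous benchmark):

```cpp
// Rough micro-benchmark sketch: copying a 4K page vs. remapping it.
// Linux-only (mremap); numbers are ballpark, not a rigorous benchmark.
#include <sys/mman.h>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <utility>

int main() {
    constexpr size_t kPage = 4096;
    constexpr int kIters = 100000;

    // (1) Copy cost: memcpy one warm page into another.
    void* src = mmap(nullptr, kPage, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void* dst = mmap(nullptr, kPage, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED || dst == MAP_FAILED) { perror("mmap"); return 1; }
    memset(src, 1, kPage);
    memset(dst, 2, kPage);  // fault both pages in before timing

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        static_cast<char*>(src)[0] = static_cast<char>(i);  // defeat the optimizer
        memcpy(dst, src, kPage);
    }
    auto t1 = std::chrono::steady_clock::now();
    printf("memcpy 4K page (cached): %.0f ns\n",
           std::chrono::duration<double, std::nano>(t1 - t0).count() / kIters);

    // (2) Remap cost: ping-pong one page between two addresses. With
    // MREMAP_FIXED the kernel moves the page-table entries, not the bytes.
    void* a = mmap(nullptr, kPage, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void* b = mmap(nullptr, kPage, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  // just a target address
    if (a == MAP_FAILED || b == MAP_FAILED) { perror("mmap"); return 1; }
    memset(a, 3, kPage);

    t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        if (mremap(a, kPage, kPage, MREMAP_MAYMOVE | MREMAP_FIXED, b) == MAP_FAILED) {
            perror("mremap");
            return 1;
        }
        std::swap(a, b);  // the mapping now lives at the other address
    }
    t1 = std::chrono::steady_clock::now();
    printf("mremap 4K page:          %.0f ns\n",
           std::chrono::duration<double, std::nano>(t1 - t0).count() / kIters);
    return 0;
}
```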
So these types of tricks may make sense, but only in specific scenarios, such as the following:

## Realloc

The most common and useful example of MMU tricks is probably `realloc`. On Linux and Windows (it seems?), `realloc` may be implemented by remapping and extending the mapped pages in memory (aka MMU tricks), which both avoids the physical copy and the need to temporarily have both the old allocated region and the new region "live" at once (which may be hard if their sum approaches the size of physical memory).

In particular, recent versions of Linux will likely use `mremap` to `realloc` heap regions that were `mmap`ed in the first place (by default, this occurs for allocation requests larger than 128K, but it may also occur when the space available to `sbrk` is exhausted). A sketch of the underlying mechanism follows below.
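Here is a small sketch of my own (Linux-specific, since `mremap` is a Linux extension) showing that mechanism directly: growing an `mmap`ed region preserves its contents without any byte-by-byte copy, even when the kernel relocates the mapping.

```cpp
// Sketch of the mechanism behind large-block realloc: grow an mmap'ed
// region with mremap. The contents survive without a byte-by-byte copy,
// even if the mapping moves. Linux-specific (mremap is a Linux extension).
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t kOld = 1 << 20;  // 1 MiB
    const size_t kNew = 1 << 21;  // 2 MiB

    void* p = mmap(nullptr, kOld, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    strcpy(static_cast<char*>(p), "payload survives the remap");

    // Grow the region: the kernel rewires page tables and may relocate
    // the mapping, but the physical pages themselves are reused.
    void* q = mremap(p, kOld, kNew, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) { perror("mremap"); return 1; }
    printf("\"%s\" at %p (moved: %s)\n",
           static_cast<char*>(q), q, q == p ? "no" : "yes");
    return munmap(q, kNew);
}
```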