Using MMU to implement resizable arrays

1.3k views Asked by At

Usually, lists are implemented either as linked lists, which are slow to traverse, or as array lists, which are slow when inserting elements.

I was wondering if it would be possible to use the processor's MMU to implement lists more efficiently, by remapping instead of copying memory whenever an element is inserted or deleted. This would mean that both indexing and insertion/deletion anywhere in the array to have an speed of O(1), better than any other list implementation.

My questions are:

  • Are programs actually able to control their own virtual memory, or would changes to the OS need to be made?
  • Is there a limit to the number of page table entries per process? Does memory access get slower with more entries?
  • Is changing page table entries so slow that this would be more efficient only for very large lists?
  • Are there any existing implementations of this type of list? If yes, what stops people from using them more?
1

There are 1 answers

4
BeeOnRope On BEST ANSWER

First some specific answers to your questions:

  • Yes, on many OSes programs have significant control over their virtual memory, e.g., mmap on UNIX-like OSes and similar APIs on Windows. Linux in particular has recently added several methods to allow advanced manipulation of user-visible buffers from the kernel without copying — but one of the interesting ones is no longer for this world (at least performance-wise).
  • Usually, there are no specific limits to the number of page table entries per-process. Of course, you might run into other limits, such as per-process memory limits, physical memory limits and so on. Access to memory generally does not get slower with more entries. Of course, access to more pages in total may imply slower access (e.g., because you exceed the size of the TLB) - but that's not directly a function of more pages. The page table entries themselves are just sitting in RAM so you can have millions of them without an issue.
  • Changing page table entries is reasonably fast on moderns OSes. For example, on my laptop, changing page table entries seems to take around ~120 ns per page (plus some fixed overhead for a system call).
  • Yes, you can find examples out there, but they generally target fairly narrow scenarios. In fact, you can see that the mach's libc tries to use use MMU tricks for no less an important routine than memcpy!

Discussion

The core problem with using MMU tricks is that (a) you can only "zero copy" whole pages, which pretty much means at a 4K granularity or larger, along with similarly restrictive alignment (b) even if mmap-type calls are fast, so are efficient memory copy routines!

Let's look first at (a). If I understand you correctly, you want to accelerate insertion into something like std::vector by using MMU tricks to shift the elements that need to be moved when an insert happens. The problem is you can only shift by 0, 4096, 8192, etc bytes on typical systems! So if you insert one 4-byte int into a vector<int> how does that help? You could perhaps "crack apart" the underlying storage of the vector into two parts at the insertion point and track that with the hope of merging them again some point (e.g, if you insert 4096 bytes worth of stuff) - but you end up with a different data structure, with different properties, and the MMU tricks aren't really fundamental here anyway.

That brings us to (b). Take for granted that on my machine a page can be remapped in ~120 ns (via mmap). That seems fast (it's not bad when you consider it involves taking various kernel locks, messing with the page tables, adding a VMA, etc) — but copying memory is also very fast. On this same box, I can copy in-memory (i.e., to/from RAM of any cache level) at about 12 GB/s, while copies in L1 or L2 proceed at perhaps 80-100 GB/s. So copying a 4K page takes somewhere between 41 ns (cached) and 340 ns (uncached, to RAM). So messing with page tables isn't a clear win even if it would be possible, especially in the cached case (and the cached case is probably the dominant one, averaging over most workloads).

So these types of tricks may make sense, but only in specific scenarios, such as the following:

  • You have some way to handle the fact that page mapping can only move/copy/shift things around in chunks of the page granularity, e.g., because your structures happen to be a multiple of page granularity, or you use batch inserts which are a multiple of page granularity, etc.
  • You have some way to map pages around more quickly: e.g., using 2MB pages rather than 4K pages, or by writing some kernel-side code that accelerates your use case.
  • You want to use even fancier tricks than simply moving memory, e.g,. making the same data appear in two places at once, implementing COW structures, or something like that.

Realloc

The most common and useful example of MMU tricks is probably realloc. On Linux and Windows (it seems?), realloc may be implemented by remapping and extending the mapped pages in memory (aka MMU tricks), which both avoids the physical copy and the need to temporarily have both the old allocated region and the new region "live" at once (which may be hard if their sum approaches the size of physical memory).

In particular, recent version of Linux will likely use mremap to realloc heap regions that were mmaped in the first place (by default, this occurs for allocation requests larger than 128K, but it may also occur when the space available to sbrk is exhausted).