Are implementations allowed to have new and/or malloc allocate far more memory than requested, so it can avoid overhead for later small allocations?

In my experience, programmers rarely allocate single small objects on the heap because of how costly it is; instead they write small-object allocators or simply create large arrays where possible. So an implementation doing this on the programmer's behalf feels like it should be an easy ergonomics/performance feature to provide.

Do compilers already do this, or does the standard or another issue prevent this?

3 Answers

1
DevSolar (Best Solution)

Most operating systems [citation needed] manage memory in chunks, usually called "pages". This is an artifact of the underlying hardware.

It is long-established practice that a library's malloc (and, by extension, new) would satisfy a user's request for memory by allocating one or more "pages" of memory from the operating system, and then parcel out that memory to the user.(*) Subsequent requests that could be satisfied without having to request more pages from the OS would be satisfied that way.

The gory details vary from system to system and from allocator to allocator. They usually attempt to strike a balance between speed (of allocations / deallocations) and efficiency (of minimal memory usage).

Also traditional is that applications with specific memory requirements (speed, efficiency) malloc one big chunk of memory in one go, and then manage that memory on their own. This adds complexity and more chances for bugs (e.g. memory allocated through the application's own manager but released with free(), or memory obtained with malloc() but released through the application's manager, or an error in the application's memory management itself), but allows the application to control the algorithms used.

C++ makes this easier through allocators, which effectively "outsource" a container's memory management to a different class, allowing the programmer to employ customized, reusable memory-management classes.

So:

  1. Yes this is possible.
  2. No, nothing in the standard forbids it.
  3. Yes, this is usually already done "under the hood".

The corollary to 3. is, of course, the old truism measure, optimize, measure. (Don't try to optimize away a problem you do not have, and if you do, make sure your optimization actually improved things instead of making them worse.)


(*) The hardware that introduces the concept of "pages" is the same hardware that protects separate applications' memory spaces from each other -- the Memory Management Unit. To prevent applications from subverting that protection, only the operating system is allowed to modify the allocation of memory. Terminology and architecture differ, but there is usually some kind of "supervisor mode" that is only available to the OS kernel, so an application has to trap into the kernel, which then does the allocation and returns control to the application.

This is called a "context switch", and in terms of CPU time, it's among the most expensive operations there are. So from the very beginning, library implementors looked for ways to minimize context switches. That's why malloc and new are usually rather well-optimized for general usage already.

1
P.W

Do compilers already do this, or does the standard or another issue prevent this?

The standard does not prevent the allocation functions from allocating more than what is requested. It only states that successful allocation means that memory allocated will be at least as large as the size requested.

Quoting C++ standard (n4659),

6.7.4.1 Allocation functions [basic.stc.dynamic.allocation]
...
2. The allocation function attempts to allocate the requested amount of storage. If it is successful, it shall return the address of the start of a block of storage whose length in bytes shall be at least as large as the requested size.

0
PSkocik

Vendors (in this case libc implementors, because malloc/new is usually implemented by your standard library, not by your compiler vendor) do various things, and you can usually expect some kind of small-object optimization, and caching of the last allocation size, at least up to a certain size.

For example, here are my timings for a malloc-free pair with glibc on Linux x86_64, at power-of-two sizes:

15.61 ns(R)     15.56 ns(U)     0.04 ns(S)      (2050002iters; 32 ms)   (malloc_free_)(&($_sz){1ULL<<3})
16.29 ns(R)     16.27 ns(U)     0.01 ns(S)      (1964070iters; 32 ms)   (malloc_free_)(&($_sz){1ULL<<4})
16.31 ns(R)     16.29 ns(U)     0.00 ns(S)      (1962244iters; 32 ms)   (malloc_free_)(&($_sz){1ULL<<5})
18.17 ns(R)     18.15 ns(U)     0.00 ns(S)      (1761118iters; 32 ms)   (malloc_free_)(&($_sz){1ULL<<6})
16.42 ns(R)     16.41 ns(U)     0.00 ns(S)      (1949061iters; 32 ms)   (malloc_free_)(&($_sz){1ULL<<7})
15.97 ns(R)     15.96 ns(U)     0.00 ns(S)      (2003412iters; 32 ms)   (malloc_free_)(&($_sz){1ULL<<8})
16.14 ns(R)     16.14 ns(U)     0.00 ns(S)      (1982292iters; 32 ms)   (malloc_free_)(&($_sz){1ULL<<9})
16.80 ns(R)     16.79 ns(U)     0.00 ns(S)      (1905223iters; 32 ms)   (malloc_free_)(&($_sz){1ULL<<10})
42.19 ns(R)     42.17 ns(U)     0.00 ns(S)      (758535iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<11})
42.90 ns(R)     42.88 ns(U)     0.00 ns(S)      (746074iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<12})
42.85 ns(R)     42.84 ns(U)     0.00 ns(S)      (746926iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<13})
42.32 ns(R)     42.18 ns(U)     0.00 ns(S)      (756378iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<14})
42.59 ns(R)     42.55 ns(U)     0.00 ns(S)      (751520iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<15})
41.98 ns(R)     41.97 ns(U)     0.00 ns(S)      (762451iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<16})
42.74 ns(R)     42.72 ns(U)     0.00 ns(S)      (748953iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<17})
42.32 ns(R)     42.31 ns(U)     0.00 ns(S)      (756267iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<18})
41.99 ns(R)     41.98 ns(U)     0.00 ns(S)      (762255iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<19})
42.31 ns(R)     42.30 ns(U)     0.00 ns(S)      (756442iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<20})
51.03 ns(R)     50.17 ns(U)     0.00 ns(S)      (627259iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<21})
44.93 ns(R)     44.91 ns(U)     0.00 ns(S)      (712362iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<23})
6674.43 ns(R)   677.29 ns(U)    5813.42 ns(S)   (4797iters; 32 ms)      (malloc_free_)(&($_sz){1ULL<<25})

and the same thing with musl-libc:

64.09 ns(R)     64.07 ns(U)     0.00 ns(S)      (499411iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<3})
61.49 ns(R)     61.47 ns(U)     0.00 ns(S)      (520542iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<4})
62.67 ns(R)     62.64 ns(U)     0.00 ns(S)      (510794iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<5})
61.53 ns(R)     61.52 ns(U)     0.00 ns(S)      (520150iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<6})
61.49 ns(R)     61.47 ns(U)     0.00 ns(S)      (520514iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<7})
62.78 ns(R)     62.66 ns(U)     0.00 ns(S)      (509871iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<8})
61.38 ns(R)     61.36 ns(U)     0.00 ns(S)      (521468iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<9})
79.97 ns(R)     79.94 ns(U)     0.00 ns(S)      (400374iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<10})
68.77 ns(R)     68.72 ns(U)     0.00 ns(S)      (465530iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<11})
68.21 ns(R)     68.18 ns(U)     0.00 ns(S)      (469345iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<12})
76.55 ns(R)     76.39 ns(U)     0.00 ns(S)      (418194iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<13})
74.67 ns(R)     74.63 ns(U)     0.00 ns(S)      (428704iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<14})
63.95 ns(R)     63.94 ns(U)     0.00 ns(S)      (500507iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<15})
66.33 ns(R)     66.31 ns(U)     0.00 ns(S)      (482528iters; 32 ms)    (malloc_free_)(&($_sz){1ULL<<16})
2629.39 ns(R)   653.03 ns(U)    1975.27 ns(S)   (12174iters; 32 ms)     (malloc_free_)(&($_sz){1ULL<<17})
5776.54 ns(R)   0.00 ns(U)      5474.55 ns(S)   (5542iters; 32 ms)      (malloc_free_)(&($_sz){1ULL<<18})
6198.86 ns(R)   708.22 ns(U)    4847.24 ns(S)   (5165iters; 32 ms)      (malloc_free_)(&($_sz){1ULL<<19})
6173.01 ns(R)   379.67 ns(U)    5279.59 ns(S)   (5186iters; 32 ms)      (malloc_free_)(&($_sz){1ULL<<20})
7029.97 ns(R)   0.00 ns(U)      6224.80 ns(S)   (4555iters; 32 ms)      (malloc_free_)(&($_sz){1ULL<<21})
7050.58 ns(R)   1589.07 ns(U)   4757.32 ns(S)   (4541iters; 32 ms)      (malloc_free_)(&($_sz){1ULL<<23})
7584.38 ns(R)   0.00 ns(U)      6807.15 ns(S)   (4221iters; 32 ms)      (malloc_free_)(&($_sz){1ULL<<25})

As you can see, small sizes are optimized to a certain degree, but differently on different implementations (multithreaded malloc/free varies quite a bit too).

If you really care about some specific small-sized allocations, it's probably best to use your own free-list allocator and get the best speed / memory usage regardless of your libc backend.