C++ memory allocation mechanism performance comparison (tcmalloc vs. jemalloc)


I have an application which allocates lots of memory and I am considering using a better memory allocation mechanism than malloc.

My main options are jemalloc and tcmalloc. Are there any benefits to using one over the other?

There is a good comparison between some mechanisms (including the author's proprietary mechanism -- lockless) in http://locklessinc.com/benchmarks.shtml and it mentions some pros and cons of each of them.

Given that both mechanisms are actively developed and constantly improving, does anyone have any insight or experience regarding the relative performance of the two?

6

There are 6 answers

0
Matthieu M. On BEST ANSWER

If I remember correctly, the main difference was with multi-threaded projects.

Both libraries try to reduce contention on memory acquisition by having threads take memory from different caches, but they use different strategies:

  • jemalloc (used by Facebook) maintains a cache per thread
  • tcmalloc (from Google) maintains a pool of caches, and threads develop a "natural" affinity for a cache, but may change

This led, once again if I remember correctly, to an important difference in terms of thread management.

  • jemalloc is faster if threads are static, for example using pools
  • tcmalloc is faster when threads are frequently created and destroyed

There is also the problem that, since jemalloc spins up new caches to accommodate new thread ids, a sudden spike in the number of threads will leave you with (mostly) empty caches in the subsequent calm phase.

As a result, I would recommend tcmalloc in the general case, and reserve jemalloc for very specific usages (low variation on the number of threads during the lifetime of the application).

1
Alexey On

I have recently considered tcmalloc for a project at work. This is what I observed:

  • Greatly improved performance for heavy usage of malloc in a multithreaded setting. I used it with a tool at work and performance improved almost twofold. The reason is that a few threads in this tool allocate small objects in a critical loop. With glibc, performance suffers because of, I think, lock contention between malloc/free calls in different threads.

  • Unfortunately, tcmalloc increases the memory footprint. The tool I mentioned above would consume two or three times more memory (as measured by the maximum resident set size). The increased footprint is a no-go for us, since we are actually looking for ways to reduce it.

In the end I decided not to use tcmalloc and to optimize the application code directly instead: this meant removing the allocations from the inner loops to avoid the malloc/free lock contention. (For the curious, I used a form of compression rather than memory pools.)
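To illustrate what "removing the allocations from the inner loops" can look like in general, here is a minimal sketch. It is not the compression approach mentioned above (that isn't described in enough detail to reproduce); the function names and the workload are hypothetical and only show a scratch buffer being hoisted out of a hot loop.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical workload: fill a scratch buffer with 0..n-1 and sum it.
// This version allocates and frees the buffer on every call.
long sum_with_fresh_buffer(std::size_t n) {
    std::vector<long> scratch(n);
    std::iota(scratch.begin(), scratch.end(), 0L);
    return std::accumulate(scratch.begin(), scratch.end(), 0L);
}

// Same work, but the caller owns the buffer. resize() only allocates
// when n exceeds the buffer's current capacity.
long sum_with_reused_buffer(std::size_t n, std::vector<long>& scratch) {
    scratch.resize(n);
    std::iota(scratch.begin(), scratch.end(), 0L);
    return std::accumulate(scratch.begin(), scratch.end(), 0L);
}

// The hot loop: one allocation up front, none inside the loop.
long run_hot_loop(std::size_t iterations, std::size_t n) {
    std::vector<long> scratch;
    scratch.reserve(n);
    const std::size_t capacity = scratch.capacity();
    long total = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        total += sum_with_reused_buffer(n, scratch);
    }
    // Sanity check: the buffer was never reallocated inside the loop.
    assert(scratch.capacity() == capacity);
    return total;
}
```

With the buffer reused, the loop no longer calls into the allocator at all, so there is nothing left for threads to contend on in that path.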

The lesson for you would be that you should carefully measure your application with typical workloads. If you can afford the additional memory usage, tcmalloc could be great for you. If not, tcmalloc is still useful to see what you would gain by avoiding the frequent calls to memory allocation across threads.
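As a starting point for that kind of measurement, here is a rough sketch of a synthetic harness in which several threads allocate and free small blocks in a tight loop. The function names, thread count, and block size are assumptions for illustration; a synthetic loop like this is no substitute for running your real workload.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Allocate and free `iterations` blocks of `size` bytes.
// Returns how many allocations succeeded.
std::size_t alloc_free_loop(std::size_t iterations, std::size_t size) {
    if (size == 0) size = 1;  // avoid touching a zero-sized block
    std::size_t succeeded = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        void* p = std::malloc(size);
        if (p != nullptr) {
            // Volatile write so the compiler cannot elide the malloc/free pair.
            static_cast<volatile char*>(p)[0] = 1;
            ++succeeded;
        }
        std::free(p);
    }
    return succeeded;
}

// Run the loop on `threads` threads at once and return the wall-clock
// time in milliseconds.
double time_alloc_loop_ms(unsigned threads, std::size_t iterations, std::size_t size) {
    assert(threads > 0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([=] { alloc_free_loop(iterations, size); });
    }
    for (auto& w : workers) {
        w.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Because the code only calls malloc and free, the same binary can be run under different allocators on Linux by pointing LD_PRELOAD at the tcmalloc or jemalloc shared library (the library path varies by system). Running it under GNU time with the -v flag additionally reports the maximum resident set size, so speed and footprint can be compared together.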

2
SunfiShie On
1
Martin On

Your post does not mention threading, but before considering mixing C and C++ allocation methods, I would investigate the concept of memory pools. Boost has a good one.
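To make the memory pool idea concrete, here is a minimal hand-written sketch of a fixed-size block pool. It is not Boost's implementation or API; the class name FixedPool and its interface are hypothetical, and it is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical fixed-size pool: one up-front allocation is carved into
// equally sized blocks, which are recycled through a free list.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(new char[block_size_ * block_count]) {
        free_list_.reserve(block_count);
        for (std::size_t i = 0; i < block_count; ++i) {
            free_list_.push_back(storage_.get() + i * block_size_);
        }
    }

    // Returns a block, or nullptr when the pool is exhausted.
    void* allocate() {
        if (free_list_.empty()) return nullptr;
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }

    // Returns a block to the pool; p must have come from allocate().
    void deallocate(void* p) {
        assert(p != nullptr);
        free_list_.push_back(static_cast<char*>(p));
    }

    std::size_t available() const { return free_list_.size(); }

private:
    // Round the block size up so every block stays suitably aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n == 0) n = 1;
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::unique_ptr<char[]> storage_;
    std::vector<char*> free_list_;
};
```

Allocation and deallocation here are a vector pop and push, with no call into malloc after construction. The trade-off is that the pool serves only one block size and has a fixed capacity.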

2
Basile Starynkevitch On

You could also consider using the Boehm conservative garbage collector. Basically, you replace every malloc in your source code with GC_malloc (etc.), and you don't bother calling free. Boehm's GC doesn't allocate memory more quickly than malloc (it is about the same, or can be 30% slower), but it has the advantage of reclaiming unused memory automatically, which might improve your program (and certainly eases coding, since you no longer need to care about free). Boehm's GC can also be used as a C++ allocator.

If you really think that malloc is too slow (but you should benchmark; most malloc calls take less than a microsecond), and if you fully understand the allocation behavior of your program, you might replace some malloc calls with your own special-purpose allocator (which could, for instance, get memory from the kernel in big chunks using mmap and manage that memory itself). But I believe doing that is a pain. In C++ you have the allocator concept and std::allocator_traits, with most standard container templates accepting such an allocator (see also std::allocator), e.g. the optional second template argument to std::vector.
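As an illustration of the second template argument to std::vector, here is a minimal sketch of a custom allocator that forwards to malloc and counts how often it is called. CountingAllocator, allocation_count, and count_vector_allocations are hypothetical names, and the counter is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical global counter of successful allocate() calls.
inline std::size_t& allocation_count() {
    static std::size_t count = 0;
    return count;
}

// Minimal allocator: std::allocator_traits supplies everything not
// defined here. malloc's alignment is enough for types that are not
// over-aligned.
template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() noexcept = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        void* p = std::malloc(n * sizeof(T));
        if (p == nullptr && n != 0) throw std::bad_alloc();
        ++allocation_count();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// All instances are interchangeable, so they always compare equal.
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return false;
}

// Fill a vector with `elements` ints and report how many times the
// allocator was called, with or without reserving capacity first.
std::size_t count_vector_allocations(std::size_t elements, bool reserve_first) {
    const std::size_t before = allocation_count();
    std::vector<int, CountingAllocator<int>> v;
    if (reserve_first) v.reserve(elements);
    for (std::size_t i = 0; i < elements; ++i) {
        v.push_back(static_cast<int>(i));
    }
    assert(v.size() == elements);
    return allocation_count() - before;
}
```

A counter like this is a cheap way to check whether a container is really responsible for the allocation traffic before swapping in anything more elaborate. Calling reserve first typically brings the count down to a single allocation.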

As others suggested, if you believe malloc is a bottleneck, you could allocate data in chunks (or using arenas), or just in an array.
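A minimal sketch of the arena idea follows, assuming a single fixed-capacity buffer and objects that are all released together. The Arena class and its interface are hypothetical, and it is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical bump arena: allocation just advances an offset into one
// big buffer; individual blocks are never freed, only the whole arena.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : buffer_(new char[capacity]), capacity_(capacity), offset_(0) {}

    // Returns `size` bytes aligned to `align` (a power of two), or
    // nullptr when the arena does not have enough room left.
    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_.get());
        const std::uintptr_t current = base + offset_;
        const std::uintptr_t aligned =
            (current + align - 1) & ~(static_cast<std::uintptr_t>(align) - 1);
        const std::size_t start = static_cast<std::size_t>(aligned - base);
        if (start > capacity_ || size > capacity_ - start) return nullptr;
        offset_ = start + size;
        return buffer_.get() + start;
    }

    // Releases everything at once; previously returned pointers become invalid.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::unique_ptr<char[]> buffer_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

This fits data with a shared lifetime, such as everything built while handling one request: each allocation costs a few arithmetic operations, and freeing is a single reset.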

Sometimes, implementing a specialized copying garbage collector (for some of your data) could help. Consider perhaps MPS.

But don't forget that premature optimization is evil and please benchmark & profile your application to understand exactly where time is lost.

8
rogerdpack On

Be aware that, according to the 'nedmalloc' homepage, the allocators in modern operating systems are actually pretty fast now:

"Windows 7, Linux 3.x, FreeBSD 8, Mac OS X 10.6 all contain state-of-the-art allocators and no third party allocator is likely to significantly improve on them in real world results"

http://www.nedprod.com/programs/portable/nedmalloc

So you might be able to get away with just recommending your users upgrade or something like it :)