Important: Scroll down to the "final update" before you invest too much time here. Turns out the main lesson is to beware of the side effects of other tests in your unittest suite, and to always reproduce things in isolation before jumping to conclusions!
On the face of it, the following 64-bit code allocates (and touches) one million 4K pages using VirtualAlloc, a total of 4GByte:
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <algorithm>
#include <vector>
#include <boost/test/unit_test.hpp>

const size_t N = 4;  // Tests with this many gigabytes
const size_t pagesize4k = 4096;
const size_t npages = (N << 30) / pagesize4k;

BOOST_AUTO_TEST_CASE(test_VirtualAlloc) {
    std::vector<void*> pages(npages, nullptr);
    for (size_t i = 0; i < pages.size(); ++i) {
        pages[i] = VirtualAlloc(0, pagesize4k, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        // Touch the page so it enters the working set (guard against a failed alloc)
        if (pages[i]) *reinterpret_cast<char*>(pages[i]) = 1;
    }
    // Check all allocs succeeded
    BOOST_CHECK(std::find(pages.begin(), pages.end(), nullptr) == pages.end());
    // Free what we allocated (VirtualFree returns nonzero on success)
    bool trouble = false;
    for (size_t i = 0; i < pages.size(); ++i) {
        if (VirtualFree(pages[i], 0, MEM_RELEASE) == 0) trouble = true;
    }
    BOOST_CHECK(!trouble);
}
However, while executing, it grows the "Working Set" reported in Windows Task Manager (confirmed by the value "sticking" in the "Peak Working Set" column) from a baseline of ~200,000K (~200MByte) to over 6,000,000 or 7,000,000K. (Tested on 64-bit Windows 7, and also on ESX-virtualized 64-bit Server 2003 and Server 2008; unfortunately I didn't note which of the observed numbers occurred on which system.)
Another very similar test case in the same unittest executable does one million 4K mallocs (followed by frees), and that one only expands the working set by around the expected 4GByte when running.
I don't get it: does VirtualAlloc have some quite high per-alloc overhead? It's clearly a significant fraction of the page size if so; why is so much extra needed and what's it for? Or am I misunderstanding what the "Working Set" reported actually means? What's going on here?
Update: With reference to Hans' answer, I note that the following fails with an access violation on the second access, so whatever is going on isn't as simple as each allocation being rounded up to the 64K granularity:
char* const ptr = reinterpret_cast<char*>(
    VirtualAlloc(0, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
ptr[0] = 1;     // fine: within the committed 4K page
ptr[4096] = 1;  // access violation: the next page is not committed
Update: Now on an AWS/EC2 Windows 2008 R2 instance, with VisualStudioExpress2013 installed, I can't reproduce the problem with this minimal code (compiled 64-bit), which tops out with an apparently overhead-free peak working set of 4,335,816K, which is the sort of number I'd expected to see originally. So either there is something different about the other machines I'm running on, or about the boost-test based exe used in the previous testing. Bizarro; to be continued...
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <vector>

int main(int, char**) {
    const size_t N = 4;
    const size_t pagesize4k = 4096;
    const size_t npages = (N << 30) / pagesize4k;
    std::vector<void*> pages(npages, 0);
    for (size_t i = 0; i < pages.size(); ++i) {
        pages[i] = VirtualAlloc(0, pagesize4k, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        *reinterpret_cast<char*>(pages[i]) = 1;
    }
    Sleep(5000);
    for (size_t i = 0; i < pages.size(); ++i) {
        VirtualFree(pages[i], 0, MEM_RELEASE);
    }
    return 0;
}
Final update: Apologies! I'd delete this question if I could, because it turns out the observed problems were entirely due to an immediately preceding unittest in the test suite which used TBB's "scalable allocator" to allocate/deallocate a couple of GByte of stuff. It seems the scalable allocator actually retains such allocations in its own pool rather than returning them to the system (see e.g here or here). This became obvious once I ran the tests individually with enough of a Sleep after them to observe their on-completion working set in Task Manager (whether anything can be done about the TBB behaviour might be an interesting question, but as-is the question here is a red herring).