I was trying out one of the Programming Pearls problems:
Given a file containing at most ten million 7-digit integers with no duplicates, what is an efficient way to print these numbers in ascending order using just 1.5 MB of RAM, reading the data only once? What are the consequences of only having 1 MB of RAM and no other storage? How would your answer change if duplicates were permitted?
To create a test case, I generated 8999999 numbers and wrote them to a file. Then, for each line, I inserted its digits into a tree, finally creating a trie structure.
Sample Code:

```python
from sys import getsizeof

tree = dict()
xtree = dict()

f = open("data2.txt", "r")
cnt = 0
for number in f:
    cnt += 1
    currTree = tree
    xtree[number] = dict()
    for n in number.strip():
        if n not in currTree:
            currTree[n] = dict()
        currTree = currTree[n]
f.close()

print(cnt)
print(getsizeof(tree))
print(getsizeof(xtree))
print(tree)
```
The sample file data2.txt has 20 records. The tree generated is:
Now the question is: when I measure the memory size of the tree that is built, at 20 lines it shows a memory footprint of 240 bytes, at 100 lines the size of `tree` becomes 368 bytes, and at 8999999 lines it still gives 368 bytes.

I also built an auxiliary map named `xtree`, which just stores each number as a key. The sizes reported for both `xtree` and `tree` are in bytes.

Can anyone please explain how this is so?
Your `tree` is just a dict with up to 10 key-value pairs. In the bigger tree, there aren't any more key-value pairs. There are more values inside the values inside the … inside the key-value pairs, but there are still only 10 key-value pairs in the dict. And a dict with around 10 key-value pairs taking 368 bytes is about what you should expect.¹

As the docs for `getsizeof` say: …
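You can see that `getsizeof` is shallow with a quick experiment (exact byte counts vary by CPython version, but the equality holds):

```python
from sys import getsizeof

# Two dicts with the same 10 keys: one holds empty values, the other
# holds deeply nested dicts. getsizeof reports only the top-level
# hash table, so the two sizes are identical.
shallow = {str(d): {} for d in range(10)}
deep = {str(d): {"a": {"b": {"c": {}}}} for d in range(10)}

print(getsizeof(shallow) == getsizeof(deep))  # True
```

Nothing reachable through the values contributes to the reported size, which is why `tree` stays at 368 bytes no matter how many numbers feed it.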
Since you don't actually have a completely arbitrary data structure, but just a dict of dicts of dicts (and so on), you don't need the fully general recipe from those docs. And while you do have some shared references (e.g., if you read the number `1234567` when you already have an int with the same value in memory, Python will just reuse the same object), if you're trying to verify that you can fit into 1.5 MB, you really want a worst-case measurement, so you probably want to skip the check for already-seen values.

So you can write something simpler instead of using that recipe if you want. But the idea will be the same:
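A minimal sketch of such a function might look like this (the name `total_size` is my choice; it deliberately recounts shared objects to stay pessimistic):

```python
from sys import getsizeof

def total_size(obj):
    """Worst-case deep size in bytes of a nested dict structure.

    Deliberately does not track already-seen objects, so shared
    references are counted every time they appear (worst case).
    """
    size = getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += total_size(key) + total_size(value)
    return size
```

Calling `total_size(tree)` then counts every nested dict and every single-character key, instead of just the one top-level hash table that `getsizeof(tree)` reports.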
Your `xtree`, on the other hand, is a dict with 8999999 key-value pairs. Doing the same back-of-the-envelope calculation, I'd expect that to be a bit under 300 MB. Instead, it's a bit over 300 MB. Close enough.

And you're also storing the 8999999 7-digit integers on the heap. To take some nice round numbers, let's say there are 5M distinct integers that don't fall into the handful of small values pre-created and cached by CPython. Each of those integers is small enough to fit into one 30-bit digit, so they take 28 bytes apiece on 64-bit CPython. So that's another 140 MB not accounted for in `sys.getsizeof(xtree)` (but they are accounted for, in fact over-accounted, with the worst-case-measuring implementation given) if you call the recursive function above on either `tree` or `xtree`.
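That back-of-the-envelope arithmetic can be sketched as follows (the per-slot size and the ~1.5 slots-per-entry load factor are my rough assumptions, not measurements of CPython internals):

```python
# Rough estimate of xtree's top-level hash table.
entries = 8_999_999
per_slot = 24          # hash value + key pointer + value pointer, 8 bytes each
slots_per_entry = 1.5  # assumed: table kept well under a 1.0 load factor

table_bytes = entries * per_slot * slots_per_entry
print(round(table_bytes / 2**20))  # roughly 300 MiB for the table alone

# Plus the distinct integer objects themselves.
distinct_ints = 5_000_000
int_bytes = 28         # one 30-bit-digit int on 64-bit CPython
print(round(distinct_ints * int_bytes / 2**20))  # roughly 134 MiB of ints
```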
So, your total memory use between `tree`, `xtree`, and the actual integers is probably somewhere on the order of 750 MB, which doesn't quite fit the < 1.5 MB requirement.

1. Every Python object has some fixed header overhead, for things like the refcount, the pointer to the type, etc., plus type-specific things, like the length for most container types. Call that 64 bytes. A dict then has a hash table, which needs to be a bit bigger than 10 slots to keep the load well below 1.0; call it 13 slots. Each slot needs a hash value, a reference to the key, and a reference to the value, so that's 3 pointers, or 24 bytes. 64 + 13 * 24 = 376. So that back-of-the-envelope calculation is off by only 8 bytes…
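The footnote's arithmetic, spelled out (all three numbers are the answer's estimates, not exact CPython internals):

```python
header = 64    # estimated per-object overhead (refcount, type pointer, etc.)
slots = 13     # assumed hash table slots for ~10 entries
per_slot = 24  # hash value + key reference + value reference

print(header + slots * per_slot)  # 376, vs. the measured 368
```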