I need to partition a large set of 3D points in C++. The points are stored on disk as a binary float array, and the files are usually larger than 10 GB. I need to divide the set into subsets smaller than 1 GB each. Each subset should preserve spatial neighborhoods, because I need to run certain algorithms on the data (e.g., object detection).
I thought I could use a k-d tree, but how can I construct it efficiently if I can't load all the points into RAM? Maybe I could memory-map the file. Then I could store, in each node of the k-d tree, a pointer to every 3D point belonging to that segment. Would that work? Any other ideas?
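For the memory-mapping part, something like this is what I had in mind (a minimal POSIX sketch for Linux/macOS; `Point3` and `map_points` are just illustrative names, and real code would need proper error handling):

```cpp
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

struct Point3 { float x, y, z; };

// Map a binary float file read-only and return a typed view of its points.
// The OS pages data in on demand, so files larger than RAM are fine.
const Point3* map_points(const char* path, std::size_t& count) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    count = static_cast<std::size_t>(st.st_size) / sizeof(Point3);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : static_cast<const Point3*>(p);
}
```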
Thank you for your help. I hope you can understand the problem :D
You basically need an out-of-core algorithm for computing (approximate) medians. Given a large file, find its median and then partition it into two smaller files. A k-d tree is the result of applying this process recursively along varying dimensions (and when the smaller files start to fit in memory, you don't have to bother with the out-of-core algorithm any more).
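Once a chunk fits in memory, the in-core part of that recursion is straightforward; a sketch (the `Pt` type, `kd_partition`, and the `max_pts` threshold are my own names, using `std::nth_element` for the median split):

```cpp
#include <vector>
#include <algorithm>
#include <utility>
#include <cstddef>

struct Pt { float c[3]; };

// Recursively split pts[lo, hi) at the median along the current axis,
// cycling axes k-d-tree style, until each range fits the size budget.
// max_pts would be your 1 GB limit divided by sizeof(Pt).
void kd_partition(std::vector<Pt>& pts, std::size_t lo, std::size_t hi,
                  int axis, std::size_t max_pts,
                  std::vector<std::pair<std::size_t, std::size_t>>& out) {
    if (hi - lo <= max_pts) { out.emplace_back(lo, hi); return; }
    std::size_t mid = lo + (hi - lo) / 2;
    // Partial sort: everything below mid is <= the median on this axis.
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Pt& a, const Pt& b) {
                         return a.c[axis] < b.c[axis];
                     });
    kd_partition(pts, lo, mid, (axis + 1) % 3, max_pts, out);
    kd_partition(pts, mid, hi, (axis + 1) % 3, max_pts, out);
}
```

The out-of-core passes do the same split, just with the approximate median standing in for `std::nth_element`.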
To approximate the median of a large file, use reservoir sampling to grab a large but in-memory sample, then run an in-core median-finding algorithm. Alternatively, for an exact median, compute (say) the approximate 45th and 55th percentiles, then make another pass to extract the data points between them and compute the median exactly (unless the sample was unusually non-random, in which case retry). Details are in the Motwani and Raghavan book on randomized algorithms.
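The sampling step is simple to implement; here is a sketch of reservoir sampling (Algorithm R) over one coordinate plus an in-core median of the sample (function names and the sample size in the usage are illustrative):

```cpp
#include <vector>
#include <random>
#include <algorithm>
#include <cstddef>

// Keep a uniform random sample of k values from a stream of unknown
// (or huge) length, seeing each value exactly once.
template <class InputIt>
std::vector<float> reservoir_sample(InputIt first, InputIt last,
                                    std::size_t k, std::mt19937& rng) {
    std::vector<float> r;
    std::size_t seen = 0;  // 0-based index of the current element
    for (; first != last; ++first, ++seen) {
        if (r.size() < k) {
            r.push_back(*first);
        } else {
            // Replace a reservoir slot with probability k / (seen + 1).
            std::uniform_int_distribution<std::size_t> d(0, seen);
            std::size_t j = d(rng);
            if (j < k) r[j] = *first;
        }
    }
    return r;
}

// Median of the sample approximates the median of the full stream.
float approx_median(std::vector<float> sample) {
    std::nth_element(sample.begin(),
                     sample.begin() + sample.size() / 2, sample.end());
    return sample[sample.size() / 2];
}
```

In your case the "stream" would be the axis coordinate of each point read sequentially from the file, so one pass over 10 GB yields the split value.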