Can you get a uniformly-random key-value pair from an std::map in sub-linear time?

194 views Asked by At

Given an std::map<K,V> of size n, can you randomly get a key (and value) in the container, portably? Randomly here refers to key's position in the container. Let's say the distribution has to be only close to uniform.

My thinking was that standard library maps are often (always?) implemented as red-black trees, which are decently balanced. So locating an (almost-)uniformly random value could be done in log-time if you can access this underlying structure. It wouldn't be trivial but it could be done (idea: each node can estimate roughly how many child-nodes are left and how many are right of it).

In general one cannot access that information, so the obvious solution is to linearly iterate over it until you find the ith element for some uniformly-random chosen 0 <= i < n. However, as of C++17 there is also the rudimentary node interface that allows one to extract and re-insert nodes. But after some thought I don't see how this would be helpful.

Is there something I am overlooking or is this truly impossible in sub-linear time given the standardised interface of std::map?


Note: Obviously, if one wants to get more than, say, n / log n elements, shuffling a vector of map iterators is likely the best option.

2

There are 2 answers

0
n. m. could be an AI On

You know exactly how many children the root has. From this you can roughly estimate how many children either of its children have, and more generally, how many children any node has, depending on its height.

From this data it is possible to generate a path in the tree that leads to a roughly uniformly selected node.

In order to follow the path, we need further assumptions: that the search starts from the root; and that the decision to go left, go right, or return the current node is made after calling the comparison function once or twice. It is easy to test that these assumptions hold by checking them from within a custom comparator. So a custom comparator can be made that leads the search along the predetermined path.

(This uses ideas from a now-deleted answer)

0
bitmask On

Note: This answer uses ideas from two other answers under this question and builds on them. Unfortunately the original answer suggesting the key idea has been deleted by its poster. Thanks to both for your contributions.


I have not been able to get a O(1)-space / O(log n)-space algorithm to work, but it works rather decently with O(log n) space.

tl;dr The basic structure of the algorithm is such that a special type is supplied to std::map's find member function which guides the search down the internal tree by randomly returning true/false. But there are a number of caveats.

Issues with std::map

I discovered that, depending on the insert order, gcc's standard library implementation can produce quite unbalanced underlying trees, more than I intuited. I've seen ratios of 3:7 for the root node. If one inserts shuffled input the results are better, unsurprisingly.

Random walks

The idea to use an std::map<K,V,std::less<>> and produce a random walk down the tree worked, but in order to guarantee a fair chance of tree-nodes to be selected one has to select each parent node with half the chance as their respective children. However, when starting at the root, one does not know how deep the random walk will end up being, so there is no way to know what the chance should be to stop the descend and return the current node.

Traces / Witnesses

So, instead, we do a random walk down the tree by randomly returning true / false with probability 0.5 to 0.5 in our trojan object, but we record all O(log n) values encountered. This trace is then later evaluated by a wrapper function. Its job is to start at the end of the trace and give each value a 0.5 chance to be either returned or popped. By doing this, we ensure that values close to the root have an increasingly smaller (but not 0) chance of being selected. At the same time, all leaf-nodes have a 0.5 chance to be selected from the given trace.

Uniformity

For a perfectly balanced tree, each leaf node would have the same chance to be selected. And each tree node has half the chance to be selected in a given trace than their respective children, but will be present in double the number of traces, making their chance the same as their children nodes. By induction, any tree node up to and including the root will have the same probability as any leaf node to be selected.

Since the actual tree is not perfectly balanced, we don't get a perfect uniform distribution. But we get as close to it as I think we can in O(log n) time.

Code

Here is the gist

template <typename K>
class Sentinel {    
  private:
    std::mt19937& mt;
    std::uniform_int_distribution<std::uint32_t>& bdis;

  public:
    static std::vector<K const*> trace;
    // not thread safe to avoid reallocation for several calls
    // the `static` could of course be removed or made thread_local

    Sentinel(std::mt19937& mt, std::uniform_int_distribution<std::uint32_t>& bdis) : mt{mt}, bdis{bdis} {
      trace.clear();
    } 

    friend bool operator<(K const& k, Sentinel const& s) {
      trace.push_back(&k);
      return s.bdis(s.mt) == 1u; 
    }

    // analogous for s < k
};

template <typename K, typename V>
auto random(std::map<K,V,std::less<>> const& map, std::mt19937& mt) {
  if (map.empty()) return map.end();
  std::uniform_int_distribution<std::uint32_t> bdis(0,1);
  Sentinel<K> s(mt,bdis);

  map.find(s); // <- this creates the trace through the tree but will often return end().

  while (s.trace.size() > 1) {
    if (s.trace[s.trace.size()-1] == s.trace[s.trace.size()-2]) {
      // some implementations may check k < s and s < k,
      // so remove double entries
      s.trace.pop_back();
      continue;
    }   
    if (bdis(mt) == 1u) { // .5 chance to return current entry
      return map.find(*s.trace.back());
    }   
    s.trace.pop_back();
  }

  // the extra find is only necessary if one wants the iterator
  return map.find(*s.trace.back());
}

Usage:

std::mt19937 mt{42};
std::map<unsigned,unsigned,std::less<>> map;
// ... fill map ...
auto const randomKey = random(map,mt)->first;

Live example

https://godbolt.org/z/xbxbch3EY

Short Discussion

As one can see from the output in the live demo, a lot of entries are between factor 0.7 and 2 of what their probability should be for a perfectly balanced tree. I've seen outliers of up to 0.2 or 5, but they seem to be rare. At least at a glance without a thorough statistical analysis, that is. If uniformity is not paramount I think this is a reasonable way of doing random selection from a map. But it needs to be a map with std::less<> instead of std::less<K>.

Yep, this was fun :)