Find integers in range which are not in a set

130 views Asked by At

I need to find all IDs that are not in my table, between the min and max ID. Unfortunately MySQL has no simple way of generating sequences, so I thought it will be easier to do in the application.

IDs in the table are from an external source, and probably around 80% of the min-max range are empty spaces (around 40mln rows in 200mln range). As a primary key, they are already fetched sorted.

The problem is simply C = A\B, but the set A is somewhat special so I thought of a few ways to implement that:

Generate a std::set A of all integers between min and max ID and then remove from it the values in set/vector B. However I'm not sure if std::set will work well at this memory size and with this many removals...

Populate a std::vector with std::set_difference of vector A and vector B. However, this will almost double the memory requirements (because of B being several times smaller in cardinality than A).

As previous, but replace the vector A with std::ranges::views::iota?

Some custom algorithm, iterating on a range A, checking if not exists in B and pushing to C.

Any other ideas? What is the best choice (and data structure) for this problem? Ideally minimizing both memory and runtime, or at least with small tradeoffs. Should I pre-allocate (reserve) in any of these solutions?

4

There are 4 answers

6
Arkanil Paul On BEST ANSWER

C++ solution:

You can use the following function -

#include <vector>
#include <set>

std::vector<int> findMissingRowIDs(int minID, int maxID, const std::set<int>& B /* B is the set of RowIDs present in the table*/)
{
    std::vector<int> C; // Missing RowIDs
    std::set<int>::iterator it = B.begin();
    
    while(*it<minID) it++;
    
    int i=minID; // assuming min is inclusive
    
    while(i<=maxID) // assuming max is inclusive
    {
        if (it==B.end()) // for i > (last element of B)
        {
            C.push_back(i);
            i++;
        }
        else if (i<*it) // checks if i is not in B
        {
            C.push_back(i);
            i++;
        }
        else
        {
            it++;
            i++;
        }
    }
    
    return C;
}

The time complexity of the above function is O(maxID-minID) and space complexity is O(maxID-minID).

MySQL solution:

You can also do it in MySQL, like this -

SELECT @minID := 1, @maxID := 15;

WITH RECURSIVE N AS (SELECT @minID AS i UNION ALL SELECT i + 1 FROM N WHERE i <= @maxID)
  SELECT i FROM N WHERE i NOT IN(SELECT rowID FROM TABLE_NAME);
0
green-21 On

If ID is sorted integer, I think the best practice is calculate length between exist id and id.

result C is vector<pair<int, int>>. first is start ID value and second is lengh.

for example,

range [0, 15] and B = {1, 3, 9, 13}

C = 
{0, 1} -> 0
{2, 1} -> 2
{4, 5} -> 4, 5, 6, 7, 8
{10, 3} -> 10, 11, 12
{14, 2} -> 14, 15

empty space is {0,2,4,5,6,7,8,10,11,12,14,15}


#include <iostream>
#include <vector>

using namespace std;

void FindMissingValue(int minV, int maxV, vector<int> &B, vector<pair<int, int>> &C) {
    if(B[0] > minV)  C.push_back({minV, B[0]-minV});

    int i;
    for(i=0; i+1<B.size(); i++) {
        const int diff = B[i+1] - B[i];
        if(diff <= 1) continue;
        C.push_back({B[i]+1, diff-1});
    }

    if(B[i-1] < maxV) {
        C.push_back({B[i-1]+1, maxV-B[i-1]});
    }
}

void PrintMissingValue(const vector<pair<int,int>> &C) {
    for(auto [start, leng] : C) {
        for(int i=0; i<leng; i++) {
            cout << start + i << endl;
        }
    }
}
0
Costantino Grana On

Am I missing something? You ask:

I need to find all IDs that are not in my table, between the min and max ID. IDs in the table are from an external source [...] As a primary key, they are already fetched sorted.

So, if you consider:

B       = 3 8 10 11 13 19 20
beginID = 1  (included)
endID   = 30 (excluded)

You should get these lists of missing values:

[1,3)
[4,8)
[9,10)
[12,13)
[14,19)
[21,30)

It's just a metter of going through the list of IDs and getting these ranges:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <format>
#include <ranges>

auto missing(const std::vector<int>& B, int beginID, int endID)
{
    std::vector<std::pair<int, int>> res;
    int cur = beginID;
    for (const auto& x : B) {
        if (cur < x) {
            res.emplace_back(cur, x);
        }
        cur = x + 1;
    }
    if (cur < endID) {
        res.emplace_back(cur, endID);
    }
    return res; 
}

int main() 
{
    std::vector<int> v{ 3, 8, 10, 11, 13, 19, 20 };
    auto m = missing(v, 1, 30);
    for (const auto& x : m) {
        std::cout << std::format("[{},{})\n", x.first, x.second);
    }
    std::cout << "\nmissing: ";
    for (const auto& [b, e] : m) {
        for (int i : std::ranges::iota_view{b, e})
            std::cout << i << " ";
    }
    std::cout << "\n";
    return 0;
}

EDIT: I just realized that green-21 posted nearly the same solution! Only difference is that he uses (start, length) instead of (begin, end). I also assume that beginID is lower than the first element of B and of endID.

Solution on godbolt.org

0
herhor67 On

I found out there were other constraints (like I had to skip some gaps that were intentional) so I ended up modifying one of the codes from @NathanOliver's link, which returned me beginnings and ends of gaps, already with proper min and max. Then I just had to iterate over each gap and add its IDs to the total list.

As far as the original question goes, @ArkanilPaul's answer does it probably in the best way, avoiding any unnecessary allocations. Probably very similar to std::set_difference with std::ranges::views::iota as set A.

The other two answers however provide a more clear code, to first compute gaps between existing IDs and then use them to generate final list.