I'm currently working on a data scraping project where I need to retrieve large amounts of data from several statistical APIs. These APIs, particularly older and government ones, often do not support bulk downloads for entire datasets. Instead, they require specifying the variables and corresponding values you're interested in for each request. Additionally, many impose a limit on the number of values (or "rows") you can retrieve in a single request. This has led me to seek an optimal algorithm for constructing requests under these constraints.
Consider a dataset titled "Population by year, sex, and country", with variables and their possible values as follows:
Example dataset variables
{
"sex": ["total", "women", "men"],
"country of birth": ["Norway", "Finland", "Sweden", "Denmark"],
"year": ["2019", "2020", "2021", "2022", "2023"]
}
Each request must include at least one value per variable, and the number of rows returned is the product of the number of values selected for each variable.
Example request params
{
"sex": ["total", "women", "men"],
"country of birth": ["sweden", "denmark"],
"year": ["2023"]
}
This request would return 3 x 2 x 1 = 6 rows, covering each combination of the specified values:
sex, country of birth, year, value
total, Sweden, 2023, X
total, Denmark, 2023, X
women, Sweden, 2023, X
women, Denmark, 2023, X
men, Sweden, 2023, X
men, Denmark, 2023, X
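For reference, the row count of any request can be computed directly from the request dict. A minimal Python sketch, using the example request from above:

import math

request = {
    "sex": ["total", "women", "men"],
    "country of birth": ["Sweden", "Denmark"],
    "year": ["2023"],
}

# Rows returned = product of the number of selected values per variable
rows = math.prod(len(values) for values in request.values())
print(rows)  # 3 * 2 * 1 = 6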
For this example, the API limits the number of values/rows of data you can retrieve to 13 per request. row_limit = 13
It's clear that if we want to retrieve ALL the values from this table in as few requests as possible, the example above is not optimal, since it only retrieves 6 rows when the limit is 13. What is not clear is how to construct the API requests (a list of dicts) so that we don't perform more API calls than necessary.
The total number of combinations/values in a dataset, representing all possible rows we aim to retrieve, can be calculated by multiplying together the number of options available for each variable listed in the dictionary. In the case of our example variables:
total_combinations = 3 x 4 x 5 = 60
One thing we can note is that a lower bound on the minimum number of requests needed to retrieve all data combinations, given the per-request row limit, is the total number of rows (total_combinations) divided by the row limit (row_limit), rounded up. In our example:
min_requests ≥ ceil(total_combinations / row_limit) = ceil(60 / 13) = 5
This is only a lower bound; the actual minimum number of requests may be larger.
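In code, the total combination count and the lower bound look roughly like this (using the example variables):

import math

variables = {
    "sex": ["total", "women", "men"],
    "country of birth": ["Norway", "Finland", "Sweden", "Denmark"],
    "year": ["2019", "2020", "2021", "2022", "2023"],
}
row_limit = 13

# All possible rows in the dataset
total_combinations = math.prod(len(v) for v in variables.values())  # 3 * 4 * 5 = 60

# Lower bound on the number of requests needed
min_requests = math.ceil(total_combinations / row_limit)  # ceil(60 / 13) = 5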
Objective: Develop a function that creates an optimal list of dictionaries for request configurations. Each dictionary should result in no more than the maximum allowed number of rows (e.g., 13). Together, the dictionaries should cover all possible combinations without overlaps, while minimizing the total number of dictionaries (and hence requests); a rough sketch of the desired interface follows the constraint list below.
Constraints:
- Each dictionary results in no more than the given limit of rows/combinations.
- The list of dictionaries collectively covers all possible rows/combinations.
- There are no duplicate rows/combinations across dictionaries.
- The total number of dictionaries is minimized while adhering to the above constraints.
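Concretely, I'm after something with roughly this shape (the function name and signature here are just illustrative):

def build_requests(variables: dict[str, list[str]], row_limit: int) -> list[dict[str, list[str]]]:
    """Return request dicts that together cover every combination exactly once,
    each returning at most row_limit rows, using as few requests as possible."""
    ...

# For the example dataset: build_requests(variables, row_limit=13) should return
# a list of request dicts like the "Example request params" above, jointly
# covering all 60 combinations.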
I am looking for advice on algorithms or strategies that could efficiently solve this problem, taking into consideration the need to minimize the number of API requests while adhering to the specified limitations. Examples of similar implementations or pointers to relevant algorithms would be greatly appreciated.
Like I formulated above (with poor wording), it's all about partitioning. The problem is reminiscent of the classic knapsack problem, but a lot more complex since the weights and values are not static. Since we require all data/row combinations, every element of every input list must be included in some request.

So the problem comes down to: how many ways are there to divvy up the values of each variable? Quite a lot, but the only thing we actually need to consider is the lengths of the input lists, since we are going to split them up anyway. So instead of considering ALL possible configurations, we only consider the number of ways to evenly divide a list of length X into consecutive chunks (with a potential exception for the last chunk), and it turns out there are at most X such ways. If we split a list into even parts of length "size" (except possibly the last part), the number of batches/parts that gives is number_of_batches = math.ceil(len(variable_values) / size).

Why do we care how each list is split? Because each individual value in an input list must be matched with ALL values from the other variables in some request if we want to retrieve all dataset rows. That means the number of requests we end up with (while guaranteeing that we generate all possible combinations) is simply the product number_of_batches(variable 1) * number_of_batches(variable 2) * number_of_batches(variable 3) * ... Given these partitions, we can also calculate the number of rows/combinations a request with these specific partitions will return, which is the product of the corresponding "size" values (the ones used for calculating number_of_batches) for each variable.

It is then a matter of finding the partitioning configuration that results in the fewest requests, as close to the lower bound as possible, while keeping each request's row/combination count at or below the limit. The provided solution has room for optimization, but tested with the largest of the 50 000 tables I need to scrape, it returned the optimal request configuration in around 0.18 s, which is good enough for my extreme cases.
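To make this concrete, here is a minimal brute-force sketch of the idea described above (not my exact posted solution; the function and variable names are my own). It tries every combination of chunk sizes, keeps the one that minimizes the request count under the row limit, and then emits the Cartesian product of the chunks:

import math
from itertools import product

def chunks(values, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def build_requests(variables, row_limit):
    names = list(variables)
    lengths = [len(variables[n]) for n in names]
    lower_bound = math.ceil(math.prod(lengths) / row_limit)

    best_sizes, best_count = None, None
    # Try every combination of chunk sizes (one size per variable).
    for sizes in product(*(range(1, min(l, row_limit) + 1) for l in lengths)):
        if math.prod(sizes) > row_limit:
            continue  # a single request would exceed the row limit
        n_requests = math.prod(math.ceil(l / s) for l, s in zip(lengths, sizes))
        if best_count is None or n_requests < best_count:
            best_sizes, best_count = sizes, n_requests
            if n_requests == lower_bound:
                break  # cannot do better than the lower bound

    # One request per combination of chunks: covers every row exactly once.
    batched = (chunks(variables[n], s) for n, s in zip(names, best_sizes))
    return [dict(zip(names, combo)) for combo in product(*batched)]

With the example variables and row_limit = 13, this returns 5 requests (chunk sizes 3, 4 and 1 for sex, country of birth and year), each returning 12 rows, which matches the lower bound of 5. The brute force over chunk sizes is fine for a handful of variables; that search is where the real implementation leaves room for optimization.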
Print out