I'm currently working on a data scraping project where I need to retrieve large amounts of data from several statistical APIs. These APIs, particularly older and government ones, often do not support bulk downloads for entire datasets. Instead, they require specifying the variables and corresponding values you're interested in for each request. Additionally, many impose a limit on the number of values (or "rows") you can retrieve in a single request. This has led me to seek an optimal algorithm for constructing requests under these constraints.
Consider a dataset titled "Population by year, sex, and country", with variables and their possible values as follows:
Example dataset variables
{
"sex": ["total", "women", "men"],
"country of birth": ["Norway", "Finland", "Sweden", "Denmark"],
"year": ["2019", "2020", "2021", "2022", "2023"]
}
Each request must include at least one value per variable, and the number of rows returned is the product of the number of values selected for each variable.
Example request params
{
"sex": ["total", "women", "men"],
"country of birth": ["sweden", "denmark"],
"year": ["2023"]
}
This request would return 3 x 2 x 1 = 6 rows, covering each combination of the specified values:
sex, country of birth, year, value
total, Sweden, 2023, X
total, Denmark, 2023, X
women, Sweden, 2023, X
women, Denmark, 2023, X
men, Sweden, 2023, X
men, Denmark, 2023, X
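For reference, the row count of any request can be computed directly from the request dict. A minimal Python sketch, using the example request from above:

import math

request = {
    "sex": ["total", "women", "men"],
    "country of birth": ["Sweden", "Denmark"],
    "year": ["2023"],
}

# Rows returned = product of the number of selected values per variable
rows = math.prod(len(values) for values in request.values())
print(rows)  # 3 * 2 * 1 = 6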
For this example, the API limits the number of values/rows of data you can retrieve to 13 per request. row_limit = 13
It's clear that if we want to retrieve ALL the values from this table in as few requests as possible, the example above is not optimal, since it only retrieves 6 rows when the limit is 13. What is not clear is how to construct the API requests (a list of dicts) so that we don't perform more API calls than necessary.
The total number of combinations/values in a dataset, representing all possible rows we aim to retrieve, can be calculated by multiplying together the number of options available for each variable listed in the dictionary. In the case of our example variables:
total_combinations = 3 x 4 x 5 = 60
One thing we can note is that a lower bound on the minimum number of requests needed to retrieve all data combinations, given the per-request row limit, is the total number of rows (total_combinations) divided by the row limit (row_limit), rounded up. In our example:
min_requests ≥ ceil(total_combinations / row_limit) = ceil(60 / 13) = 5
This is only a lower bound; the actual minimum number of requests may be larger.
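In code, the total combination count and the lower bound look roughly like this (using the example variables):

import math

variables = {
    "sex": ["total", "women", "men"],
    "country of birth": ["Norway", "Finland", "Sweden", "Denmark"],
    "year": ["2019", "2020", "2021", "2022", "2023"],
}
row_limit = 13

# All possible rows in the dataset
total_combinations = math.prod(len(v) for v in variables.values())  # 3 * 4 * 5 = 60

# Lower bound on the number of requests needed
min_requests = math.ceil(total_combinations / row_limit)  # ceil(60 / 13) = 5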
Objective: Develop a function that creates an optimal list of dictionaries for request configurations. Each dictionary should result in no more than the maximum allowed number of rows (e.g., 13). Together, the dictionaries should cover all possible combinations without overlaps, while minimizing the total number of dictionaries (and hence requests); a rough sketch of the desired interface follows the constraint list below.
Constraints:
- Each dictionary results in no more than the given limit of rows/combinations.
- The list of dictionaries collectively covers all possible rows/combinations.
- There are no duplicate rows/combinations across dictionaries.
- The total number of dictionaries is minimized while adhering to the above constraints.
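Concretely, I'm after something with roughly this shape (the function name and signature here are just illustrative):

def build_requests(variables: dict[str, list[str]], row_limit: int) -> list[dict[str, list[str]]]:
    """Return request dicts that together cover every combination exactly once,
    each returning at most row_limit rows, using as few requests as possible."""
    ...

# For the example dataset: build_requests(variables, row_limit=13) should return
# a list of request dicts like the "Example request params" above, jointly
# covering all 60 combinations.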
I am looking for advice on algorithms or strategies that could efficiently solve this problem, taking into consideration the need to minimize the number of API requests while adhering to the specified limitations. Examples of similar implementations or pointers to relevant algorithms would be greatly appreciated.
Like I formulated above (with poor wording), it's all about partitioning. The problem is reminiscent of the classic knapsack problem, but a lot more complex since the weights and values are not static. Since we require all data/row combinations, every element of every input list must be included in some request.

So the problem comes down to: how many ways are there to divvy up the values of each variable? Quite a lot, but the only thing we actually need to consider is the lengths of the input lists, since we are going to split them up anyway. So instead of considering ALL possible configurations, we only consider the number of ways to evenly divide a list of length X into consecutive chunks (with a potential exception for the last chunk), and it turns out there are at most X such ways. If we split a list into even parts of length "size" (except possibly the last part), the number of batches/parts that gives is number_of_batches = math.ceil(len(variable_values) / size).

Why do we care how each list is split? Because each individual value in an input list must be matched with ALL values from the other variables in some request if we want to retrieve all dataset rows. That means the number of requests we end up with (while guaranteeing that we generate all possible combinations) is simply the product number_of_batches(variable 1) * number_of_batches(variable 2) * number_of_batches(variable 3) * ... Given these partitions, we can also calculate the number of rows/combinations a request with these specific partitions will return, which is the product of the corresponding "size" values (the ones used for calculating number_of_batches) for each variable.

It is then a matter of finding the partitioning configuration that results in the fewest requests, as close to the lower bound as possible, while keeping each request's row/combination count at or below the limit. The provided solution has room for optimization, but tested with the largest of the 50 000 tables I need to scrape, it returned the optimal request configuration in around 0.18 s, which is good enough for my extreme cases.
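To make this concrete, here is a minimal brute-force sketch of the idea described above (not my exact posted solution; the function and variable names are my own). It tries every combination of chunk sizes, keeps the one that minimizes the request count under the row limit, and then emits the Cartesian product of the chunks:

import math
from itertools import product

def chunks(values, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def build_requests(variables, row_limit):
    names = list(variables)
    lengths = [len(variables[n]) for n in names]
    lower_bound = math.ceil(math.prod(lengths) / row_limit)

    best_sizes, best_count = None, None
    # Try every combination of chunk sizes (one size per variable).
    for sizes in product(*(range(1, min(l, row_limit) + 1) for l in lengths)):
        if math.prod(sizes) > row_limit:
            continue  # a single request would exceed the row limit
        n_requests = math.prod(math.ceil(l / s) for l, s in zip(lengths, sizes))
        if best_count is None or n_requests < best_count:
            best_sizes, best_count = sizes, n_requests
            if n_requests == lower_bound:
                break  # cannot do better than the lower bound

    # One request per combination of chunks: covers every row exactly once.
    batched = (chunks(variables[n], s) for n, s in zip(names, best_sizes))
    return [dict(zip(names, combo)) for combo in product(*batched)]

With the example variables and row_limit = 13, this returns 5 requests (chunk sizes 3, 4 and 1 for sex, country of birth and year), each returning 12 rows, which matches the lower bound of 5. The brute force over chunk sizes is fine for a handful of variables; that search is where the real implementation leaves room for optimization.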
Print out