Pandas: speed up reverse_geocoder on a column with strings and coordinates


I have a dataframe with two columns like this one:

country_code    geo_coords
GB              nan
nan             [13.43, 52.48]
TR              nan
...

I want to fill the nan values in the country_code using the information from the geo_coords column.

To extract the country code from the coordinates I am using the reverse_geocoder module.

This is my code:

import reverse_geocoder as rg


def from_coords_to_code(coords):
    """Find the country code of coordinates.

    Args:
        coords: coordinates of the point in [lon, lat] format
    """
    # rg.search expects (lat, lon), so reverse the pair
    return rg.search(coords[::-1])[0]["cc"]


sub_df["country_code"].fillna(sub_df["geo_coords"], inplace=True)

sub_df["country_code"] = sub_df["country_code"].apply(
    lambda x: from_coords_to_code(x) if isinstance(x, list) else x
)

As I have thousands and thousands of rows, this code is extremely slow.

Following this other question, I tried applying the reverse geocoding to the whole geo_coords column after removing the nan values:

geo_coords = sub_df["geo_coords"].loc[sub_df["geo_coords"].notna()]
geo_coords_tuple = tuple(geo_coords.apply(lambda x: tuple(x[::-1])))
cc_new = rg.search(geo_coords_tuple, mode=2)
country_code = [i["cc"] for i in cc_new]

for i, j in enumerate(geo_coords.index):
    sub_df["country_code"].iloc[j] = country_code[i]

In this way it's faster, but it gives me the warning:

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub_df["country_code"].iloc[j] = country_code[i]

which I would like to avoid, and I am not sure this is an optimal solution.
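[Editor's note] The warning comes from the chained indexing in `sub_df["country_code"].iloc[j] = ...`. Writing all the results back with a single `.loc` assignment on the original index labels avoids both the warning and the Python-level loop. A minimal sketch of the pattern — `fill_country_codes` and its `search_fn` parameter are illustrative names, with `search_fn` standing in for something like `lambda c: rg.search(c, mode=2)`:

```python
import pandas as pd


def fill_country_codes(df, search_fn):
    """Fill missing country_code values from geo_coords in one batched call.

    search_fn takes a tuple of (lat, lon) pairs and returns a list of
    dicts with a "cc" key, like reverse_geocoder's rg.search does.
    """
    # Rows that have coordinates but no country code yet
    geo_coords = df.loc[
        df["country_code"].isna() & df["geo_coords"].notna(), "geo_coords"
    ]
    # geo_coords holds [lon, lat] (GeoJSON); the geocoder wants (lat, lon)
    coords_tuple = tuple((lat, lon) for lon, lat in geo_coords)
    results = search_fn(coords_tuple)
    # One .loc assignment on the original labels: no chained indexing,
    # hence no SettingWithCopyWarning
    df.loc[geo_coords.index, "country_code"] = [r["cc"] for r in results]
    return df
```

With real data this would be called as fill_country_codes(sub_df, lambda c: rg.search(c, mode=2)), keeping the single batched rg.search call from the snippet above.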

Any suggestion to make the whole code more efficient?

I am happy to move from "reverse_geocoder" to any other module.

IMPORTANT: the coordinates in geo_coords are in the GeoJSON format, i.e. [lon, lat]; this is the reason I invert them.


There is 1 answer

Andrej Kesely:

The function rg.search() is very slow, and it already uses multiple cores internally. I was able to speed up the searching a little by adding additional workers to the task using ProcessPoolExecutor, e.g.:

from concurrent.futures import ProcessPoolExecutor as Pool

import pandas as pd
import reverse_geocoder as rg
from tqdm import tqdm


def process_coord(tpl):
    idx, (lon, lat) = tpl
    # rg.search expects (lat, lon)
    return idx, rg.search((lat, lon))[0]["cc"]


if __name__ == "__main__":
    # sample dataframe:
    df = pd.DataFrame(
        {
            "country_code": ["GB", None, "TR"] * 10_000,
            "geo_coords": [None, [13.43, 52.48], None] * 10_000,
        }
    )

    with Pool(max_workers=2) as pool:
        mask = df["country_code"].isna()

        for i, result in tqdm(
            pool.map(process_coord, zip(df.index[mask], df.loc[mask, "geo_coords"])),
            total=mask.sum(),
        ):
            df.loc[i, "country_code"] = result

    print(df)

On my computer (AMD 5700x) this is doing ~17 searches per second.

  5%|████████▌                      | 507/10000 [00:29<09:12, 17.19it/s]
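Since rg.search() already accepts a whole sequence of (lat, lon) tuples, it may also be worth handing each worker one large batch instead of one coordinate per task, so the per-call overhead is paid n_workers times instead of once per row. A hedged sketch — chunk_indices, geocode_missing, and batch_search are illustrative names, not reverse_geocoder API, and this is not benchmarked against the version above; batch_search stands in for a top-level function wrapping rg.search(coords, mode=1) (mode=1, since worker processes should not spawn a pool of their own):

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def chunk_indices(index, n_chunks):
    """Split an index into at most n_chunks contiguous pieces."""
    if len(index) == 0:
        return []
    size = -(-len(index) // n_chunks)  # ceiling division
    return [index[i : i + size] for i in range(0, len(index), size)]


def geocode_missing(df, batch_search, n_workers=2):
    """Fill missing country codes, geocoding one large batch per worker.

    batch_search takes a tuple of (lat, lon) pairs and returns a list
    of country codes, e.g. a top-level function that returns
    [r["cc"] for r in rg.search(coords, mode=1)].
    """
    missing = df.index[df["country_code"].isna()]
    parts = chunk_indices(missing, n_workers)
    # geo_coords holds [lon, lat] (GeoJSON); the geocoder wants (lat, lon)
    batches = [
        tuple((lat, lon) for lon, lat in df.loc[p, "geo_coords"]) for p in parts
    ]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        for p, codes in zip(parts, pool.map(batch_search, batches)):
            # one .loc write-back per chunk, on the original index labels
            df.loc[p, "country_code"] = codes
    return df
```

Whether this beats the per-row version depends on how much of rg.search's cost is fixed per call versus per coordinate, so it is worth timing both on a small sample first.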