I have a dataframe with two columns like this one:
country_code geo_coords
GB nan
nan [13.43, 52.48]
TR nan
...
I want to fill the nan values in the country_code column using the information from the geo_coords column.
To extract the country code from the coordinates I am using the reverse_geocoder module.
This is my code:
import reverse_geocoder as rg

def from_coords_to_code(coords):
    """Find the country code of coordinates.

    Args:
        coords: coordinates of the point in [lon, lat] format
    """
    # reverse_geocoder expects (lat, lon), so the GeoJSON pair is reversed
    return rg.search(coords[::-1])[0]["cc"]

# Copy the coordinates into the empty country_code cells, then convert them
sub_df["country_code"].fillna(sub_df["geo_coords"], inplace=True)
sub_df["country_code"] = sub_df["country_code"].apply(
    lambda x: from_coords_to_code(x) if isinstance(x, list) else x
)
As I have thousands and thousands of rows, this code is extremely slow.
Following this other question I was trying to apply the reverse geocoding to the whole geo_coords column after removing the nan values:
geo_coords = df["geo_coords"].loc[df["geo_coords"].notna()]
geo_coords_tuple = tuple(geo_coords.apply(lambda x: tuple(x[::-1])))
cc_new = rg.search(geo_coords_tuple, mode=2)
country_code = [i["cc"] for i in cc_new]
for i, j in enumerate(geo_coords.index):
    df["country_code"].iloc[j] = country_code[i]
In this way it's faster, but it gives me the warning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
sub_df["country_code"].iloc[j] = country_code[i]
which I would like to avoid, and I am not sure this is an optimal solution.
Any suggestions to make the whole code more efficient?
I am happy to move from "reverse_geocoder" to any other module.
IMPORTANT: the coordinates in geo_coords are in the GeoJSON format, i.e. [lon, lat]; this is the reason I invert them.
The function rg.search() is very slow and, in addition, already uses multiple cores. I was able to speed up the search a little by adding more workers to the task, using ProcessPoolExecutor. On my computer (AMD 5700x) this runs at ~17 searches per second.
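Independently of the number of workers, it may help to call the lookup once for the whole batch and only for distinct points. Below is a minimal sketch of that idea, not a tested drop-in: lookup_country_codes and fake_search are names I made up for illustration, and search is a stand-in parameter so the example runs without reverse_geocoder installed. In real use I would expect to pass rg.search, which accepts a list of (lat, lon) tuples and returns one dict with a "cc" key per point.

```python
def lookup_country_codes(coords, search):
    """Return one country code per [lon, lat] pair using a single batched lookup.

    coords: iterable of GeoJSON-style [lon, lat] pairs
    search: callable taking a list of (lat, lon) tuples and returning one
            dict with a "cc" key per point (assumed to behave like rg.search)
    """
    # GeoJSON order is [lon, lat]; the lookup expects (lat, lon) tuples.
    points = [(lat, lon) for lon, lat in coords]
    if not points:
        return []
    # Deduplicate while keeping order, so each distinct point is searched once.
    unique = list(dict.fromkeys(points))
    results = search(unique)
    code_of = {pt: res["cc"] for pt, res in zip(unique, results)}
    return [code_of[pt] for pt in points]


def fake_search(points):
    """Toy stand-in for rg.search: decides by longitude only, not real geocoding."""
    return [{"cc": "DE" if lon < 20 else "TR"} for lat, lon in points]


codes = lookup_country_codes(
    [[13.43, 52.48], [32.85, 39.93], [13.43, 52.48]], fake_search
)
print(codes)  # ['DE', 'TR', 'DE']

# With pandas, the result could be written back in one labelled assignment
# instead of the row-by-row .iloc loop (assuming df is not itself a slice):
#     mask = df["country_code"].isna() & df["geo_coords"].notna()
#     df.loc[mask, "country_code"] = lookup_country_codes(
#         df.loc[mask, "geo_coords"], rg.search
#     )
```

The single df.loc[mask, "country_code"] assignment avoids the chained df["country_code"].iloc[j] indexing that triggers the warning. If df was itself sliced from another frame (like sub_df in the question), the warning can still appear unless that slice was created with .copy().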