Converting 3.5 million latitude/longitude pairs to zip codes using Python in a reasonable amount of time


I have the following dataframe df1 of latitudes and longitudes:

    start_lat   start_lng
0   40.757306   -73.974518
1   40.663657   -73.963014
2   40.663657   -73.963014
3   40.663657   -73.963014
4   40.663657   -73.963014
... ... ...
3580761 40.679739   -73.934619
3580762 40.755733   -73.923661
3580763 40.710425   -73.965394
3580764 40.763189   -73.978434
3580765 40.755293   -73.967641

I'm trying to add a column of zipcodes using geopy as:

import geopy

geolocator = geopy.Nominatim(user_agent="http")

def get_zip_code(x, lat, lng):
    try:
        # Reverse-geocode "lat, lng" and pull the postcode out of the raw response
        location = geolocator.reverse("{}, {}".format(x[lat], x[lng]))
        return location.raw['address']['postcode']
    except Exception as e:
        print(e)
        return None

When I take the first 100 values, it takes about 50 seconds to execute.

df1_section = df1.iloc[:100]
df1_section['start_zipcode'] = df1_section.apply(lambda x: get_zip_code(x, 'start_lat', 'start_lng'), axis=1)
df1_section
    start_lat   start_lng   start_zipcode
0   40.757306   -73.974518  10037
1   40.663657   -73.963014  11225
2   40.663657   -73.963014  11225
3   40.663657   -73.963014  11225
4   40.663657   -73.963014  11225
... ... ... ...
95  40.753967   -73.998854  10019
96  40.705945   -74.013219  10006
97  40.708347   -74.017134  10280
98  40.705945   -74.013219  10006
99  40.731911   -74.008769  10014

I need to get zip codes for 3.5 million records, however. When I try computing the zip codes for the entire df1, I get a TimeOut error. Is there any way to do this process more quickly? I also have pairs of street names that correspond to each latitude and longitude, e.g., E 50 St & Park Ave, Flatbush Ave & Ocean Ave, Lewis Ave & Fulton St, etc. Would it be faster to somehow convert these pairs of street names to zip codes?

I also keep getting SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead.

1 Answer

Answered by Acccumulation:

The examples you give have several duplicates. If that's reflective of your actual data, you can save a lot of time with memoization. The simplest way to do this is probably to get a dataframe of the unique lat/long pairs, apply your function to that, then merge the result back into your original. If you then have another batch, you can reuse the results from the previous batch(es) and only apply your function to pairs you haven't seen yet.
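
A minimal sketch of that unique-pairs-then-merge idea, assuming the df1 columns and the get_zip_code function from your question:

# Geocode each unique (lat, lng) pair exactly once
unique_pairs = df1[['start_lat', 'start_lng']].drop_duplicates().reset_index(drop=True)

unique_pairs['start_zipcode'] = unique_pairs.apply(
    lambda x: get_zip_code(x, 'start_lat', 'start_lng'), axis=1
)

# Left-merge so every original row picks up the zip code of its lat/lng pair
df1 = df1.merge(unique_pairs, on=['start_lat', 'start_lng'], how='left')

For a later batch, concatenate the previously geocoded pairs with the new unique pairs first and only apply get_zip_code to the rows whose zip code is still missing.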

Looking at the documentation, I noticed that requests are synchronous by default. You can check whether the asynchronous mode is faster.
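
A minimal sketch of geopy's asynchronous mode (the AioHTTPAdapter pattern from the geopy docs); note that the public Nominatim server's usage policy allows roughly one request per second, so async mainly pays off if you run your own server:

# Async mode needs the aiohttp extra: pip install "geopy[aiohttp]"
import asyncio
from geopy.adapters import AioHTTPAdapter
from geopy.geocoders import Nominatim

async def reverse_all(coords):
    # One geolocator session shared across all lookups; each request is awaited
    async with Nominatim(user_agent="http", adapter_factory=AioHTTPAdapter) as geolocator:
        zips = []
        for lat, lng in coords:
            location = await geolocator.reverse("{}, {}".format(lat, lng))
            zips.append(location.raw['address'].get('postcode') if location else None)
        return zips

# Usage: pass an iterable of (lat, lng) tuples
zips = asyncio.run(reverse_all([(40.757306, -73.974518), (40.663657, -73.963014)]))

Against a self-hosted Nominatim instance you could fan the lookups out with asyncio.gather instead of awaiting them one by one.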

If you're doing a lot of lookups, doing a web query for each one may not be the best way to do it; a local option may be faster. You could try asking the post office whether there are such options. @wjandrea mentioned in the comments that there's an offline option here https://nominatim.org/release-docs/develop/admin/Installation/ . While they said this takes several days to set up, 3.5 million at 500 ms each would be about a month, so this may still be a better option.