Parsing a GeoJSON file into a dataframe produces a single unwanted duplicate, unable to find cause


I am working with Berlin postal code data obtained from the following link: https://tsb-opendata.s3.eu-central-1.amazonaws.com/plz/plz.geojson

I am opening the file with requests.get().json() in Python 3 and parsing the data to build a list of representative points from the polygon boundaries with shapely. I then transform the coordinates to the Albers Equal Area projection to calculate each polygon's area, and finally combine the lists into a pandas dataframe of the postal codes, their representative points, and their areas. My code looks like this:

import numpy as np
import pandas as pd
import requests
import urllib
import io
import json
import lxml.html as lh
import bs4 as bs
import pyproj
from shapely.geometry import shape, Point
import shapely.ops as ops

urlzip = 'https://tsb-opendata.s3.eu-central-1.amazonaws.com/plz/plz.geojson'
berlinzip_json = requests.get(urlzip).json()

area = []
lats = []
lons = []
name = []

for feature in berlinzip_json['features']:
    # Postal code and a representative point inside each polygon
    name.append(feature['properties']['plz'])
    polygon = shape(feature['geometry'])
    p = polygon.representative_point()
    lons.append(p.x)
    lats.append(p.y)

    # Project to Albers Equal Area so the polygon area comes out in square metres
    geom_aea = ops.transform(
        pyproj.Proj(
            proj='aea',
            lat_1=polygon.bounds[1],
            lat_2=polygon.bounds[3]),
        polygon)
    a = geom_aea.area / 1000**2  # convert m^2 to km^2
    area.append(a)

postal_codes = pd.DataFrame(data={'Postal Code': name, 'Latitude': lats, 'Longitude': lons, 'Area': area})
postal_codes = postal_codes.iloc[0:191, :]
postal_codes.sort_values(['Postal Code'], ascending=True, axis=0, inplace=True)
postal_codes

In the output I always get the same result: one unwanted duplicate row.
(Screenshot: the duplicate row shown in the dataframe)

I suspected the entry might be duplicated within the GeoJSON file itself, but when I manually checked the file, the postal code in question appeared only once and there were no apparent abnormalities. (Screenshot: the entry as it appears in the GeoJSON file)
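A programmatic version of that check, counting how often each plz value occurs across the features of the berlinzip_json loaded above, would look something like this (just a sketch of the check, not part of the pipeline):

from collections import Counter

# Count how many features carry each postal code
plz_counts = Counter(f['properties']['plz'] for f in berlinzip_json['features'])

print(len(berlinzip_json['features']))                     # total number of features
print({plz: n for plz, n in plz_counts.items() if n > 1})  # postal codes appearing more than once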

I am at a loss for why this unwanted duplication occurs with the same postal code (14193) every time I run the code. I wonder if it is an issue in my own code, but I do not see anything that would produce one lone duplicate each time. For now I am simply deleting the duplicates after I run the code, so it has not been a major issue, but I can't help asking whether anyone else has experienced this and can offer some insight into what is causing it.
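Concretely, the clean-up I apply afterwards amounts to something like the following sketch (it assumes that dropping by postal code and keeping the first occurrence is acceptable):

# Drop repeated postal codes, keeping the first occurrence of each
postal_codes = postal_codes.drop_duplicates(subset='Postal Code', keep='first')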

Any thoughts and ideas would be greatly appreciated.

1 Answer

Answered by noob:

There is no problem with your code:

1- I manually searched the GeoJSON file for "plz": "14193" and found two features with that same postal code.

2- If you check len(berlinzip_json['features']) you will find it is 194, which matches the number of rows in your dataframe once you remove the line postal_codes = postal_codes.iloc[0:191,:] (a quick dataframe-side check is sketched below).

I think there is nothing wrong with your code.
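To see this from the dataframe side, a sketch (assuming the dataframe is built as in the question but without the iloc[0:191,:] trim):

# Rows whose postal code appears more than once in the untrimmed dataframe
dupes = postal_codes[postal_codes.duplicated(subset='Postal Code', keep=False)]
print(dupes)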