This is just a minimal test to reproduce a memory-leak issue on a remote Dask Kubernetes cluster:
```python
def load_geojson(pid):
    # imports inside the function so they are executed on the Dask workers
    import requests
    import sys

    r = requests.get(
        "https://github.com/datasets/geo-countries/raw/master/data/countries.geojson"
    )
    temp = r.json()
    size_temp = sys.getsizeof(temp)
    del temp
    return size_temp

L_geojson = client.map(load_geojson, range(200))
del L_geojson
```
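For context, `client` above is assumed to be a `dask.distributed.Client` connected to the Kubernetes-hosted scheduler; a minimal sketch with a placeholder address:

```python
from dask.distributed import Client

# placeholder scheduler address for the remote Kubernetes cluster
client = Client("tcp://<scheduler-address>:8786")
```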
Observation: a steady increase in worker memory (Bytes storage) of approximately 30 MB on each run, which keeps growing until all memory is used. In another test using `urllib` instead of `requests`, I observed a random increase and decrease in memory on each run.
Desired behavior: memory should be cleaned up after the reference `L_geojson` is deleted.
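One way to verify whether worker memory actually drops after the delete is to poll the workers directly (a sketch, assuming the same `client`; `worker_rss` is a hypothetical helper, and `psutil` is already a dependency of distributed):

```python
def worker_rss():
    # force a collection, then report this worker's resident memory in bytes
    import gc
    import os
    import psutil

    gc.collect()
    return psutil.Process(os.getpid()).memory_info().rss

# returns a dict of {worker address: RSS in bytes}
print(client.run(worker_rss))
```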
Could someone please help with this memory-leak issue?
I can confirm an increase in memory and "full garbage collections took X% CPU time recently" messages. If I allow the futures to run, memory also increases, but more slowly.
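"Allowing the futures to run" here means waiting for them to finish before dropping the reference, e.g. (a sketch using the `distributed` `wait`/`gather` API):

```python
from dask.distributed import wait

futures = client.map(load_geojson, range(200))
wait(futures)                   # block until every task has finished on the workers
sizes = client.gather(futures)  # collect the small integer results
del futures, sizes              # drop references so the scheduler can release the keys
```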
Using `fsspec` does not have this problem, as you found with `urllib`, and this is what Dask typically uses for its IO (`fsspec` switched from `requests` to `aiohttp` for communication). Your modified function might look like the following
Even with this change, you still get garbage collection warnings.