We have an Erlang/Elixir application (on Erlang 18 / ERTS 7.3.1) that processes large JSON payloads.
Here's a typical workflow:
A listener gets a token from RabbitMQ and sends it to a GenServer.
The GenServer puts the token into an ETS table with a future time (current + n secs). A scheduled job in the GenServer picks up expired tokens from ETS and launches several short-lived processes with these tokens.
These short-lived processes download 30-50k JSON payloads from Elasticsearch (using hackney), process them, and upload the results back to Elasticsearch; after that each process dies immediately. We keep track of these processes and have confirmed they die. We process 5-10 of these requests per second.
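For context, a minimal sketch of one of those short-lived workers, under assumed names (Indexer.Worker, the Elasticsearch URLs, and process_payload/1 are placeholders, not our real code):

    defmodule Indexer.Worker do
      # Hypothetical module; search_url/1, upload_url/1 and process_payload/1 stand in
      # for our real Elasticsearch endpoints and transformation logic.
      def run(token) do
        # Download the JSON payload for this token from Elasticsearch via hackney.
        {:ok, 200, _headers, ref} = :hackney.get(search_url(token), [], "", [])
        {:ok, body} = :hackney.body(ref)

        result =
          body
          |> Poison.decode!()
          |> process_payload()
          |> Poison.encode!()

        # Upload the result, then the process simply returns and dies.
        {:ok, _status, _headers, up_ref} =
          :hackney.post(upload_url(token), [{"content-type", "application/json"}], result, [])
        {:ok, _} = :hackney.body(up_ref)

        :ok
      end

      defp search_url(token), do: "http://elasticsearch:9200/payloads/_search?q=token:#{token}"
      defp upload_url(_token), do: "http://elasticsearch:9200/results/_bulk"
      defp process_payload(doc), do: doc
    end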
The problem: we see an ever-growing binary space, and within 48 hours it grows to a couple of GB (seen via :observer and debug prints). A manual GC has no impact either.
We already added recon and ran :recon.bin_leak, but this only frees up a few KB and has no effect on the ever-growing binary space.
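For reference, this is roughly how we poke at it from a remote shell (a diagnostic sketch; :recon is the recon dependency mentioned above):

    # Sample the refc-binary allocator, in KB (the same kind of number logged further down).
    IO.puts("binary space: #{:erlang.memory(:binary) / 1024} KB")

    # Ask recon for the top 10 processes ranked by refc binaries released after a forced GC.
    :recon.bin_leak(10)

    # Brute force: garbage-collect every process on the node.
    Enum.each(Process.list(), &:erlang.garbage_collect/1)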
Stack: Erlang 18 / ERTS 7.3.1, Elixir 1.3.4, hackney 1.4.4, poison 2.2.0, timex 3.1.13, etc. None of these apps is holding the memory either.
Has anyone come across a similar issue in the past? Would appreciate any solutions.
Update 9/15/2017:
We updated our app to Erlang 19/ERTS 8.3 and the hackney and poison libs to their latest versions; still no progress. Here is some logging from a GenServer that periodically sends a message to itself, using spawn/receive or send_after. On each handle_info it looks up an ETS table; if it finds any "eligible" entries it spawns new processes, otherwise it just returns {:noreply, state}. We print the VM's binary space (in KB) at entry to that function; the log is listed below, and a sketch of the loop follows it. This is a "quiet" time of the day, yet you can see the gradual increase of the binary space. Once again, :recon.bin_leak(N) and :erlang.garbage_collect() had no impact on this growth.
11:40:19.896 [warn] binary 1: 3544.1328125
11:40:24.897 [warn] binary 1: 3541.9609375
11:40:29.901 [warn] binary 1: 3541.9765625
11:40:34.903 [warn] binary 1: 3546.2109375
--- some processing ---
12:00:47.307 [warn] binary 1: 7517.515625
--- some processing ---
12:20:38.033 [warn] binary 1: 15002.1328125
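Here is a stripped-down sketch of that polling GenServer (the module name, table name, interval, and worker module are stand-ins, reusing Indexer.Worker from the sketch above):

    defmodule Indexer.Scheduler do
      use GenServer
      require Logger

      # Placeholder names: the real table, interval and worker module differ.
      @table :pending_tokens
      @interval 5_000

      def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)

      def init(state) do
        :ets.new(@table, [:named_table, :public, :set])
        Process.send_after(self(), :tick, @interval)
        {:ok, state}
      end

      # The listener hands tokens over; we store each with a "due" timestamp.
      def handle_cast({:schedule, token, delay_secs}, state) do
        :ets.insert(@table, {token, System.system_time(:seconds) + delay_secs})
        {:noreply, state}
      end

      def handle_info(:tick, state) do
        # This is the line that produces the "binary 1: ..." entries above.
        Logger.warn("binary 1: #{:erlang.memory(:binary) / 1024}")

        now = System.system_time(:seconds)

        # Pick up expired tokens and spawn a short-lived worker for each.
        for {token, due_at} <- :ets.tab2list(@table), due_at <= now do
          :ets.delete(@table, token)
          spawn(fn -> Indexer.Worker.run(token) end)
        end

        Process.send_after(self(), :tick, @interval)
        {:noreply, state}
      end
    end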
We never had a situation like this in our old Scala/Akka app, which processes 30x the volume and has run for years without an issue or a restart. I wrote both apps.
We found that the memory leak came from a private reusable library that sends messages to Graylog; it uses its own function to compress that data before sending it via gen_udp.
Had I just used term_to_binary(data, [:compressed]) instead, I could have saved myself some headaches.
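A minimal sketch of that alternative, with a loopback UDP send standing in for the real Graylog endpoint (the payload shape, address and port are placeholders):

    # Compress in a single BIF call instead of driving a zlib stream by hand.
    data = %{short_message: "example event", level: 6}
    packet = :erlang.term_to_binary(data, [:compressed])

    # The receiving side gets the original term back with binary_to_term/1.
    ^data = :erlang.binary_to_term(packet)

    # Ship it the way the library does, over UDP (loopback as a stand-in).
    {:ok, socket} = :gen_udp.open(0)
    :ok = :gen_udp.send(socket, {127, 0, 0, 1}, 12201, packet)
    :gen_udp.close(socket)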
Thanks for all the inputs and comments. Much appreciated!