Memory is not freed in worker after job ends


Scenario:

I have a job running a process (Sidekiq) in production (Heroku). The process imports data (CSV) from S3 into a DB model using the activerecord-import gem, which does bulk insertion of the data. The dbRows variable therefore holds a considerable amount of memory: all the ActiveRecord objects built while iterating over the CSV lines (all good). Once the data is imported (via db_model.import dbRows), dbRows is cleared (or should be!) and the next object is processed.

Like this (script simplified for clarity):

def import
      ....
      s3_objects.contents.each do |obj|
          @cli.get_object({..., key: obj.key}, target: file)
          dbRows = []
          csv = CSV.new(file, headers: false)
          while line = csv.shift
              # >> here dbRows grows and grows and is never freed!
              dbRows << db_model.new(
                field1: line[0],   # (simplified: the real code maps CSV columns to model fields)
                field2: line[1],
                fieldN: line[2]
              )
          end
          db_model.import dbRows
          dbRows = nil   # attempt 1 to free the array
          GC.start       # attempt 2 to free the memory
      end
      ....
end

Issue:

Job memory grows while the process runs, BUT once the job is done the memory does not go down. It stays high forever and ever!

While debugging I found that dbRows never seems to be garbage collected, and I learned about RETAINED objects and how memory works in Rails. However, I have not yet found a way to apply that knowledge to solve my problem.
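
To check whether the model instances really are being retained after the import, I count live instances from a console attached to the worker (a quick sketch; DbModel stands in for my actual model class):

require 'objspace'

GC.start
ObjectSpace.each_object(DbModel).count   # should drop towards 0 if nothing retains the rows
ObjectSpace.memsize_of_all(DbModel)      # rough number of bytes held by those instances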

I would like that, once the job finishes, all references held in dbRows are garbage collected and the worker memory is freed.

Any help is appreciated.

UPDATE: I read about WeakRef but I don't know if it would be useful. Any insights there?
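
For context, WeakRef (from Ruby's standard library) wraps an object without preventing it from being garbage collected; a minimal sketch of the behaviour I'm reading about:

require 'weakref'

rows = Array.new(100_000) { Object.new }
ref  = WeakRef.new(rows)   # weak reference: does not keep rows alive
rows = nil                 # drop the only strong reference
GC.start

ref.weakref_alive?         # may now be false; calling methods on ref would then raise WeakRef::RefError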


1 Answer

Answered by Kache

Try importing lines from the CSV in batches, e.g. import lines into the DB 1,000 at a time, so you're not holding onto previous rows and the GC can collect them. This is good for the database in any case (and for the S3 download too, if you hand CSV the IO object from S3):

s3_io_object = s3_client.get_object(*s3_obj_params).body          # IO-like response body from S3
csv = CSV.new(s3_io_object, headers: true, header_converters: :symbol)
csv.each_slice(1_000) do |row_batch|                               # hold at most 1,000 rows at a time
  db_model.import ALLOWED_FIELDS, row_batch.map(&:to_h), validate: false
end

Note that I'm not instantiating AR models either, to save on memory; I'm only passing in hashes and telling activerecord-import to use validate: false.

Also, where does the file reference come from? It seems to be long-lived.

It's not evident from your example, but is it possible that references to the objects are still being held globally by a library or extension in your environment?

Sometimes these things are very difficult to track down, as any code from anywhere that's called (including external library code) could do something like:

Dynamically defining constants, which never get GC'd:

Any::Module::Or::Class.const_set('NewConstantName', :foo)

or adding data to anything referenced/owned by a constant:

SomeConstant::Referenceable::Globally.array << foo # array will only get bigger and contents will never be GC'd

Otherwise, the best you can do is use some memory profiling tools, either inside of Ruby (memory profiling gems) or outside of Ruby (job and system logs) to try and find the source.
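
If you go the in-Ruby route, the memory_profiler gem can show which call sites still retain objects after a block runs; a minimal sketch (import_one_s3_object is a placeholder for whatever code processes a single CSV file):

require 'memory_profiler'

report = MemoryProfiler.report do
  import_one_s3_object   # placeholder: the code that processes one CSV file
end

# the "retained" sections list objects still alive after the block, grouped by gem/file/location
report.pretty_print(to_file: 'memory_report.txt')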