Bulk import from CSV: memory usage

I’m facing a problem importing over 10,000 files with Shrine.

I have the data in a CSV file, and for every record I need to attach a picture, but the memory usage of the import script grows very fast. After a couple of thousand records it’s well over 6 GB.

When importing other kinds of records (also with images) I didn’t experience this issue.

Can anyone help?

This is the code of the importer:

def populate_imagebase
  uploader = ImageUploader.new(:store)
  imagebase_file = Rails.root.join('lib', 'seeds', 'imagebase.csv')
  CSV.foreach(imagebase_file, headers: true, encoding: 'UTF-8').with_index do |row, index|
    next if ImageBase.exists?(row['fid'].to_i)

    puts "Importing image #{index} "

    ImageBase.new(
      id: row['fid'],
      opera_it: row['opera'],
      opera_en: row['artwork'],
      year: Year.find_by(nome_it: row['anno']),
      thematic_area: ThematicArea.find_by(nome_it: row['area_tematica']),
      city: City.find_by(nome_it: row['citta']),
      state: State.find_by(nome_it: row['stato']),
      data: row['data'],
      fotografo: row['fotografo'],
      collocazione: row['collocazione'],
      immagine_it: row['immagine'],
      immagine_en: row['image']
    ) do |i|
      i.save!

      i.keywords = row['keywords'].split(/,/) unless row['keywords'].nil?
      a = Author.find_by(name: row['nome'], surname: row['cognome'])
      i.authors << a unless a.nil?
      a2 = Author.find_by(name: row['nome2'], surname: row['cognome2'])
      i.authors << a2 unless a2.nil?
      unless row['filename'].nil?
        if File.file?('../imagebase_images/' + row['filename'])
          puts File.basename('../imagebase_images/' + row['filename'])
          # Open with a block so the file handle is closed after the upload.
          File.open('../imagebase_images/' + row['filename'], 'r') do |file|
            i.image = uploader.upload(file)
          end
          i.image_derivatives!
        end
      end

      c = City.find_by nome_it: row['citta']
      i.city = c unless c.nil?
      y = Year.find_by nome_it: row['anno']
      i.year = y unless y.nil?
      t = ThematicArea.find_by nome_it: row['area_tematica']
      i.thematic_area = t unless t.nil?
      s = State.find_by nome_it: row['stato']
      i.state = s unless s.nil?

      i.save!
    end
  end

end

I run the script with rails runner.

Thanks,
Frank

I’ve put effort into optimizing memory allocations for uploads and downloads, but I could only go so far. Which Ruby and Shrine versions are you using, and which storage? Have you done some memory profiling yourself, e.g. using the memory_profiler gem?
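
For example, something like this (a minimal sketch; it assumes the memory_profiler gem is available):

require "memory_profiler"

# Wrap a representative slice of the import and print where memory
# is allocated and retained.
report = MemoryProfiler.report do
  populate_imagebase # or just the first few hundred rows
end

report.pretty_print(scale_bytes: true)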

Hi @janko

ruby 2.7.1p83
Rails 6.0.3.2
I’m using FileSystem as storage

I still haven’t profiled it.

What I can’t understand is that in another import the memory is freed without issues; the only difference is that the image holder is an associated record created during the import.

This is the code for the other model:

def populate_projects
  uploader = ImageUploader.new(:store)
  # deserialize images -> .split(/\^/).reject!{|c|c.empty?}
  data = File.read(Rails.root.join('lib', 'seeds', 'progetti.csv'))
  csv = CSV.parse(data, headers: true, encoding: 'UTF-8')
  csv.each do |row|
    # puts row
    Project.new do |p|
      p.nome_it = row['progetto']
      p.nome_en = row['progetto']
      p.save
      unless row['immagine'].nil?
        if File.file?('../progetti_images/' + row['immagine'])
          # Open with a block so the file handle is closed after the upload.
          File.open('../progetti_images/' + row['immagine']) do |file|
            img = Image.new
            img.image = uploader.upload(file)
            img.image_derivatives!
            p.cover = img
          end
        end
      end
      p.save!
    end
  end
end

Any ideas?

I will try to profile the script (I’m not an expert on this).

Thanks

I’ve changed the logic: I now cycle over the records instead of the CSV data, process them in batches, and run the task through rake.
Things are better, but still strange.
Memory usage is around 1 GB now, but memory_profiler is reporting a whole different story.

Here is the new code:

  task import: :environment do
    imagebase_file = Rails.root.join('lib', 'seeds', 'imagebase.csv')
    data = File.read(imagebase_file)
    uploader = ImageUploader.new(:store)
    csv = CSV.parse(data, headers: true, encoding: 'UTF-8')
    imagebase_root = Rails.root.join('..', 'imagebase_images')

    ImageBase.find_in_batches(batch_size: 100) do |images|
      report = MemoryProfiler.report do
        ImageBase.transaction do
          images.each do |image|
            # Linear scan of the parsed CSV for the matching row.
            row = csv.find { |r| r['fid'] == image.id.to_s }
            path = imagebase_root + row['filename']
            if File.file?(path)
              puts File.basename(path)
              File.open(path, 'r') do |file|
                image.image = file
                image.image_derivatives!
              end
              image.save
            else
              puts "image not found #{row['filename']}"
            end
          end
        end

        # GC.start(full_mark: true, immediate_sweep: true)
        # puts "======================================================"
        # puts GC.stat(:total_freed_objects)
        # puts GC.stat(:total_allocated_objects)
        # puts "======================================================"
      end
      report.pretty_print(to_file: 'import_log', scale_bytes: true, detailed_report: true)
    end
  end

Thanks

Streaming content from one file on disk to another should have constant memory usage, unless IO.copy_stream is not freeing some memory.
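
For illustration, this is roughly the pattern involved (a simplified sketch, not the exact internals of Shrine's FileSystem storage):

source = File.open("source.jpg", "rb")
destination = File.open("copy.jpg", "wb")

# Copies in fixed-size chunks, so resident memory should stay flat
# regardless of the file size.
IO.copy_stream(source, destination)

source.close
destination.close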

What does memory_profiler say?

This is one of the logs with the highest memory consumption (I’ve lost the others).

Tell me what you need.

Total allocated: 293.96 MB (3655849 objects)
Total retained:  13.50 MB (98432 objects)

allocated memory by gem
-----------------------------------
  58.07 MB  actionpack-6.0.3.2
  57.97 MB  activesupport-6.0.3.2
  50.11 MB  activemodel-6.0.3.2
  48.96 MB  activerecord-6.0.3.2
  11.50 MB  railties-6.0.3.2
   9.30 MB  shrine-3.3.0
   9.28 MB  fastimage-2.2.0
   7.50 MB  other
   6.70 MB  fileutils
   6.68 MB  bootsnap-1.4.8
   5.82 MB  zeitwerk-2.4.0
   5.29 MB  site/lib
   4.43 MB  csv
   2.74 MB  site/app
   2.18 MB  rails_admin-2.0.2
   1.82 MB  pathname
 866.40 kB  uri
 803.49 kB  activejob-6.0.3.2
 749.67 kB  json
 508.08 kB  delegate
 417.84 kB  actionview-6.0.3.2
 365.60 kB  globalid-0.4.2
 357.60 kB  set
 352.00 kB  concurrent-ruby-1.1.7
 221.80 kB  kaminari-activerecord-1.2.1
 208.00 kB  actionmailbox-6.0.3.2
 179.74 kB  logger
 124.00 kB  activestorage-6.0.3.2
  86.40 kB  tzinfo-1.2.7
  79.40 kB  securerandom
  64.00 kB  weakref
  57.60 kB  mutex_m
  41.60 kB  actionmailer-6.0.3.2
  24.80 kB  actioncable-6.0.3.2
  24.80 kB  sprockets-rails-3.2.1
  24.00 kB  mimemagic-0.3.5
  16.00 kB  marcel-0.3.3
  10.40 kB  monitor
   7.30 kB  actiontext-6.0.3.2
   4.00 kB  rails_admin_import-2.2.0
   4.00 kB  time

Please include the full memory_profiler report, as the rest of it usually contains the crucial information for determining the main source of memory allocations. You can put it into a gist.

The part you’ve included is a bit surprising to me. The script you’ve executed doesn’t seem to be related to any controller context, yet Action Pack is allocating the most memory.

Notice how Shrine is in general low on the memory allocation list compared to Rails components. This indicates to me that the source of high memory usage is somewhere in Rails, rather than in Shrine.

here the full report

OK, so from the second section of the output you can see that most of the allocated memory is coming from Active Model attributes, which are loaded by your Active Record instances. So that’s pure Active Record overhead.

Then we see Action Dispatch routing, which is probably the Rails routes being somehow loaded/called in your script.

Then it’s Active Record logging, Active Support backtrace cleaning, callbacks, etc., all of which is still pure Rails.

We see some Shrine and FastImage, but it’s pretty small memory usage. So, as I predicted, most of your memory usage is coming from Rails itself, not Shrine. You should look into optimizing those points.
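
A couple of generic tweaks tend to help with import scripts like this (a sketch; it assumes the attachment column follows Shrine's default image_data naming):

# Silence SQL logging for the duration of the import; otherwise Rails
# allocates a formatted log line for every single query.
ActiveRecord::Base.logger = nil

# Load only the columns the import actually needs, instead of
# instantiating every attribute of every record.
ImageBase.select(:id, :image_data).find_in_batches(batch_size: 100) do |images|
  # ... attach the files as before ...
end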

Thank you @janko,
I thought the same.
My impression is that, for some reason, the script isn’t freeing the memory.
The strange thing is that if I don’t save the object, or if I run the import without attaching the file, the memory is freed normally.

If you have more suggestions, please share them.

Thank you again,
Frank

Hmm, that’s curious then. Perhaps memory_profiler isn’t detecting potential memory leaks happening in C code (e.g. in IO.copy_stream I mentioned earlier).

So the problem could be in the Ruby language itself; it does sound like a memory leak. I suggest you try to come up with a minimal self-contained script that reproduces it.
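
Something along these lines could be a starting point (a sketch using bundler/inline; the sample file path is a placeholder):

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "shrine", "~> 3.3"
end

require "shrine"
require "shrine/storage/file_system"

Shrine.storages = {
  cache: Shrine::Storage::FileSystem.new("uploads", prefix: "cache"),
  store: Shrine::Storage::FileSystem.new("uploads", prefix: "store"),
}

uploader = Shrine.new(:store)

# Upload the same file repeatedly and watch the process RSS
# (ps works on Linux/macOS); if it keeps growing without Rails
# in the picture, the leak is below the Rails layer.
10_000.times do |i|
  File.open("sample.jpg", "rb") { |file| uploader.upload(file) }
  puts "#{i}: #{`ps -o rss= -p #{Process.pid}`.strip} kB" if (i % 100).zero?
end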

I’ll try, but right now I’m a bit busy with other projects.

Thanks