Dynamic S3 keys and metadata compression

I’m migrating from an existing S3 media system to Shrine, but I have ~130M records and I’m looking for ways to shrink the storage size of the metadata (and the best place to implement that). I can think of three ways to reduce the size of the hash:

  1. Replace hash keys with shorter forms (e.g. ‘metadata’ -> ‘md’, ‘derivatives’ -> ‘dv’, etc.)
  2. Don’t store static mime types, e.g. if an attachment is always PNG don’t save the mime type
  3. Legacy S3 keys are organized by the model that contains them. Instead of calculating the full key and storing it, generate dynamic keys on the fly

#1 is easy enough and can be implemented at the Mongoid field level.
#2 is slightly trickier. It can be handled by Mongoid, but would it be better implemented in an Attachment or Uploader?
#3 is the hardest, as it requires the attachment to have visibility into the model it’s attached to. I think this requires a custom Attacher for each legacy field, but I’m not sure how best to override the key generation.

Where is the best place to implement metadata transform logic like this? Are there any examples of people configuring Shrine to use legacy S3 keys?

Thanks …

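For #1 and #2, you can override Attacher#data and Attacher#load_data, which convert the attachment to and from the hash that gets persisted. The example below covers #1: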

class Shrine::Attacher
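  # Called when serializing the attachment; returns the abbreviated hash to persist.
  def data
    data = super

    if data
      result = abbrev_data(data)

      # Only write "dv" when derivatives exist, so no nil-valued key is stored.
      if data["derivatives"]
        result["dv"] = data["derivatives"].transform_values do |derivatives_data|
          abbrev_data(derivatives_data)
        end
      end
    end

    result
  end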

  def load_data(data)
    if data
      result                = deabbrev_data(data)
      result["derivatives"] = data["dv"]&.transform_values do |derivatives_data|
        deabbrev_data(derivatives_data)
      end
    end

    super(result)
  end

  private

  def abbrev_data(data)
    {
      "id"      => data.fetch("id"),
      "storage" => data.fetch("storage"),
      "md"      => data.fetch("metadata"),
    }
  end

  def deabbrev_data(data)
    {
      "id"       => data.fetch("id"),
      "storage"  => data.fetch("storage"),
      "metadata" => data.fetch("md"),
    }
  end
end
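
For #2, the same pair of hooks could also drop a metadata value that is constant for a given attachment and put it back on load. The sketch below is plain Ruby with no Shrine dependency. `STATIC_MIME_TYPE`, `abbrev_metadata`, and `deabbrev_metadata` are hypothetical names, and it assumes every file for this attachment is a PNG:

```ruby
# Hypothetical constant: assumes this attachment only ever stores PNG files.
STATIC_MIME_TYPE = "image/png"

# Drop the mime type before saving when it matches the known static value.
def abbrev_metadata(metadata)
  return metadata unless metadata["mime_type"] == STATIC_MIME_TYPE

  metadata.reject { |key, _value| key == "mime_type" }
end

# Put the static mime type back when loading, unless one was stored explicitly.
def deabbrev_metadata(metadata)
  { "mime_type" => STATIC_MIME_TYPE }.merge(metadata)
end

original = { "filename" => "a.png", "size" => 123, "mime_type" => "image/png" }
stored   = abbrev_metadata(original)  # no "mime_type" key is persisted
restored = deabbrev_metadata(stored)  # "mime_type" is filled back in
```

In the Attacher override, these would presumably wrap the "md" value inside abbrev_data and deabbrev_data. A file whose mime type differs from the static value keeps its stored value, so an unexpected upload is not silently relabelled.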

For #3, I generally recommend against that approach, because it can cause you to accidentally invalidate your links if you happen to change a model attribute used for generating the location.

That being said, you can use the system above to abbreviate the id value, as you have access to the record instance:

  def abbrev_data(data)
    result = { ... }
    result["id"] = abbrev_id(result["id"]) if data["storage"] == "store"
    result
  end

  def deabbrev_data(data)
    result = { ... }
    result["id"] = deabbrev_id(result["id"]) if result["storage"] == "store"
    result
  end

  def abbrev_id(id)
    File.basename(id) # shorten just to the unique identifier
  end

  def deabbrev_id(id)
    "#{record.some}/#{record.stuff}/#{id}" # retrieve the actual location
  end
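
To make the trade-off concrete, here is a self-contained sketch of the id round trip. `Record`, `owner_type`, and `owner_id` are hypothetical stand-ins for `record.some` and `record.stuff`; the real attributes depend on how the legacy keys are laid out. Unlike the Attacher methods above, `deabbrev_id` takes the record as an argument here so the snippet runs on its own:

```ruby
# Hypothetical stand-in for the model; the attributes are illustrative only.
Record = Struct.new(:owner_type, :owner_id)

# Keep only the last path segment of the stored location.
def abbrev_id(id)
  File.basename(id)
end

# Rebuild the full location from the record's current attributes.
def deabbrev_id(record, id)
  "#{record.owner_type}/#{record.owner_id}/#{id}"
end

record   = Record.new("albums", 42)
short_id = abbrev_id("albums/42/cover.png")
full_id  = deabbrev_id(record, short_id)

# Changing an attribute that feeds the location changes the rebuilt key.
record.owner_id = 43
moved_id = deabbrev_id(record, short_id)
```

After the attribute change, the rebuilt key no longer matches the location the file was uploaded to. That is the link invalidation mentioned earlier, and it applies unless the S3 object is moved along with the record.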