Unzipping large files and validating for duplicates - how to do this?

Hi friends

I was wondering how to go about the workflow below. Any advice on the use of AWS Lambda, and its suitability for this case — particularly unzipping large files — would be much appreciated. (AWS Lambda is by no means a prescriptive requirement; I just want to unzip files and run validations against them.)

  1. The user uploads files. These could be compressed (.zip, .rar, .7z) or uncompressed. The files may vary in size, from 1 MB to perhaps a couple of GBs.
  2. Ideally these files would be uploaded directly to the bucket.
  3. If the files are zipped, I would want them unzipped (perhaps with AWS Lambda?) and presented to the user in a form. The user may then add further information about those files in form fields, which will be persisted via ActiveRecord.
  4. I do not want the user to upload duplicate files. For example, if a file was already uploaded last week, I’d want some mechanism preventing the upload. This would presumably involve querying the back end, or some validation logic at this point.

I was seeking your advice on how to go about the above, particularly the part about unzipping files. What is the best way to handle this in a Rails app hosted on Heroku (with its limitations on storing data)?

Any thoughts or suggestions would be much appreciated.

thank you

Ben

I don’t have any experience with AWS Lambda, but I can give some ideas off the top of my head:

  1. If uploads can be up to a couple of GBs, consider making the uploads resumable.
  2. Yes, this is possible with presigned uploads.
  3. You can use rubyzip for unzipping; it supports working with streams, so you don’t need to touch the disk. You can do the unzipping in a background job, then attach the contained files to your record (see the Multiple Files guide on how to attach multiple files to a record); there’s a rough sketch after this list.
  4. You can calculate a signature for the attached files, store the value in a separate column, and add a uniqueness constraint/validation on that column (also sketched below).
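
For point 3, a minimal sketch of the background job, assuming Shrine is used, the zip is attached to an Upload record as `archive`, and each extracted file becomes a child DocumentFile record with its own `file` attachment (all of these names are hypothetical). Note that rubyzip’s streaming Zip::InputStream can’t read every archive (some require the central directory), in which case you’d fall back to downloading the zip to a tempfile and using Zip::File:

```ruby
require "zip"

class UnzipJob < ActiveJob::Base
  def perform(upload_id)
    upload = Upload.find(upload_id)

    # Shrine's UploadedFile#open streams the object from S3,
    # so the whole archive never has to touch Heroku's disk.
    upload.archive.open do |archive_io|
      Zip::InputStream.open(archive_io) do |zip|
        while (entry = zip.get_next_entry)
          next if entry.directory?

          # write each entry to a tempfile, then attach it to a
          # child record (see the Multiple Files guide)
          Tempfile.create(File.basename(entry.name)) do |tempfile|
            tempfile.binmode
            tempfile.write(zip.read) # loads one entry into memory; chunk for huge entries
            tempfile.rewind

            upload.document_files.create!(file: tempfile)
          end
        end
      end
    end
  end
end
```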
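
And for point 4, assuming the signature is stored in an `md5` column on the `uploads` table (both names hypothetical), the duplicate check is a unique index at the database level plus a uniqueness validation for a friendly error message:

```ruby
class AddMd5ToUploads < ActiveRecord::Migration[6.0]
  def change
    add_column :uploads, :md5, :string
    add_index  :uploads, :md5, unique: true # hard guarantee at the DB level
  end
end

class Upload < ActiveRecord::Base
  # surfaces a user-facing error instead of a raw constraint violation
  validates :md5, uniqueness: { message: "has already been uploaded" }
end
```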

Appreciate the response Janko. I will go through the links you’ve provided.

Hi, thank you Janko.

(I am very appreciative of the time and craftsmanship you have put into this gem.)

A question on point #4: suppose we use the presigned upload method, and the files are uploaded directly to the bucket. Does the bucket then return the metadata to the user’s form, which is then posted to the Rails/Roda back end to be persisted in the database? Or is the metadata calculated in the back end itself by querying the files in the bucket?

In other words, is the metadata calculated on upload directly to the bucket, or is it calculated in the back end after the files have already been uploaded to the bucket?

A pointer would be much appreciated.

rgds

Ben

Hi Ben,

As far as I can see, AWS S3 doesn’t have the ability to return an MD5 signature, so you’ll have to calculate one yourself.

One option is calculating the signature on the client side before the upload, and adding it to the uploaded file’s metadata hash that’s sent to the app. This wiki page has some examples of calculating an MD5 hash on the client side, though they’re geared towards sending it with the upload as a checksum, so I’m not sure how to calculate it separately from the upload, given that it’s asynchronous.

Another option is calculating the signature on the server side after the upload. If you have the restore_cached_data plugin loaded, you can have signature calculated synchronously on assignment. If the performance impact of downloading the file from S3 on assignment is significant, this section has alternatives on how to move that to a background job. For example, you could only extract cheap metadata synchronously, and calculate signature in a background job.