Derivatives and persistence

OK, I’m afraid this is another long one.

At first, reading the derivatives docs it wasn’t initially clear to me that, when used with an activerecord (or presumably sequel) model, none of the operations in there would persist any derivatives. Calling model.file_attacher.create_derivatives or model.file_attacher.add_derivative(:thumb, thumbnail_file result in a model with unsaved changes.

After thinking about it, I get that shrine tries not to assume persistence at all, and you can use derivatives without an ORM or persistence mechanism shrine knows about at all, so, okay. But possibly it would be useful to have the docs point out that the derivatives API doesn’t persist anything to db (although it does persist files to storage!)

The File Processing guide is also really to be recommended to wrap your head around derivatives, and does have some discussion of persistence, at least in the context of backgrounding.

But the bigger issue is when I moved on to handling persistence, I found that it’s actually kind of tricky to handle it safely and reliably, for general use. The two main categories of issues are:

  1. avoiding “orphaned” files in storage, that aren’t referenced by any persisted model, and will just sit around taking up space
  2. Concurrency safety, if you ever do any derivative mutation after initial promotion

I think they’re both a bit trickier than they initially appear. I will discuss both in the context of activerecord; sequel should be very similar.

Orphaned derivative files

Let’s first ignore concurrency, in fact let’s look at creation of a brand new model, in an uploader that doesn’t use backgrounding at all. The naive approach to persistence might be to realize that create_derivatives doesn’t do it, so just add a ‘save’ on the end

user = User.create!(username: 'jrochkind', image: File.open(something))

user.image_attacher.create_derivatives
user.save!

I think there is some risk of orphaned derivative files in storage here. What happens if the user fails to save? Maybe it was a validation error, that raised an exception. Or maybe the database itself refused to save because of a constraint violation; or, maybe there was a network error or other database outage resulting in some other exception.

So maybe you log this error, maybe report this error to the user, maybe you schedule the bg job to be run again if it was a bg job… but do you realize some/all of the derivative files may still be sitting on storage? Unreferenced by any model? In default configuration, I think this will even ordinarily be the store storage (rather than cache which you might expect orphaned files to be already). Taking up space, maybe being billed for the storage on a cloud provider. It’s pretty tricky to discover/identify orphaned files if you did accidentally create them.

The best way I could find to avoid that unfortunately involves splitting out the nice create_derivatives call into it’s components, so we can be sure we’re keeping track of the files added (no matter when the exception happened). As long as we’re at it, I threw in making sure to clean up the temporary local files too – they are probably TempFiles and will probably get cleaned up even an exception happened before shrine could, but if something really bad happens we don’t want our file system filling up.

user = User.create!(username: 'jrochkind', image: File.open(something))
begin
  local_files = user.image_attacher.process_derivatives
  new_derivatives = user.image_attacher.upload_derivatives(local_files)
  user.image_attacher.merge_derivatives(new_derivatives)
  user.save!
rescue StandardError => e
  # delete any leftover files and re-raise
  # yes delete_derivatives works on the local files returned by process_derivatives too,
  # and handles the fact that they might be nested hashes/arrays
  user.image_attacher.delete_derivatives(local_files) if local_files
  user.image_attacher.delete_derivatives(new_derivatives) if new_derivatives

  raise e
end

Is there a better/simpler way to reliably ensure there are no orphaned derivatives on storage, despite any unexpected errors?

Concurrency

atomic_persist is suggested for modifying derivatives in a concurrency safe way – like, what if someone else is also modifying different derivatives on the same model concurrently, you want them both to get in there. The File Processing guide includes an example:

   attacher.create_derivatives(name: derivative_name)
    attacher.atomic_persist do |reloaded_attacher|
      # make sure we don't override derivatives created in other jobs 
      attacher.merge_derivatives(reloaded_attacher.derivatives)
    end

This works for the example in the File Processing guide, where you have concurrent processes each adding a different set of new derivatives, starting from an empty slate of no derivatives.

But what if you’re not starting from an empty slate? Derivatives already exist, and you might have concurrent processes replacing them, or removing them in addition to adding brand new derivatives. Maybe you realized some derivatives are corrupt and have to be re-created, and you want to use concurrency to do so.

The problem is that since all derivatives are written as a single JSON, if some other process tried to replace derivative A with A’, and your process is trying to add B, process b adding B may also set A’ back to A. Or “undelete” it if someone else deleted it – annoying if you were running a batch process to delete all B’s, and you thought you deleted them all, but then process A accidentally ressurected it from the dead, and you may never notice it’s there taking up space.

May not happen very often, but derivative management does happen, and don’t we want a way to handle this completely reliably?

I don’t think the code given in File Processing will do it; but we do have the right primitives available to do it. Here’s what I came up with (and wrote some tests for), again finding it easier to think through exploding create_derivatives into it’s components.

local_files = user.image_attacher.process_derivatives
new_derivatives = user.image_attacher.upload_derivatives(local_files, **options)

user.image_attacher.atomic_persist do |reloaded_attacher|
  # start with the up to the minute current derivatives hash from the db, in case
  # it changed after we loaded it, then add in our new derivatives
  set_derivatives(reloaded_attacher.derivatives)
  merge_derivatives(new_derivatives)
end

Ah, but do we have orphaned derivative issues here too? What if the atomic_persist fails, do we need to avoid the orphaned derivatives we created but never saved? Yes – and this actually applies to the example from File Processing Guide too, oops.

And, in a world where you might be replacing already existing derivaties (which add_derivatives and create_derivatives do just fine, before considering persistance or orphaned files) – you have to make sure to clean up any replaced derivative files from storage, as discussed in https://github.com/shrinerb/shrine/issues/468

Putting it all together

So I really wanted some generically reliable routines that that would let me mutate derivatives in a way that was concurrency-safe (including for changes and deletes not just add new), and reliably wouldn’t leave orphaned files around.

The good news is – I think I was able to come up with some! I believe they are reliable! They aren’t very many lines of code – shrine gave me the right primitives to do it, which is actually really incredible and awesome.

The less good news is it took me, a fairly experienced developer (including with shrine) quite a number of hours to come up with them.

This makes me wonder if shrine needs more support for safely mutating derivatives – either higher-level API, or docs, or both.

I encapsulated what I came up with into a shrine plugin, in fact. It’s not totally done yet (and still not merged to my own project master branch), but you can see the methods as they are now, plus some tests that try to test concurrency safety too:

I would certainly be willing to contribute these to shrine… but I also recognize they are kind of tricky, and certainly could have bugs or design flaws, and you may not want to take on the maintainance burden of them.

For myself though, while it took a while, I am pleased that shrine seemed to support this fine, and I currently believe I came up with some general purpose reliable routines as above.