Timeout when uploading larger files

In our app we are having trouble with uploads of files that are ~300 MB or larger.
The app runs in a Docker container on Rails 6 and Shrine 3.3, behind an nginx proxy. We are using direct uploads with the upload_endpoint plugin and Uppy. The Docker container has access to the file system of another machine, which is where uploads are stored (services like S3 are out of the picture for GDPR reasons).
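For context, the direct-upload endpoint is mounted roughly like this (simplified):

# config/routes.rb -- simplified
Rails.application.routes.draw do
  # Uppy's XHR Upload plugin POSTs the file here; the endpoint uploads
  # it to the :cache storage (requires the upload_endpoint plugin)
  mount VideoUploader.upload_endpoint(:cache) => "/videos/upload"
end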
Everything works perfectly for smaller files. However, when larger files are uploaded (meaning several hundred MB), you can see the progress bar reach 100%, which apparently means the xhr.upload.onprogress events are fired, but then nothing happens: Uppy's complete event (probably tied to xhr.onload) either takes a very long time (nearly two minutes) or never fires, and sometimes a 504 is thrown. Note that we have already raised Uppy's timeout setting.

Inspecting the nginx logs, we can see that when the upload process starts, nginx writes the upload to a temp file. In the cases where the upload eventually works, Rails' logs show that Shrine succeeds, e.g. you see something like this:

today at 23:09 Upload (21797ms) – {:storage=>:cache, :location=>"******", :io=>Shrine::RackFile, :upload_options=>{}, :uploader=>VideoUploader}

(this is for a file of 1 GB), but the upload time given there does not account for the waiting time between reaching 100% and Uppy's success event, which is much longer. So something (or rather, nothing) seems to happen between nginx finishing its temp file and Shrine uploading the file to cache. We are wondering where that time difference comes from and what is going on there.

In the cases where it does not work, we see something like this:

today at 09:37 nginx.1 | 2020/11/10 08:37:52 [warn] 99#99: *9761 upstream server temporarily disabled while reading response header from upstream, client: *******, server: localhost, request: "POST /videos/upload HTTP/1.1", upstream: "*********:3000/videos/upload", host: "localhost:3009", referrer: "******************"
today at 09:37 nginx.1 | 2020/11/10 08:37:52 [error] 99#99: *9761 upstream timed out (110: Operation timed out) while reading response header from upstream, client: **********, server: localhost, request: "POST /videos/upload HTTP/1.1", upstream: "**************:3000/videos/upload", host: "localhost:3009", referrer: "*****************"

One thing that could possibly help avoid the nginx timeouts would be to raise some of nginx's timeout settings (which ones?), but that would presumably still mean clients see this very long waiting period.

Do you have any insight into what is happening here? What could we do to improve the situation? Would it be possible to use something like the nginx-upload-module together with Shrine? Or are we on the completely wrong track, and the issue is caused by some other misconfiguration in our nginx/Docker setup? Any help is appreciated.

Thank you very much!

That is indeed strange. The only time I experienced an issue similar to this was when consulting on an app that used service workers. The direct upload often didn't work well, and removing the service workers fixed it (I also had to explicitly deregister them from my browser).

Do you have a worker timeout configured at the web server level? I'm wondering if the worker timed out before Rails was able to process the request, not leaving enough time for the Rails response to be written to the TCP socket. That could explain the error, because it looks as though nginx is waiting for a response that never comes. Do you see in the application logs that Rails received the requests and processed the responses normally?
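For reference, the kind of setting I mean would be something like this, depending on your app server (the values are just examples):

# config/puma.rb -- if you are running Puma in cluster mode
# seconds a worker may go without checking in with the master process
# before it is forcefully restarted (default is 60)
worker_timeout 60

# config/unicorn.rb -- if you are running Unicorn
# seconds a worker may spend on a single request before the master kills it
timeout 60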

I'm not familiar with nginx-upload-module, but it should be possible to write a Shrine plugin that integrates with it. However, I'm currently tied up with other open source projects, so my time doesn't allow it.

An alternative could be to host MinIO on one of your servers, which would act as your own private S3; then you could do direct uploads to it. But that would probably just move the problem to another place. Still, maybe there is something in the Rails stack that's not playing well with nginx. Just a thought.
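In case you want to try that, pointing Shrine's S3 storage at a MinIO instance would look roughly like this (endpoint, bucket and credentials are placeholders):

# config/initializers/shrine.rb -- requires the aws-sdk-s3 gem
require "shrine/storage/s3"

s3_options = {
  endpoint:          "http://minio.internal:9000", # your MinIO server
  bucket:            "uploads",
  access_key_id:     "<minio-access-key>",
  secret_access_key: "<minio-secret-key>",
  region:            "us-east-1",  # MinIO ignores this, but the SDK requires it
  force_path_style:  true,         # MinIO uses path-style URLs
}

Shrine.storages = {
  cache: Shrine::Storage::S3.new(prefix: "cache", **s3_options),
  store: Shrine::Storage::S3.new(prefix: "store", **s3_options),
}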

Thanks a lot for your input! We do not use service workers. After some more inspection, it seems that in our case the problem is a combination of multiple factors: a server with apparently only borderline enough RAM to handle the traffic we are experiencing now (due to corona, traffic has gone up a lot), timeout settings that were too conservative, and a rather naive workflow for our uploads, especially concerning metadata extraction. Due to the lack of RAM, things like synchronous metadata extraction with ffmpeg in Shrine upon direct upload seem to take far too long. So what we did was move metadata extraction to promotion (asynchronously, roughly as sketched below) and change the server timeouts. These changes alone shifted the point where problems occur from ~400 MB files to ~2 GB files, which is fine for us, as files of GB size are not our use case. More RAM will be added soon; maybe that will further improve the situation.
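Simplified, our uploader now looks something like this (a sketch; the real metadata fields and conditions differ a bit):

# app/uploaders/video_uploader.rb
class VideoUploader < Shrine
  plugin :add_metadata

  # only run the expensive ffmpeg analysis during promotion (action :store),
  # not during the direct upload to :cache
  add_metadata do |io, **options|
    next unless options[:action] == :store

    Shrine.with_file(io) do |file|
      movie = FFMPEG::Movie.new(file.path) # streamio-ffmpeg gem
      {
        "duration" => movie.duration,
        "width"    => movie.width,
        "height"   => movie.height,
      }
    end
  end
end

With the backgrounding plugin, promotion runs in a background job, so from the client's point of view this extraction is now asynchronous.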
By the way, there are some similar issues to be found around the internet, many of which are related to issue #1075 in Rack, which was (among other commits) fixed by this increase of the buffer size, which was later reverted in this commit. I wonder if increasing the buffer size would help for large files.

Huh, I remember reading about that Rack issue and being glad it was solved. I didn't know that change had been reverted.

Some time ago I was actually playing with the idea of adding PUT upload support to Shrine's upload_endpoint, to bypass the overhead of Rack's multipart parsing. A PUT upload means that the file content is the request body, so no parsing needs to happen; the request body is streamed directly into the Shrine storage.

However, I didn't proceed with implementing it because I thought POST multipart uploads were good enough. Maybe it's time to revisit that. FWIW, this is what I came up with but never finished. What's good is that Uppy already supports PUT uploads; you just need to pass method: 'put' when initializing the XHR Upload plugin.

Maybe you could test the performance difference: instead of doing a POST upload to Shrine's upload_endpoint, you could try doing a PUT upload to a custom controller action that uploads to Shrine's storage. I think the following should work, assuming your web server's Rack input stream is rewindable (most are):

# config/routes.rb
Rails.application.routes.draw do
  # ...
  put "/upload" => "uploads#create"
end

# app/controllers/uploads_controller.rb
class UploadsController < ApplicationController
  def create
    # the raw request body is the file content, so it can be handed to
    # Shrine directly, without any multipart parsing
    uploaded_file = Shrine.upload(request.body, :cache, action: :upload)

    render json: uploaded_file.data
  end
end

It seems that the change from Rack issue #1075 was only reverted for an older version of Rack, so in the current release the buffer size is already quite optimal. Sorry, I misread that.
We tried using PUT as you suggested, but we could not see any performance change (EDIT: that is incorrect, I mixed up some numbers. In fact it is even slower, by a factor of 3: in the example below, PUT needs 18 s while POST needs 6 s). Maybe we did something wrong; here you can see the log for a file of around 300 MB:

Started PUT "/upload_test" for 172.18.0.1 at 2020-11-21 12:27:26 +0000
Processing by UploadsController#create as */*
Parameters: {"name"=>"*********.mp4", "type"=>"video/mp4", "file"=>#<ActionDispatch::Http::UploadedFile:0x00007f22b855aa78 @tempfile=#<Tempfile:/tmp/RackMultipart20201121-1-x68x2.mp4>, @original_filename="******.mp4", @content_type="video/mp4", @headers="Content-Disposition: form-data; name=\"file\"; filename=\"***************.mp4\"\r\nContent-Type: video/mp4\r\n">}
Metadata (0ms) – {:storage=>:cache, :io=>Tempfile, :uploader=>Shrine}
Upload (18860ms) – {:storage=>:cache, :location=>"44638d971a78fbcb06f7760e4ab171c9", :io=>Tempfile, :upload_options=>{}, :uploader=>Shrine}

This is in development on a very fast machine. What you can see is that the tempfile still has 'RackMultipart' in its name, so maybe multipart parsing is still taking place.
Perhaps something else in our Rails stack does not add up and causes these long parsing times; we will look into that.
As I said, the files currently uploaded in our app are usually smaller than in the example above, so for the time being things seem to work okay-ish.