Force S3 to not guess mime type from file contents

I’m storing HTML fragments in an S3 bucket. They are entire pages containing only the contents of the body tag, without the outer html, head, and body elements, but they have a .html file extension. I recently discovered that if you request them from S3 directly, they are served with a text/plain MIME type, so the browser doesn’t interpret them properly. Is there a way to force the correct text/html MIME type to be set when the files are promoted to :store?

I found this blog post, which seemed to indicate that passing --no-mime-magic to s3cmd was the key to fixing this behavior, but that option isn’t accessible through the uploader as far as I can tell…

Thanks in advance,

Walter

As far as I can tell, Shrine is properly detecting the MIME type and storing it in the file_data hash:

{"id":"titles/74/Acton_0001.html","storage":"store","metadata":{"filename":"Acton_0001.html","size":859715,"mime_type":"text/html"}}

It’s just S3 that is missing the memo.

Walter

Hi Walter, I’m not sure why you’re experiencing this issue. When Shrine is uploading the file to S3, it sends the determined MIME type to S3 in the :content_type option, which S3 uses for the Content-Type response header when downloading the file. The example below demonstrates this:

require "shrine"
require "shrine/storage/s3"
require "stringio"
require "http"

s3_options = {
  access_key_id: "...",
  secret_access_key: "...",
  region: "...",
  bucket: "...",
}

Shrine.storages = {
  cache: Shrine::Storage::S3.new(**s3_options, prefix: "cache"),
  store: Shrine::Storage::S3.new(**s3_options),
}

Shrine.plugin :determine_mime_type, analyzer: :marcel

html_file = StringIO.new("<html><head></head><body></body></html>")

cached_file = Shrine.upload(html_file, :cache)
HTTP.get(cached_file.url).mime_type #=> "text/html"

stored_file = Shrine.upload(cached_file, :store)
HTTP.get(stored_file.url).mime_type #=> "text/html"
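In other words, the storage derives the S3 request parameters from the uploaded file’s metadata. A simplified sketch of that mapping (illustrative only, not Shrine’s actual implementation):

```ruby
# Simplified sketch: map Shrine-style metadata to S3 put parameters.
# The detected MIME type becomes the object's Content-Type, which S3
# echoes back in the Content-Type response header on download.
def s3_put_params(shrine_metadata)
  params = {}
  params[:content_type] = shrine_metadata["mime_type"] if shrine_metadata["mime_type"]
  params
end

s3_put_params({ "filename" => "Acton_0001.html", "mime_type" => "text/html" })
#=> {:content_type=>"text/html"}
```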

Would you be able to reproduce the issue in a self-contained example?

I can try. The file as stored on S3 is being served with a text/plain MIME type. The file_data hash in the database shows that the metadata analyzer did its work correctly and extracted the text/html type.

I just checked, and I’m using Shrine 3.3.0; do you think upgrading to the latest version would be worth trying? Are there any gotchas in that upgrade?

Thanks for your help,

Walter

Looking into the code where I assign the HTML (exported from Nokogiri, after a long and torturous conversion from XML), I wonder whether assigning to the metadata hash means that I am clipping off the other attributes that were there before the file was saved.

  def convert_xml
    return if original_type == 'Page'
    return unless xml?

    original.update_column(:xml_filename, File.basename(file_name, '.xml'))
    filename = "#{File.basename(file_name, '.xml')}.html"
    cx = XmlToTitle.new(file.download.read, original_id)
    new_file = original.sources.where(file_name: filename).first_or_initialize
    new_file.file_attacher.assign(StringIO.new(cx.to_html, 'rb'), metadata: { 'filename' => filename })
    new_file.save!
  end

…see the third-to-last line above. Should I also be adding the mime-type to that hash?

Walter

Metadata passed via the :metadata option is merged into the extracted metadata, so it shouldn’t clip off any of the other attributes (and we’ve seen that mime_type is correctly extracted).
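To illustrate the merge semantics with plain Ruby (a sketch of the behavior, not Shrine’s actual code): the custom metadata is layered over the extracted metadata, so keys you don’t pass are preserved:

```ruby
# Extracted metadata first, custom metadata on top.
extracted = { "filename" => "tmp.html", "size" => 859715, "mime_type" => "text/html" }
custom    = { "filename" => "Acton_0001.html" }

merged = extracted.merge(custom)
merged["mime_type"] #=> "text/html" (still present, not clipped)
merged["filename"]  #=> "Acton_0001.html" (overridden by the custom value)
```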

Using your test script, I got the same results as you did, naturally. But when I substituted a snippet of the HTML “fragment” I am working with for your well-formed-but-empty example, I got two different results: when the text started with a newline, as my script’s output does, I got back application/octet-stream; when I trimmed off that leading newline, I got back application/xml. It looks very much like S3 isn’t taking Shrine’s advice about the file’s MIME type, and is insisting on running its own mime-magic on it anyway, much as that blog post I linked to hinted. I don’t know where (or whether it’s possible) to add that --no-mime-magic flag, as it belongs to a totally different command-line tool:

s3cmd --no-mime-magic --access_key=$ACCESS_KEY --secret_key=$SECRET_KEY \
  --acl-public --delete-removed --delete-after --no-ssl --host=$s3host    \
  --host-bucket="%(bucket)s.$s3host" 

I really can’t upload these files as complete HTML documents with a head and body and everything, as they are meant to be included in a surrounding document as fragments. This was working until a few weeks ago, I could swear. I used to be able to visit those URLs and see the file styled as HTML, even though it was incomplete. So I’m guessing something changed on the S3 defaults side of things. I just wish there was a way to force it back to the previous behavior.
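An untested sketch, in case it helps: one way to make the declared type unambiguous on the Shrine side is to resolve it from the file extension whenever content sniffing comes back with a generic type. The force_by_extension helper and the extension table below are hypothetical, not part of Shrine:

```ruby
# Hypothetical helper: prefer the file extension over a generic sniffed
# type (assumption: .html fragments should always be text/html).
GENERIC_TYPES   = ["application/octet-stream", "application/xml", "text/plain"]
EXTENSION_TYPES = { ".html" => "text/html" }

def force_by_extension(filename, sniffed_type)
  ext_type = EXTENSION_TYPES[File.extname(filename.to_s).downcase]
  return ext_type if ext_type && GENERIC_TYPES.include?(sniffed_type)
  sniffed_type
end

force_by_extension("Acton_0001.html", "application/octet-stream") #=> "text/html"

# Wired into Shrine it might look something like this (untested sketch;
# assumes the io responds to #original_filename):
#
#   Shrine.plugin :determine_mime_type, analyzer: -> (io, analyzers) {
#     force_by_extension(io.original_filename, analyzers[:marcel].call(io))
#   }
```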

Thanks for your help,

Walter

Using your test script, I was able to work out that if I added an HTML comment containing a doctype and an html tag, S3 would recognize the file as HTML. So I’ve updated my script so that instead of stripping out the <body> tag, it replaces it with <!-- <doctype html><html> -->. That seems to be enough to fool the S3 mime-magic into accepting that a document fragment really is HTML. I’d still like to know whether there is any way to send that preference to disable mime-magic through Shrine’s S3 module.
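A minimal sketch of that substitution (the disguise_fragment name is hypothetical, and it assumes the fragment’s opening wrapper tag is literally <body>):

```ruby
# Swap the opening <body> tag for an HTML comment that carries the
# tokens content sniffers look for.
def disguise_fragment(html)
  html.sub("<body>", "<!-- <doctype html><html> -->")
end

disguise_fragment("<body><p>Hello</p>")
#=> "<!-- <doctype html><html> --><p>Hello</p>"
```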

Walter