S3 - Multipart upload with fog
16 Jan 2014

I needed to understand multipart uploads with fog, and how to keep the ETag MD5sum the same whether a file reaches S3 as a multipart upload or as a single stream. Frankly, it is embarrassing that Amazon does not make the ETag behave the same way for both. I understand that the two take different routes to S3, but in the end one file is created, so one ETag algorithm should be used. For a multipart upload, S3 instead sets the ETag to the MD5 of the concatenated binary MD5 digests of the parts, with a dash and the part count appended, rather than the MD5 of the whole file.
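For reference, here is a minimal sketch of how that multipart ETag can be reproduced locally. The multipart_etag helper is my own name, and the 10 MB part size is only an assumption to mirror the split size used in the upload code below (split -C does not produce exactly even parts, so treat this as an illustration of the algorithm rather than an exact check).

require 'digest/md5'

# Sketch: S3's multipart ETag is the hex MD5 of the concatenated binary MD5
# digests of the parts, followed by "-<number of parts>".
def multipart_etag(path, part_size = 10 * 1024 * 1024)
  part_digests = []
  File.open(path, 'rb') do |io|
    while (chunk = io.read(part_size))
      part_digests << Digest::MD5.digest(chunk)
    end
  end
  "#{Digest::MD5.hexdigest(part_digests.join)}-#{part_digests.size}"
end

puts multipart_etag(file)   # compare against the dash-suffixed ETag S3 reports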
The basic idea is to do the multipart upload and then overwrite the file with itself; S3 will then report the proper md5sum as the ETag value. Below is the code to do the upload and then copy the file onto itself to normalize the ETag.
require 'fog'
require 'digest/md5'
require 'filemagic'
require 'tmpdir' # needed for Dir.mktmpdir below

# access_key, secret_key, file and access (the x-amz-acl value) are supplied by the caller.
connection = Fog::Storage.new({ :provider => "AWS", :aws_access_key_id => access_key, :aws_secret_access_key => secret_key })
file = File.absolute_path(file)
file_name = "myfiles/#{File.basename(file)}"
file_md5sum = Digest::MD5.file(file).to_s
# Abort any stale multipart uploads already in progress for this key.
multipart_uploads = connection.list_multipart_uploads("myS3bucket")
multipart_uploads.data[:body]["Upload"].each do |part|
  if part["Key"] == file_name
    abort_result = connection.abort_multipart_upload("myS3bucket", file_name, part["UploadId"])
  end
end
content_type = FileMagic.new(FileMagic::MAGIC_MIME).file(file).split(';').first

# Split the file into 10 MB pieces and record each piece's MD5 (base64, as the
# Content-MD5 header expects) keyed by the piece's path.
file_temporary_directory = Dir.mktmpdir
file_split_result = system("/usr/bin/split -C 10M -a 4 -d #{file} #{file_temporary_directory}/")
file_md5_parts = {}
file_parts = Dir.glob("#{file_temporary_directory}/*").sort
file_parts.each do |file_part|
  file_md5_parts[file_part] = Digest::MD5.file(file_part).base64digest
end
# Start the multipart upload; S3 returns an UploadId that ties the parts together.
s3_multipart = connection.initiate_multipart_upload("myS3bucket", file_name, { 'x-amz-acl' => access, 'Content-MD5' => file_md5sum, 'Content-Type' => content_type })
s3_upload_id = s3_multipart.body["UploadId"]
md5_indices = []
threads = []

# Upload each piece on its own thread, remembering the ETag S3 returns for each
# part number so the upload can be completed in order.
file_md5_parts.each_with_index do |file_md5_part, index|
  file_part_number = index + 1
  connection.reload # reset the underlying connection before spawning the next upload thread
  threads << Thread.new(file_md5_part, file_part_number) do |md5_part, part_number|
    File.open(md5_part[0], 'rb') do |part|
      response = connection.upload_part("myS3bucket", file_name, s3_upload_id, part_number, part, { 'Content-MD5' => md5_part[1] })
      md5_indices[part_number - 1] = response.headers['ETag']
    end
  end
end
# Joining a thread re-raises anything it failed with, so a failed part upload aborts here.
threads.each do |t|
  begin
    t.join
  rescue Exception => e
    puts "failed upload: #{e.message}"
    exit 1
  end
end
connection.reload

# Stitch the parts back together; the ETags must be passed in part-number order.
# Bail out on any failure here, since completed_upload_result would otherwise be nil below.
begin
  completed_upload_result = connection.complete_multipart_upload("myS3bucket", file_name, s3_upload_id, md5_indices)
rescue => error
  puts "failed to complete upload: #{error.message}"
  exit 1
end
if completed_upload_result.status == 200
  # Copy the object onto itself so S3 recalculates the ETag as a plain MD5 of the
  # whole file, then confirm it matches the local checksum.
  copy_result = connection.copy_object("myS3bucket", file_name, "myS3bucket", file_name, { 'x-amz-metadata-directive' => 'REPLACE' })
  if copy_result.status == 200
    exit 1 if copy_result.data[:body]["ETag"] != file_md5sum
  end
end
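If you want to double-check the normalized ETag later without downloading anything, a HEAD request is enough. This is just a sketch on top of the same fog connection; note that the raw ETag header comes back wrapped in quotes, so they are stripped before comparing.

# Sketch: verify the normalized ETag with a HEAD request.
head_result = connection.head_object("myS3bucket", file_name)
remote_etag = head_result.headers['ETag'].to_s.delete('"')
puts(remote_etag == file_md5sum ? "ETag matches local MD5" : "ETag mismatch")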
I have pieced this together from example code found in multiple places on the web, mostly by searching for 'multipart uploads fog':
http://blog.vicecity.co.uk/posts/168491-multipart-uploads-fog-threads-win
http://baldowl.github.io/2011/02/18/multipart-uploads-with-fog.html