S3 metadata values no longer support Unicode #478
Interestingly, Boto2 allows uploading files with Unicode metadata, but there's a bug that breaks generate_url with such objects: boto/boto#2556. Boto3 fixes the download half but breaks the upload half: I can upload an object with Unicode metadata with Boto2 but not with Boto3, and once it's uploaded to S3, I can generate_url for it with Boto3 but not with Boto2.
S3 metadata has to be ASCII. From http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html:
If you try to send non-ASCII data, it gets "double encoded", which is what you're seeing in the console. This will also break the Signature Version 4 signer, which is required for newer regions. I think our best option here is to error or warn when the user-defined metadata we're given contains non-ASCII characters.
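The suggested error-on-non-ASCII check could look something like this minimal sketch; the helper name and error message are illustrative, not boto3's actual parameter validation:

```python
def validate_metadata_ascii(metadata):
    """Reject user-defined S3 metadata values that aren't ASCII.

    Hypothetical helper illustrating the maintainer's suggestion;
    boto3's real validation lives elsewhere.
    """
    for key, value in metadata.items():
        try:
            value.encode("ascii")
        except UnicodeEncodeError:
            raise ValueError(
                f"S3 metadata value for {key!r} contains "
                f"non-ASCII characters: {value!r}"
            )


validate_metadata_ascii({"filename": "report.txt"})  # passes silently
```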
That just raises more questions.
That would certainly be a lot more helpful than what it does now.
To clarify your first question, the docs say they have to be ASCII when using REST, which is what all the AWS SDKs use now. The UTF-8 part is only possible when using SOAP, which we don't use.
- Turns out S3 metadata values can only be ASCII, so using that to store the filename was problematic. boto/boto3#478
- All metadata values are thus percent encoded (strictly, without the + for space replacement) on read and write to S3. I'm going to run a manual migration to percent encode all the existing notebook filenames to prevent annoying long-term inconsistencies.
- Content-Disposition can also apparently only be ASCII by default without some funky encoding fun. We do the funky encoding fun.
- Decided to just use quote instead of quote_plus everywhere.

Fixes #35
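The percent-encoding workaround described above can be sketched with the standard library; the helper names are illustrative, not from the referenced codebase:

```python
from urllib.parse import quote, unquote


def encode_metadata_value(value):
    # quote() rather than quote_plus(), per the commit message:
    # spaces become %20, never '+'
    return quote(value, safe="")


def decode_metadata_value(value):
    return unquote(value)


original = "notebook \U0001f4c8.ipynb"
stored = encode_metadata_value(original)
assert stored.isascii()  # now safe to store as S3 metadata
assert decode_metadata_value(stored) == original  # lossless round trip
```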
@jamesls why does https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html say that you can upload UTF-8 in REST and that Unicode values get RFC-2047 encoded/decoded?
The examples provided are using the REST API. Is this a Boto bug where it's forgetting to apply RFC-2047 transparently to custom metadata?
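For reference, the RFC 2047 "encoded-word" mechanism the docs mention can be demonstrated with Python's standard library; whether S3 itself applies it transparently to custom metadata is exactly the open question here:

```python
from email.header import Header, decode_header

# RFC 2047 encoded-word form of U+1F4C8 (base64 of its UTF-8 bytes),
# e.g. =?utf-8?b?8J+TiA==?=
encoded = Header("\U0001f4c8", charset="utf-8").encode()

# Decoding recovers the original Unicode value
raw, charset = decode_header(encoded)[0]
assert raw.decode(charset) == "\U0001f4c8"
```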
In the S3 documentation, I don't see anywhere that defines exactly what user-defined metadata are, but it says:
Only text has a UTF-8 encoding, and thus, I conclude that values are Unicode strings. That matches how I've been using them with Boto2 so far.
In Boto2, this was supported. I could do this:
and in the S3 Management Console, it appears as key "x-amz-meta-foo", value "%F0%9F%93%88" (the URI encoding of U+1F4C8). It's a little funny that the S3 console is re-encoding this in a different way, but the S3 console is pretty bare-bones, and the re-encoding confirms that everything upstream recognizes that it's a Unicode string.
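The console's re-encoding is just plain percent encoding of the value's UTF-8 bytes, which can be checked with the standard library:

```python
from urllib.parse import quote

# U+1F4C8 is four bytes in UTF-8
assert "\U0001f4c8".encode("utf-8") == b"\xf0\x9f\x93\x88"

# The S3 console's display matches percent encoding of those bytes
assert quote("\U0001f4c8") == "%F0%9F%93%88"
```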
In Boto3, this doesn't work. When I try to do:
I get the remarkably unhelpful error message:
I looked around for anything that might suggest this changed from Boto2 to Boto3. The Boto3 documentation for put_object says it's of type:
and makes no mention of Unicode/ASCII limitations. (Elsewhere in the same function call, `b'bytes'` is referred to, but Metadata isn't a `bytes`.)
I tried calling `.encode('utf-8')` on my string before passing it to `Metadata=`, but this doesn't work, either. I get an exception that ends with:
As far as I can tell, this is a bug in Boto3. With Boto2, I was able to pass metadata `{'foo': u'\U0001f4c8'}` when putting an object in S3, and with Boto3, I'm not.