Problem parsing JSON with UTF-8 content #593
We can either assume a particular encoding and document it, or we could add an …
From the command line, the user doesn't really have a choice, right? It's whatever encoding the terminal uses, so we can use the terminal's encoding to decode the JSON. For the file case, given that these are JSON files, could we automatically detect the common Unicode encodings (utf-8/utf-16(-le|be)/utf-32)? That should cover the common cases for linux/windows/mac. The RFC has a section on how to do this, though it doesn't look like anything in the …
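The RFC referred to here is RFC 4627, whose section 3 observes that JSON text begins with two ASCII characters, so the position of NUL bytes in the first four bytes identifies the encoding. A minimal sketch of that heuristic (plus BOM sniffing), with `detect_json_encoding` as a hypothetical helper name, not an actual aws-cli function:

```python
import codecs
import json

def detect_json_encoding(data):
    """Guess the Unicode encoding of raw JSON bytes.

    BOM sniffing plus the heuristic from RFC 4627 section 3: JSON
    text starts with two ASCII characters, so the placement of NUL
    bytes in the first four bytes reveals the encoding.
    """
    # Check BOMs first; the UTF-32 BOMs must be tested before UTF-16
    # because BOM_UTF16_LE is a prefix of BOM_UTF32_LE.
    for bom, name in ((codecs.BOM_UTF32_LE, 'utf-32'),
                      (codecs.BOM_UTF32_BE, 'utf-32'),
                      (codecs.BOM_UTF16_LE, 'utf-16'),
                      (codecs.BOM_UTF16_BE, 'utf-16'),
                      (codecs.BOM_UTF8, 'utf-8-sig')):
        if data.startswith(bom):
            return name
    if len(data) >= 4:
        if data[:3] == b'\x00\x00\x00':
            return 'utf-32-be'
        if data[1:4] == b'\x00\x00\x00':
            return 'utf-32-le'
    if len(data) >= 2:
        if data[0:1] == b'\x00':
            return 'utf-16-be'
        if data[1:2] == b'\x00':
            return 'utf-16-le'
    return 'utf-8'

# Round-trip the same document through each encoding:
doc = '{"name": "caf\xe9"}'
for enc in ('utf-8', 'utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be'):
    raw = doc.encode(enc)
    assert json.loads(raw.decode(detect_json_encoding(raw))) == {'name': 'caf\xe9'}
```

Note this only distinguishes the Unicode family of encodings; it cannot detect SJIS/MS932, which look like arbitrary non-NUL bytes and would fall through to the utf-8 branch.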
Auto-detecting sounds nice. I'm just wondering how hard we should try. For example, SJIS/MS932?
Auto-detecting would be great. I didn't realize it was that straightforward.
I think we could just autodetect/support utf-8/16/32 only. Given that the json spec says JSON should be encoded in unicode and calls out those three encodings, it might be a reasonable compromise to ask people to use one of those encodings (so assume utf-8/16/32). I checked on a windows instance, and using notepad to save a file with unicode chars using the "Unicode" encoding in the dropdown uses utf-16 (with a BOM), so even if we just supported utf-8/utf-16 that would accommodate windows/linux/mac users.
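The BOM that Notepad writes actually makes the file case easier: Python's built-in `'utf-16'` codec consumes the BOM and picks the byte order from it. A small sketch simulating such a file in memory:

```python
import codecs
import json

# Simulate what Notepad's "Unicode" option produces: UTF-16 LE with a BOM.
data = codecs.BOM_UTF16_LE + '{"tag": "caf\xe9"}'.encode('utf-16-le')

# The 'utf-16' codec reads the BOM to determine endianness, so one
# decode call handles files saved on either byte order.
assert json.loads(data.decode('utf-16')) == {'tag': 'caf\xe9'}
```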
Well, the OP on the ticket was using a variety of encodings for Japanese characters (e.g. SJIS/MS932), which would probably involve utf-32, although I'm certainly no expert on that.
* release-0.31.0: (22 commits)
  - Bumping version to 0.31.0
  - Remove debug logging message.
  - Fix reference to no_auth.
  - Allow for operations within a service to override the signature_version. Fixes #206. Supercedes #208
  - Fix setting socket timeout in py3
  - Add response parsing tests for S3 GetBucketLocation
  - Expose output parameters matching root XML node, fix GetBucketLocation
  - Use unittest2 on python2.6
  - Detect incomplete reads (content length mismatch)
  - Simplifying code and fixing test to use unicode constant.
  - Fixing an issue that came up while fixing aws/aws-cli#593.
  - Fixing an issue that came up while fixing aws/aws-cli#593.
  - Fix elastictranscoder service
  - Add default param to get_config_variable
  - Add session config vars for metadata retry/timeouts
  - Add support for per session config vars
  - Rename get_variable to get_config_variable
  - Rename env vars to session vars
  - Move module vars into session class vars
  - Update elasticache model to the latest version
  - ...
Did more research on this. There are a few issues here. First, in the case of unicode chars specified on the command line (and not in JSON files), we should be decoding using whatever the encoding of the terminal is. We can use … (note the …).
For the case of files, on python3 it will automatically use the default encoding via ….
This means that we should already support utf-8 encoded JSON files referenced via file://. We do need to fix the case where args are specified on the command line by decoding via …
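For the file case, a minimal sketch of reading a `file://`-referenced JSON parameter with an explicit encoding rather than the platform default (which was ASCII in the failing setups above). `load_json_param` is a hypothetical helper name, not an actual aws-cli function:

```python
import json

def load_json_param(path, encoding='utf-8'):
    # Read raw bytes, then decode with an explicit encoding instead
    # of relying on the platform default.  This keeps behaviour
    # identical on python2 and python3.
    with open(path, 'rb') as f:
        raw = f.read()
    return json.loads(raw.decode(encoding))
```

Usage, with a throwaway file standing in for a `file://` argument:

```python
import os, tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write('{"Name": "caf\xe9"}'.encode('utf-8'))
assert load_json_param(path) == {'Name': 'caf\xe9'}
os.remove(path)
```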
In python2, sys.argv is a bytestring of whatever encoding is used by the terminal. In python3, sys.argv is a list of unicode strings. This causes problems because the rest of the code assumes unicode. The fix is to automatically decode to unicode based on sys.stdin as soon as we parse the args. This was originally reported in aws#593 and boto/botocore#218. I'll need to do more investigation to see if this problem applies to JSON files via file://; this commit only fixes the case where unicode is specified on the command line.
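A sketch of the fix described in this commit message, with `decode_args` as a hypothetical helper name; the actual change lives in the aws-cli/botocore commits referenced above:

```python
import sys

def decode_args(argv, encoding=None):
    # On python2, sys.argv items are byte strings in the terminal's
    # encoding; decode them to unicode up front so the rest of the
    # code can assume unicode.  On python3 the items are already str
    # and pass through untouched.
    enc = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return [a.decode(enc) if isinstance(a, bytes) else a for a in argv]

# Byte-string args (python2 style) and str args (python3 style) both work:
assert decode_args([b'caf\xc3\xa9', 'plain'], encoding='utf-8') == ['caf\xe9', 'plain']
```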
This is still a problem in Python 2 (2.7.6) on OSX / bash 4.2
@aes512 which part? Specifying JSON values on the command line or from a file?
Looks like this was fixed in v1.3.1. If you're still having issues, please show a … log
@jamesls - I can replicate this issue when an EC2 instance tag contains a unicode character (in my case a "ç"). The problem is triggered when piping the output into awk/grep/etc:
2014-03-18 11:04:27,114 - botocore.hooks - DEBUG - Event needs-retry.ec2.DescribeInstances: calling handler <botocore.retryhandler.RetryHandler object at 0x1104b8cd0> 'ascii' codec can't encode character u'\xe7' in position 15: ordinal not in range(128)
Err, this actually doesn't really apply to JSON output, rather text! (An issue regardless, though, with botocore?)
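The piping failure above is a separate symptom from parsing: when stdout is a pipe rather than a terminal, Python 2 leaves `sys.stdout.encoding` unset and falls back to ASCII when printing unicode, which is exactly the `'ascii' codec can't encode character u'\xe7'` error in the log. A sketch of a workaround that pins the output encoding, with `wrap_stream` as a hypothetical helper name (in-memory streams stand in for the pipe):

```python
import io

def wrap_stream(binary_stream, encoding='utf-8'):
    # Force an explicit encoding on an output stream so it no longer
    # matters whether stdout is a terminal or a pipe.
    return io.TextIOWrapper(binary_stream, encoding=encoding)

# Reproduce the failure mode with an ASCII-pinned stream:
try:
    wrap_stream(io.BytesIO(), encoding='ascii').write('re\xe7u')
    raise AssertionError('expected a UnicodeEncodeError')
except UnicodeEncodeError:
    pass  # same error class as in the report above

# And the fix:
buf = io.BytesIO()
out = wrap_stream(buf)
out.write('re\xe7u')
out.flush()
assert buf.getvalue() == b're\xc3\xa7u'
```

Setting `PYTHONIOENCODING=utf-8` in the environment has a similar effect for the whole process without touching the code.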
I'm having the same issue when querying dynamodb for an attribute of an item which has non-ASCII characters:
aws --version: aws-cli/1.3.4 Python/2.7.6 Linux/2.6.18-164.el5 |
Since this issue is already marked as closed, opened a new issue: #853 |
I am trying to update a CloudFront distribution using the CLI command below, from PowerShell on Windows: aws cloudfront update-distribution --id E**G --if-match ER --distributi … I am getting the same issue. Can anybody help? It's urgent.
When parsing JSON values from the command line, either literal or from a file, we are using the default ASCII encoding. If you try to specify any non-ASCII characters you will get an error like this:
I think UTF-8 would be a more reasonable default. Also, we should document that so customers know what we are expecting.
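A minimal illustration of the proposed default change, contrasting the current ASCII decode with UTF-8 on the same input bytes:

```python
import json

raw = b'{"Name": "caf\xc3\xa9"}'   # UTF-8 bytes, as a terminal would send them

# Decoding with ASCII is what fails today:
try:
    raw.decode('ascii')
    raise AssertionError('expected a UnicodeDecodeError')
except UnicodeDecodeError:
    pass

# Defaulting to UTF-8 instead handles the same input cleanly
# (and is a strict superset of ASCII, so existing inputs still work):
assert json.loads(raw.decode('utf-8')) == {'Name': 'caf\xe9'}
```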