Problem parsing JSON with UTF-8 content #593

Closed
garnaat opened this issue Jan 14, 2014 · 15 comments

@garnaat
Contributor

garnaat commented Jan 14, 2014

When parsing JSON values from the command line, either literal or from a file, we are using the default ASCII encoding. If you try to specify any non-ASCII characters you will get an error like this:

'ascii' codec can't encode characters in position 0-2: ordinal not in range(128) 

I think UTF-8 would be a more reasonable default. Also, we should document that so customers know what we are expecting.

garnaat added a commit to garnaat/botocore that referenced this issue Jan 14, 2014
@garnaat
Contributor Author

garnaat commented Jan 14, 2014

We can either assume a particular encoding and document it or we could add an --encoding option that allows the user to specify what the encoding is.

@jamesls
Member

jamesls commented Jan 14, 2014

From the command line, the user doesn't really have a choice, right? It's whatever encoding the terminal uses, so we can use the terminal's encoding to decode the JSON.

For the file case, given that these are JSON files, could we automatically detect the common Unicode encodings (utf-8, utf-16-le/be, utf-32)? That should cover the common cases for Linux/Windows/Mac. The RFC has a section on how to do this, though it doesn't look like anything in the json module automatically uses it: http://tools.ietf.org/search/rfc4627#section-3
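A minimal sketch of the detection that section of the RFC describes, assuming the JSON text starts with two ASCII characters (true for a top-level object or array) and has no BOM. The names `detect_json_encoding` and `load_json_bytes` are hypothetical and not part of aws-cli or botocore:

```python
import json


def detect_json_encoding(data):
    """Guess the encoding of JSON bytes from the null-byte pattern of the
    first four octets (the table in RFC 4627, section 3)."""
    head = bytearray(data[:4])
    if len(head) < 4:
        # Assumption of this sketch: input too short to classify is UTF-8.
        return 'utf-8'
    nulls = tuple(b == 0 for b in head)
    patterns = {
        (True, True, True, False): 'utf-32-be',   # 00 00 00 xx
        (True, False, True, False): 'utf-16-be',  # 00 xx 00 xx
        (False, True, True, True): 'utf-32-le',   # xx 00 00 00
        (False, True, False, True): 'utf-16-le',  # xx 00 xx 00
    }
    # Anything else (xx xx xx xx) is treated as UTF-8.
    return patterns.get(nulls, 'utf-8')


def load_json_bytes(data):
    """Decode JSON bytes with the guessed encoding, then parse them."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

This only distinguishes the three Unicode encodings the RFC calls out; legacy encodings such as SJIS would still be read as UTF-8 and fail to decode.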

@garnaat
Contributor Author

garnaat commented Jan 14, 2014

Auto-detecting sounds nice. I'm just wondering how hard we should try. For example, SJIS/MS932?

@garnaat
Contributor Author

garnaat commented Jan 14, 2014

Auto-detecting would be great. I didn't realize it was that straightforward.

@jamesls
Member

jamesls commented Jan 14, 2014

I think we could just autodetect/support utf-8/16/32 only. Given that the JSON spec says JSON should be encoded in Unicode and calls out those three encodings, it might be a reasonable compromise to ask people to use one of those encodings (so assume utf-8/16/32).

I checked on a windows instance, and using notepad to save a file with unicode chars using the "Unicode" encoding in the dropdown uses utf-16 (with a BOM) so even if we just supported utf-8/utf-16 that would accommodate windows/linux/mac users.
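For files that do start with a BOM, like the Notepad case above, the encoding can be read straight off the first bytes. This is a rough sketch using the BOM constants from the standard `codecs` module; `encoding_from_bom` is a hypothetical helper, and falling back to UTF-8 when no BOM is present is an assumption:

```python
import codecs

# Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks, because
# BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)


def encoding_from_bom(data, default='utf-8'):
    """Return a codec name that consumes the BOM found at the start of
    data, or default when data does not start with a known BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default
```

The `utf-16`, `utf-32`, and `utf-8-sig` codecs strip the BOM while decoding, so the decoded text can be passed to `json.loads` as is.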

garnaat added a commit to garnaat/botocore that referenced this issue Jan 14, 2014
@garnaat
Contributor Author

garnaat commented Jan 14, 2014

Well, the OP on the ticket was using a variety of encodings for Japanese characters (e.g. SJIS/MS932) which would probably involve utf-32 although I'm certainly no expert on that.

jamesls added a commit to boto/botocore that referenced this issue Jan 23, 2014
* release-0.31.0: (22 commits)
  Bumping version to 0.31.0
  Remove debug logging message.
  Fix reference to no_auth.
  Allow for operations within a service to override the signature_version.  Fixes #206.  Supercedes #208
  Fix setting socket timeout in py3
  Add response parsing tests for S3 GetBucketLocation
  Expose output parameters matching root XML node, fix GetBucketLocation
  Use unittest2 on python2.6
  Detect incomplete reads (content length mismatch)
  Simplifying code and fixing test to use unicode constant.
  Fixing an issue that came up while fixing aws/aws-cli#593.
  Fixing an issue that came up while fixing aws/aws-cli#593.
  Fix elastictranscoder service
  Add default param to get_config_variable
  Add session config vars for metadata retry/timeouts
  Add support for per session config vars
  Rename get_variable to get_config_variable
  Rename env vars to session vars
  Move module vars into session class vars
  Update elasticache model to the latest version
  ...
@jamesls
Member

jamesls commented Feb 21, 2014

Did more research on this. There are a few issues here.

First, in the case of unicode chars specified on the command line (and not in JSON files) then we should be decoding using whatever the encoding of the terminal is. We can use sys.stdout.encoding to get this value. Although, py3 seems to handle this case already:

(note the Value is utf-8 encoded in py2):

$ python2 -c "import sys; print(sys.argv)" aws ec2 create-tags --resource i-12345 --tags Key=tagfoo,Value=✓ --debug
['-c', 'aws', 'ec2', 'create-tags', '--resource', 'i-12345', '--tags', 'Key=tagfoo,Value=\xe2\x9c\x93', '--debug']

$ python3 -c "import sys; print(sys.argv)" aws ec2 create-tags --resource i-12345 --tags Key=tagfoo,Value=✓ --debug
['-c', 'aws', 'ec2', 'create-tags', '--resource', 'i-12345', '--tags', 'Key=tagfoo,Value=✓', '--debug']

For the case of files, on python3 it will automatically use the default encoding via locale.getpreferredencoding(False). On python2, we read the file as a byte string, but json.loads defaults its encoding to utf-8 (http://docs.python.org/2/library/json.html#json.load):

# In python2
json.loads(bytes('"Foo\xe2\x9c\x93"'))
u'Foo\u2713'

This means that we should already support utf-8 encoded JSON files referenced via file://. We can, as suggested in this issue, attempt to also support utf-16/utf-32, but utf-8 we already support.
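To avoid depending on the locale's preferred encoding when reading a file, the encoding can be passed explicitly. A small sketch, where `load_json_file` is a hypothetical helper and not the function aws-cli uses for file:// arguments:

```python
import io
import json


def load_json_file(path, encoding='utf-8'):
    """Read a JSON file as text using an explicit encoding instead of
    whatever locale.getpreferredencoding() happens to return."""
    with io.open(path, encoding=encoding) as f:
        return json.load(f)
```

With a file containing the UTF-8 bytes from the example above, `"Foo\xe2\x9c\x93"`, this returns the same `u'Foo\u2713'` on both python2 and python3.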

We do need to fix the case where args are specified on the command line by decoding via sys.stdout.encoding on python2.
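Roughly, that decoding step could look like the sketch below. `decode_args` is a hypothetical name, and the UTF-8 fallback when the stream reports no encoding is an assumption, not necessarily what the eventual fix does:

```python
import sys


def decode_args(args, encoding=None):
    """Return args as text, decoding any byte strings.

    On python2 sys.argv holds byte strings; on python3 the items are
    already text and pass through unchanged.
    """
    if encoding is None:
        # sys.stdout.encoding can be None (for example when output is
        # piped on python2), so fall back to UTF-8. This fallback is an
        # assumption of the sketch.
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in args]
```

Called early, for example as `decode_args(sys.argv[1:])`, this lets the rest of the code assume text arguments.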

jamesls added a commit to jamesls/aws-cli that referenced this issue Feb 28, 2014
In python2, sys.argv is a bytestring of whatever encoding
is used by the terminal.  In python3, sys.argv is a list of unicode
strings.  This causes problems because the rest of the code assumes
unicode.

The fix is to automatically decode to unicode based on sys.stdin
as soon as we parse the args.

This was originally reported in aws#593, and
boto/botocore#218.

I'll need to do more investigation to see if this problem applies
to JSON files via file://, this commit only fixes the case where
unicode is specified on the command line.
@aes512

aes512 commented Mar 16, 2014

This is still a problem in Python 2 (2.7.6) on OSX / bash 4.2

@jamesls
Member

jamesls commented Mar 18, 2014

@aes512 which part? Specifying JSON values on the command line or from a file?

@jamesls
Member

jamesls commented Mar 18, 2014

Looks like this was fixed in v1.3.1. If you're still having issues, please show a --debug log and I'd be happy to take another look.

@jamesls jamesls closed this as completed Mar 18, 2014
@aes512

aes512 commented Mar 18, 2014

@jamesls - I can replicate this issue when an EC2 instance tag contains a unicode character (in my case a "ç"). The problem is triggered when piping the output into awk/grep/etc:

aws ec2 describe-instances --region us-west-1 --debug |  awk '/INSTANCES/ {print $9}'

2014-03-18 11:04:27,114 - botocore.hooks - DEBUG - Event needs-retry.ec2.DescribeInstances: calling handler <botocore.retryhandler.RetryHandler object at 0x1104b8cd0>
2014-03-18 11:04:27,115 - botocore.retryhandler - DEBUG - No retry needed.
2014-03-18 11:04:27,115 - botocore.hooks - DEBUG - Event after-call.ec2.DescribeInstances: calling handler <awscli.errorhandler.ErrorHandler object at 0x1100cd790>
2014-03-18 11:04:27,115 - awscli.errorhandler - DEBUG - HTTP Response Code: 200
m1.medium
2014-03-18 11:04:27,115 - awscli.clidriver - DEBUG - Exception caught in main()
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 188, in main
    return command_table[parsed_args.command](remaining, parsed_args)
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 332, in __call__
    return command_table[parsed_args.operation](remaining, parsed_globals)
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 441, in __call__
    self._operation_object, call_parameters, parsed_globals)
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 529, in invoke
    parsed_globals)
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 550, in _display_response
    formatter(operation, response)
  File "/Library/Python/2.7/site-packages/awscli/formatter.py", line 224, in __call__
    self._format_response(current, stream)
  File "/Library/Python/2.7/site-packages/awscli/formatter.py", line 243, in _format_response
    text.format_text(response, stream)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 17, in format_text
    _format_text(data, stream)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 30, in _format_text
    identifier=new_identifier)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 39, in _format_text
    scalar_keys=all_keys)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 30, in _format_text
    identifier=new_identifier)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 39, in _format_text
    scalar_keys=all_keys)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 30, in _format_text
    identifier=new_identifier)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 39, in _format_text
    scalar_keys=all_keys)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 26, in _format_text
    stream.write('\t'.join(scalars))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 15: ordinal not in range(128)
2014-03-18 11:04:27,117 - awscli.clidriver - DEBUG - Exiting with rc 255

'ascii' codec can't encode character u'\xe7' in position 15: ordinal not in range(128)
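The traceback ends in `stream.write(...)`, where text is written to a stream that falls back to ASCII once output is piped. One possible workaround, shown here only as a sketch, is to wrap the underlying byte stream in a writer with an explicit encoding. `get_text_writer` is a hypothetical helper, the UTF-8 fallback is an assumption, and this is not necessarily how aws-cli addressed the problem:

```python
import codecs
import locale


def get_text_writer(byte_stream, encoding=None):
    """Wrap a byte stream so that text written to it is encoded with an
    explicit encoding rather than an implicit ASCII default."""
    if encoding is None:
        # Assumption: prefer the locale's encoding, else UTF-8.
        encoding = locale.getpreferredencoding(False) or 'utf-8'
    return codecs.getwriter(encoding)(byte_stream)
```

On python3 the byte stream behind standard output is `sys.stdout.buffer`; on python2 `sys.stdout` itself accepts bytes.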

@aes512

aes512 commented Mar 18, 2014

Err, this actually doesn't really apply to JSON output, but rather to text output! (An issue regardless though, with botocore?)

@sbressler

I'm having the same issue when querying dynamodb for an attribute of an item which has non-ASCII characters:

% aws dynamodb query --table-name <table> --select "<key>" --attributes-to-get <attribute> --debug
...
2014-07-25 13:20:37,038 - awscli.errorhandler - DEBUG - HTTP Response Code: 400
2014-07-25 13:20:37,038 - awscli.clidriver - DEBUG - Exception caught in main()
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 1304: ordinal not in range(128)
2014-07-25 13:20:37,041 - awscli.clidriver - DEBUG - Exiting with rc 255

'ascii' codec can't encode character u'\u2026' in position 1304: ordinal not in range(128)

aws --version: aws-cli/1.3.4 Python/2.7.6 Linux/2.6.18-164.el5

@sbressler

Since this issue is already marked as closed, opened a new issue: #853

@Rakesh201180

I am trying to update a CloudFront distribution with the CLI command below, using PowerShell on Windows:

aws cloudfront update-distribution --id E**G --if-match ER --distribution-config file://test1.json

I am getting the same issue.

Can anybody help? It's urgent.

@kdaily kdaily added the unicode label Sep 2, 2020