Problem parsing JSON with UTF-8 content #593

garnaat · 2014-01-14T15:04:42Z

When parsing JSON values from the command line, either literal or from a file, we are using the default ASCII encoding. If you try to specify any non-ASCII characters you will get an error like this:

'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I think UTF-8 would be a more reasonable default. Also, we should document that so customers know what we are expecting.
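
For illustration (this is not code from the CLI), the error above can be reproduced in isolation. The snippet is a minimal Python 3 sketch; the sample string is an arbitrary assumption, chosen only because it contains three non-ASCII characters.

```python
# Minimal sketch: encoding non-ASCII text with the ASCII codec fails,
# while UTF-8 can represent the same text.
text = u"\u65e5\u672c\u8a9e"  # arbitrary sample text (three non-ASCII characters)

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)

# The same text encodes cleanly as UTF-8 and round-trips back.
encoded = text.encode("utf-8")
print(encoded.decode("utf-8") == text)
```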


garnaat · 2014-01-14T17:19:22Z

We can either assume a particular encoding and document it or we could add an --encoding option that allows the user to specify what the encoding is.

jamesls · 2014-01-14T17:54:49Z

From the command line, the user doesn't really have a choice, right? It's whatever encoding the terminal uses, so we can use the terminal's encoding to decode the JSON.

For the file case, given that these are JSON files, could we automatically detect the common Unicode encodings (utf-8, utf-16(-le|be), utf-32)? That should cover the common cases for Linux/Windows/Mac. The RFC has a section on how to do this, though it doesn't look like anything in the json module uses it automatically: http://tools.ietf.org/search/rfc4627#section-3
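
As a rough sketch of the detection described in that RFC section (not something the CLI is known to implement), the pattern of null bytes in the first four octets can be mapped to an encoding. The helper names `detect_json_encoding` and `load_json_bytes` are hypothetical, and the sketch assumes the input has no BOM and starts with two ASCII characters, as the RFC expects of a JSON object or array.

```python
import json


def detect_json_encoding(data):
    """Guess the encoding of JSON bytes from the null-byte pattern of the
    first four octets (RFC 4627, section 3). Hypothetical helper."""
    head = bytearray(data[:4])
    if len(head) < 4:
        return "utf-8"  # assumption: too short to classify, fall back to UTF-8
    nulls = tuple(b == 0 for b in head)
    if nulls == (True, True, True, False):
        return "utf-32-be"   # 00 00 00 xx
    if nulls == (False, True, True, True):
        return "utf-32-le"   # xx 00 00 00
    if nulls == (True, False, True, False):
        return "utf-16-be"   # 00 xx 00 xx
    if nulls == (False, True, False, True):
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"           # xx xx xx xx


def load_json_bytes(data):
    """Decode JSON bytes with the detected encoding, then parse them."""
    return json.loads(data.decode(detect_json_encoding(data)))


sample = {"k": u"\u2713"}
raw = json.dumps(sample, ensure_ascii=False).encode("utf-16-le")
print(detect_json_encoding(raw))
```

Legacy encodings such as SJIS/MS932 are outside what this pattern check can identify; input in those encodings would be treated as UTF-8 and fail to decode.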

garnaat · 2014-01-14T17:58:13Z

Auto-detecting sounds nice. I'm just wondering how hard we should try. For example, SJIS/MS932?

garnaat · 2014-01-14T18:02:59Z

Auto-detecting would be great. I didn't realize it was that straightforward.

jamesls · 2014-01-14T18:06:58Z

I think we could just autodetect/support utf-8/16/32 only. Given that the JSON spec says JSON should be encoded in Unicode and calls out those three encodings, it might be a reasonable compromise to ask people to use one of those encodings (so assume utf-8/16/32).

I checked on a windows instance, and using notepad to save a file with unicode chars using the "Unicode" encoding in the dropdown uses utf-16 (with a BOM) so even if we just supported utf-8/utf-16 that would accommodate windows/linux/mac users.
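
Files saved with a BOM, like the Notepad case above, could be handled by checking for the BOM before decoding. The following is a minimal sketch under that assumption, not the CLI's implementation; `decode_with_bom` is a hypothetical helper name.

```python
import codecs
import json

# Longer BOMs are checked first because the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the encoding implied by a leading BOM, if any.
    Hypothetical helper; falls back to `default` when no BOM is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM and use it to pick the byte order.
            return data.decode(encoding)
    return data.decode(default)


# "utf-16" writes a BOM when encoding, similar to Notepad's "Unicode" option.
raw = u'{"name": "\u00e7"}'.encode("utf-16")
print(json.loads(decode_with_bom(raw)) == {"name": u"\u00e7"})
```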

garnaat · 2014-01-14T18:18:10Z

Well, the OP on the ticket was using a variety of encodings for Japanese characters (e.g. SJIS/MS932) which would probably involve utf-32 although I'm certainly no expert on that.

* release-0.31.0: (22 commits)
  Bumping version to 0.31.0
  Remove debug logging message.
  Fix reference to no_auth.
  Allow for operations within a service to override the signature_version. Fixes #206. Supercedes #208
  Fix setting socket timeout in py3
  Add response parsing tests for S3 GetBucketLocation
  Expose output parameters matching root XML node, fix GetBucketLocation
  Use unittest2 on python2.6
  Detect incomplete reads (content length mismatch)
  Simplifying code and fixing test to use unicode constant.
  Fixing an issue that came up while fixing aws/aws-cli#593.
  Fixing an issue that came up while fixing aws/aws-cli#593.
  Fix elastictranscoder service
  Add default param to get_config_variable
  Add session config vars for metadata retry/timeouts
  Add support for per session config vars
  Rename get_variable to get_config_variable
  Rename env vars to session vars
  Move module vars into session class vars
  Update elasticache model to the latest version
  ...

jamesls · 2014-02-21T05:55:20Z

Did more research on this. There are a few issues here.

First, in the case of unicode chars specified on the command line (and not in JSON files) then we should be decoding using whatever the encoding of the terminal is. We can use sys.stdout.encoding to get this value. Although, py3 seems to handle this case already:

(note the Value is utf-8 encoded in py2):

$ python2 -c "import sys; print(sys.argv)" aws ec2 create-tags --resource i-12345 --tags Key=tagfoo,Value=✓ --debug
['-c', 'aws', 'ec2', 'create-tags', '--resource', 'i-12345', '--tags', 'Key=tagfoo,Value=\xe2\x9c\x93', '--debug']

$ python3 -c "import sys; print(sys.argv)" aws ec2 create-tags --resource i-12345 --tags Key=tagfoo,Value=✓ --debug
['-c', 'aws', 'ec2', 'create-tags', '--resource', 'i-12345', '--tags', 'Key=tagfoo,Value=✓', '--debug']

For the case of files, on python3 it will automatically use the default encoding via locale.getpreferredencoding(False). On python2, we read the file as a byte string, but json.loads defaults its encoding to utf-8 (http://docs.python.org/2/library/json.html#json.load):

# In python2
>>> import json
>>> json.loads(bytes('"Foo\xe2\x9c\x93"'))
u'Foo\u2713'

This means that we should already support utf-8 encoded JSON files referenced via file://. We can, as suggested in this issue, attempt to also support utf-16/utf-32, but utf-8 we already support.
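
To make the file case independent of the interpreter and locale, the file can be opened with an explicit encoding. This is a minimal sketch of that idea, assuming UTF-8 input; `load_json_file` is a hypothetical helper and is not taken from the CLI source.

```python
import io
import json
import os
import tempfile


def load_json_file(path, encoding="utf-8"):
    """Hypothetical helper: read a JSON file with an explicit encoding
    instead of relying on locale.getpreferredencoding(False)."""
    with io.open(path, encoding=encoding) as f:
        return json.load(f)


# Write the UTF-8 bytes from the example above to a temporary file.
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
try:
    with io.open(path, "wb") as f:
        f.write(b'"Foo\xe2\x9c\x93"')
    value = load_json_file(path)
finally:
    os.remove(path)

print(repr(value))
```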

We do need to fix the case where args are specified on the command line by decoding via sys.stdout.encoding on python2.

In python2, sys.argv is a list of byte strings in whatever encoding is used by the terminal. In python3, sys.argv is a list of unicode strings. This causes problems because the rest of the code assumes unicode. The fix is to automatically decode to unicode based on sys.stdin as soon as we parse the args. This was originally reported in aws#593, and boto/botocore#218. I'll need to do more investigation to see if this problem applies to JSON files via file://; this commit only fixes the case where unicode is specified on the command line.
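
The decoding step described above can be sketched roughly as follows. This is an illustration rather than the committed code: `decode_args` is a hypothetical helper, and the fallback to UTF-8 when the stream reports no encoding is an assumption.

```python
import sys


def decode_args(args, stream=None):
    """Hypothetical helper: decode byte-string arguments to text using the
    terminal encoding reported by `stream` (sys.stdin by default).
    Assumption: fall back to UTF-8 when no encoding is reported."""
    stream = sys.stdin if stream is None else stream
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Arguments that are already text (as on python3) pass through unchanged.
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in args]


class FakeTerminal(object):
    """Stand-in for a terminal stream with a known encoding."""
    encoding = "utf-8"


decoded = decode_args([b"Key=tagfoo,Value=\xe2\x9c\x93", u"--debug"], FakeTerminal())
print(decoded == [u"Key=tagfoo,Value=\u2713", u"--debug"])
```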

aes512 · 2014-03-16T03:10:12Z

This is still a problem in Python 2 (2.7.6) on OSX / bash 4.2

jamesls · 2014-03-18T17:32:25Z

@aes512 which part? Specifying JSON values on the command line or from a file?

jamesls · 2014-03-18T17:35:25Z

Looks like this was fixed in v1.3.1. If you're still having issues, please show a --debug log and I'd be happy to take another look.

aes512 · 2014-03-18T18:09:37Z

@jamesls - I can replicate this issue when an EC2 instance tag contains a unicode character (in my case a "ç"). The problem is triggered when piping the output into awk/grep/etc:

aws ec2 describe-instances --region us-west-1 --debug | awk '/INSTANCES/ {print $9}'

2014-03-18 11:04:27,114 - botocore.hooks - DEBUG - Event needs-retry.ec2.DescribeInstances: calling handler <botocore.retryhandler.RetryHandler object at 0x1104b8cd0>
2014-03-18 11:04:27,115 - botocore.retryhandler - DEBUG - No retry needed.
2014-03-18 11:04:27,115 - botocore.hooks - DEBUG - Event after-call.ec2.DescribeInstances: calling handler <awscli.errorhandler.ErrorHandler object at 0x1100cd790>
2014-03-18 11:04:27,115 - awscli.errorhandler - DEBUG - HTTP Response Code: 200
m1.medium
2014-03-18 11:04:27,115 - awscli.clidriver - DEBUG - Exception caught in main()
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 188, in main
    return command_table[parsed_args.command](remaining, parsed_args)
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 332, in __call__
    return command_table[parsed_args.operation](remaining, parsed_globals)
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 441, in __call__
    self._operation_object, call_parameters, parsed_globals)
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 529, in invoke
    parsed_globals)
  File "/Library/Python/2.7/site-packages/awscli/clidriver.py", line 550, in _display_response
    formatter(operation, response)
  File "/Library/Python/2.7/site-packages/awscli/formatter.py", line 224, in __call__
    self._format_response(current, stream)
  File "/Library/Python/2.7/site-packages/awscli/formatter.py", line 243, in _format_response
    text.format_text(response, stream)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 17, in format_text
    _format_text(data, stream)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 30, in _format_text
    identifier=new_identifier)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 39, in _format_text
    scalar_keys=all_keys)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 30, in _format_text
    identifier=new_identifier)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 39, in _format_text
    scalar_keys=all_keys)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 30, in _format_text
    identifier=new_identifier)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 39, in _format_text
    scalar_keys=all_keys)
  File "/Library/Python/2.7/site-packages/awscli/text.py", line 26, in _format_text
    stream.write('\t'.join(scalars))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 15: ordinal not in range(128)
2014-03-18 11:04:27,117 - awscli.clidriver - DEBUG - Exiting with rc 255

'ascii' codec can't encode character u'\xe7' in position 15: ordinal not in range(128)

aes512 · 2014-03-18T18:11:38Z

Err, this actually doesn't really apply to JSON output, but rather to text output! (An issue regardless though, with botocore?)
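
For context, on Python 2 sys.stdout.encoding is None when output is piped, so writing unicode text falls back to the ASCII codec. One possible workaround, sketched here as an assumption and not as the fix the CLI adopted, is to wrap the byte stream in an explicit encoder. `text_writer` is a hypothetical helper, and an in-memory buffer stands in for the pipe.

```python
import codecs
import io
import sys


def text_writer(byte_stream, encoding=None):
    """Hypothetical helper: wrap a byte stream so text is encoded explicitly
    instead of through the implicit ASCII default.
    Assumption: use UTF-8 when no encoding is given or detected."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return codecs.getwriter(encoding)(byte_stream)


buf = io.BytesIO()  # stands in for a piped stdout
out = text_writer(buf, "utf-8")
# Same shape as the failing call in text.py: tab-joined scalars with a "ç".
out.write(u"\t".join([u"INSTANCES", u"fa\u00e7ade"]))
out.write(u"\n")
raw = buf.getvalue()
print(repr(raw))
```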

sbressler · 2014-07-25T20:23:53Z

I'm having the same issue when querying dynamodb for an attribute of an item which has non-ASCII characters:

% aws dynamodb query --table-name <table> --select "<key>" --attributes-to-get <attribute> --debug
...
2014-07-25 13:20:37,038 - awscli.errorhandler - DEBUG - HTTP Response Code: 400
2014-07-25 13:20:37,038 - awscli.clidriver - DEBUG - Exception caught in main()
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 1304: ordinal not in range(128)
2014-07-25 13:20:37,041 - awscli.clidriver - DEBUG - Exiting with rc 255

'ascii' codec can't encode character u'\u2026' in position 1304: ordinal not in range(128)

aws --version: aws-cli/1.3.4 Python/2.7.6 Linux/2.6.18-164.el5

sbressler · 2014-07-25T20:59:10Z

Since this issue is already marked as closed, opened a new issue: #853

Rakesh201180 · 2018-02-21T07:04:23Z

I am trying to update a CloudFront distribution with the CLI command below, using PowerShell on Windows:

aws cloudfront update-distribution --id E**G --if-match ER --distribution-config file://test1.json

I am getting the same issue.

Can anybody help? It's urgent.

garnaat added a commit to garnaat/botocore that referenced this issue Jan 14, 2014

Fixing an issue that came up while fixing aws/aws-cli#593.

c47f501

garnaat added a commit to garnaat/botocore that referenced this issue Jan 14, 2014

Fixing an issue that came up while fixing aws/aws-cli#593.

64aef7a

ainoya mentioned this issue Jan 23, 2014

fix: build_parameter_query can't decode utf-8 charcters in parameters.py boto/botocore#218

Closed

jamesls mentioned this issue Feb 28, 2014

Fix unicode argument processing for py2 #679

Merged

jamesls closed this as completed Mar 18, 2014

sbressler mentioned this issue Jul 25, 2014

Unicode characters cause dynamodb get-item to fail #853

Closed

jmonson mentioned this issue Oct 15, 2014

Dynamodb get-item error: Codec can't decode byte #950

Closed

kdaily added the unicode label Sep 2, 2020
