Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/codec: add unquote function for json doc #1948

Merged
merged 11 commits into from
Jun 22, 2017

Conversation

hhkbp2
Copy link
Contributor

@hhkbp2 hhkbp2 commented Jun 20, 2017

Hi,

This PR adds unquote function for json document type. It's one part of the PR #1940, rewritten to include
pingcap/tidb#3504
pingcap/tidb#3507

PTAL @AndreMouche @lishihai9017 @andelf @siddontang

@hhkbp2
Copy link
Contributor Author

hhkbp2 commented Jun 20, 2017

PTAL @hicqu

pub fn unquote(&self) -> Result<String> {
match *self {
Json::String(ref s) => unquote_string(s),
_ => Ok(format!("{:?}", self)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self.to_string()

};
match c {
'"' => ret.push('"'),
'b' => ret.push('\x08'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it more easy to understand if we make \x08 to a constant?

let utf8 = try!(decode_escaped_unicode(&unicode));
ret.push(utf8);
}
_ => ret.push(c),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bug in TiDB. Confirmed with @hicqu already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in pingcap/tidb#3521.

// https://dev.mysql.com/doc/refman/5.7/en/json-modification-functions.html#
// json-unquote-character-escape-sequences
pub fn unquote_string(s: &str) -> Result<String> {
let mut ret = String::with_capacity(s.len());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@hhkbp2 hhkbp2 Jun 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use raw u8 but not utf8 char here, there is an issue involved if there is a utf8 char in Json::String(String). For example, the utf8 恜苏 contains binary 0x60 0x5C 0x82 0xCF. It should unquote to

mysql> SELECT JSON_UNQUOTE('"恜苏"');
+--------------------------------------+
| JSON_UNQUOTE('"恜苏"')               |
+--------------------------------------+
| 恜苏                                  |
+--------------------------------------+

If the raw u8 array(same as string in Golang) is used to implement the unquote, the binary

0x60 0x5C 0x82 0xCF

would be converted to

0x60 0x82 0xCF

because 0x5C 0x82 is \ R is unescaped to R.

So the utf8 input is broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hicqu PTAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In tidb I use a byte buffer to store utf8 encoded bytes. Call String() on that we can get a correct string, because go default string encoding format is utf8.

Copy link
Member

@AndreMouche AndreMouche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

@@ -0,0 +1,260 @@
// Copyright 2017 PingCAP, Inc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this file to the PR #1949 which is the implementation for extract?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

address comment.

@hhkbp2
Copy link
Contributor Author

hhkbp2 commented Jun 21, 2017

PTAL @AndreMouche @lishihai9017 @hicqu

Copy link
Member

@AndreMouche AndreMouche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @hicqu @lishihai9017 @andelf PTAL

if ch == '\\' {
let c = match chars.next() {
Some(c) => c,
None => return Err(box_err!("Missing a closing quotation mark in string")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

closing quotation mark seems misleading. since "\/bfnrt", "uXXXX are not something like )-closing.
Incomplete escape sequence may be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message is coped from TiDB impl.
@hicqu We are gonna to change it according to the comment.


const CHAR_BACKSPACE: char = '\x08';
const CHAR_HORIZONTAL_TAB: char = '\x09';
const CHAR_LINE_FEED: char = '\x0A';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LINEFEED

const CHAR_BACKSPACE: char = '\x08';
const CHAR_HORIZONTAL_TAB: char = '\x09';
const CHAR_LINE_FEED: char = '\x0A';
const CHAR_FORM_FEED: char = '\x0C';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FORMFEED

'n' => ret.push(CHAR_LINE_FEED),
'r' => ret.push(CHAR_CARRIAGE_RETURN),
't' => ret.push(CHAR_HORIZONTAL_TAB),
'\\' => ret.push('\\'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we accept '/' as an escape char?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref: http://www.json.org/
ref: mysql command line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This MySQL docs says it clearly what escape sequences are supported.
https://dev.mysql.com/doc/refman/5.7/en/json-modification-functions.html#function_json-unquote

@andelf
Copy link
Contributor

andelf commented Jun 21, 2017

mysql> SELECT JSON_UNQUOTE('"\\/"');
+-----------------------+
| JSON_UNQUOTE('"\\/"') |
+-----------------------+
| /                     |
+-----------------------+
1 row in set (0.00 sec)

shall we copy the undocumented part?

None => return Err(box_err!("Invalid unicode: {}", unicode)),
}
}
let utf8 = try!(decode_escaped_unicode(&unicode));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here we can't try!, because if it's \uBADF, we should raise the error instead of panic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, try!() will return an error but not panic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, my fault.

't' => ret.push(CHAR_HORIZONTAL_TAB),
'\\' => ret.push('\\'),
'u' => {
let mut unicode = String::with_capacity(ESCAPED_UNICODE_BYTES_SIZE);
Copy link
Contributor

@hicqu hicqu Jun 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we can use:

if let Ok(unicode) = str::from_utf8(chars.as_str().as_bytes()[0:4]) {
    // logic here.
} else {
    // error here.
}

for avoid using String, which will alloc something on heap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and it may fail, need length check?

Copy link
Contributor Author

@hhkbp2 hhkbp2 Jun 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function takes std::char as operation unit, so that to avoid manipulate raw bytes(which could be encoded in utf8) directly. Is there any way to get a 4 char slice out of chars, or construct a slice on stack?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If char boundary is break, the function will fail (length 4 check fails or from_utf8 fails), both are logically ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, check chars.as_str().as_bytes().len() >= 4 is also needed.

("\\\\", true, Some("\x5c")),
("\\u597d", true, Some("好")),
("0\\u597d0", true, Some("0好0")),
("[", true, Some("[")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a \\a here, and it should be unquoted as a, which is compatible with MySQL document, but different from its implement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mysql command line sucks.

mysql> SELECT JSON_UNQUOTE('"\\u597d"');
+---------------------------+
| JSON_UNQUOTE('"\\u597d"') |
+---------------------------+
| 好                        |
+---------------------------+
1 row in set (0.00 sec)

mysql> SELECT JSON_UNQUOTE('"\\/"');
+-----------------------+
| JSON_UNQUOTE('"\\/"') |
+-----------------------+
| /                     |
+-----------------------+
1 row in set (0.00 sec)

mysql> SELECT JSON_UNQUOTE('"\\,"');
ERROR 3141 (22032): Invalid JSON text in argument 1 to function json_unquote: "Invalid escape character in string." at position 2.
mysql> SELECT JSON_UNQUOTE('"\\a"');
ERROR 3141 (22032): Invalid JSON text in argument 1 to function json_unquote: "Invalid escape character in string." at position 2.
mysql>

@hhkbp2
Copy link
Contributor Author

hhkbp2 commented Jun 21, 2017

PTAL @hicqu @andelf

@andelf
Copy link
Contributor

andelf commented Jun 21, 2017

LGTM

@AndreMouche
Copy link
Member

@hicqu PTAL

@hicqu
Copy link
Contributor

hicqu commented Jun 22, 2017

LGTM.

@AndreMouche AndreMouche merged commit 40ab236 into master Jun 22, 2017
@AndreMouche AndreMouche deleted the hhkbp2/jsondoc-add-functions branch June 22, 2017 05:04
@sre-bot sre-bot added the contribution This PR is from a community contributor. label Dec 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contribution This PR is from a community contributor.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants