-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
util/codec: add unquote function for json doc #1948
Conversation
PTAL @hicqu |
pub fn unquote(&self) -> Result<String> { | ||
match *self { | ||
Json::String(ref s) => unquote_string(s), | ||
_ => Ok(format!("{:?}", self)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use self.to_string()
}; | ||
match c { | ||
'"' => ret.push('"'), | ||
'b' => ret.push('\x08'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it more easy to understand if we make \x08
to a constant?
let utf8 = try!(decode_escaped_unicode(&unicode)); | ||
ret.push(utf8); | ||
} | ||
_ => ret.push(c), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we should push a \\
according tidb
https://github.com/pingcap/tidb/blob/master/util/types/json/functions.go#L122 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bug in TiDB. Confirmed with @hicqu already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in pingcap/tidb#3521.
// https://dev.mysql.com/doc/refman/5.7/en/json-modification-functions.html# | ||
// json-unquote-character-escape-sequences | ||
pub fn unquote_string(s: &str) -> Result<String> { | ||
let mut ret = String::with_capacity(s.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we do not use Vec<u8>
just like tidb
do https://github.com/pingcap/tidb/blob/master/util/types/json/functions.go#L89?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use raw u8 but not utf8 char here, there is an issue involved if there is a utf8 char in Json::String(String)
. For example, the utf8 恜苏
contains binary 0x60 0x5C 0x82 0xCF
. It should unquote to
mysql> SELECT JSON_UNQUOTE('"恜苏"');
+--------------------------------------+
| JSON_UNQUOTE('"恜苏"') |
+--------------------------------------+
| 恜苏 |
+--------------------------------------+
If the raw u8 array(same as string in Golang) is used to implement the unquote, the binary
0x60 0x5C 0x82 0xCF
would be converted to
0x60 0x82 0xCF
because 0x5C 0x82
is \ R
is unescaped to R
.
So the utf8 input is broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hicqu PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In tidb I use a byte buffer to store utf8 encoded bytes. Call String()
on that we can get a correct string, because go default string encoding format is utf8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM
@@ -0,0 +1,260 @@ | |||
// Copyright 2017 PingCAP, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this file to the PR #1949 which is the implementation for extract
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
address comment.
PTAL @AndreMouche @lishihai9017 @hicqu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ch == '\\' { | ||
let c = match chars.next() { | ||
Some(c) => c, | ||
None => return Err(box_err!("Missing a closing quotation mark in string")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
closing quotation mark seems misleading. since "\/bfnrt"
, "uXXXX
are not something like )
-closing.
Incomplete escape sequence
may be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message is coped from TiDB impl.
@hicqu We are gonna to change it according to the comment.
|
||
const CHAR_BACKSPACE: char = '\x08'; | ||
const CHAR_HORIZONTAL_TAB: char = '\x09'; | ||
const CHAR_LINE_FEED: char = '\x0A'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LINEFEED
const CHAR_BACKSPACE: char = '\x08'; | ||
const CHAR_HORIZONTAL_TAB: char = '\x09'; | ||
const CHAR_LINE_FEED: char = '\x0A'; | ||
const CHAR_FORM_FEED: char = '\x0C'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FORMFEED
'n' => ret.push(CHAR_LINE_FEED), | ||
'r' => ret.push(CHAR_CARRIAGE_RETURN), | ||
't' => ret.push(CHAR_HORIZONTAL_TAB), | ||
'\\' => ret.push('\\'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will we accept '/' as an escape char?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref: http://www.json.org/
ref: mysql command line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This MySQL docs says it clearly what escape sequences are supported.
https://dev.mysql.com/doc/refman/5.7/en/json-modification-functions.html#function_json-unquote
shall we copy the undocumented part? |
None => return Err(box_err!("Invalid unicode: {}", unicode)), | ||
} | ||
} | ||
let utf8 = try!(decode_escaped_unicode(&unicode)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we can't try!
, because if it's \uBADF
, we should raise the error instead of panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, try!()
will return an error but not panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, my fault.
't' => ret.push(CHAR_HORIZONTAL_TAB), | ||
'\\' => ret.push('\\'), | ||
'u' => { | ||
let mut unicode = String::with_capacity(ESCAPED_UNICODE_BYTES_SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we can use:
if let Ok(unicode) = str::from_utf8(chars.as_str().as_bytes()[0:4]) {
// logic here.
} else {
// error here.
}
for avoid using String
, which will alloc something on heap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it may fail, need length check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function takes std::char
as operation unit, so that to avoid manipulate raw bytes(which could be encoded in utf8) directly. Is there any way to get a 4 char slice out of chars, or construct a slice on stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If char boundary is break, the function will fail (length 4 check fails or from_utf8 fails), both are logically ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, check chars.as_str().as_bytes().len() >= 4
is also needed.
("\\\\", true, Some("\x5c")), | ||
("\\u597d", true, Some("好")), | ||
("0\\u597d0", true, Some("0好0")), | ||
("[", true, Some("[")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a \\a
here, and it should be unquoted as a
, which is compatible with MySQL document, but different from its implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mysql command line sucks.
mysql> SELECT JSON_UNQUOTE('"\\u597d"');
+---------------------------+
| JSON_UNQUOTE('"\\u597d"') |
+---------------------------+
| 好 |
+---------------------------+
1 row in set (0.00 sec)
mysql> SELECT JSON_UNQUOTE('"\\/"');
+-----------------------+
| JSON_UNQUOTE('"\\/"') |
+-----------------------+
| / |
+-----------------------+
1 row in set (0.00 sec)
mysql> SELECT JSON_UNQUOTE('"\\,"');
ERROR 3141 (22032): Invalid JSON text in argument 1 to function json_unquote: "Invalid escape character in string." at position 2.
mysql> SELECT JSON_UNQUOTE('"\\a"');
ERROR 3141 (22032): Invalid JSON text in argument 1 to function json_unquote: "Invalid escape character in string." at position 2.
mysql>
LGTM |
@hicqu PTAL |
LGTM. |
Hi,
This PR adds unquote function for json document type. It's one part of the PR #1940, rewritten to include
pingcap/tidb#3504
pingcap/tidb#3507
PTAL @AndreMouche @lishihai9017 @andelf @siddontang