-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Escape quotes and newlines when converting strings to json format in to_json #9612
Conversation
Signed-off-by: Andy Grove <[email protected]>
// this is a placeholder for implementing string escaping | ||
// https://github.com/NVIDIA/spark-rapids/issues/9514 | ||
cv | ||
withResource(cv) { _=> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original function did not close the input column, but this does. Are we sure this or the caller is doing the right thing here? Looks like there's an artificial incref occurring in the calling code that's confusing to follow. Seems like we do not want this method to close its input in practice, forcing callers to artifically incref to work around.
val chars = Seq("\\", "\"") | ||
val escaped = chars.map("\\" + _) | ||
val chars = Seq("\r", "\n", "\\", "\"") | ||
val escaped = Seq("\\r", "\\n", "\\\\", "\\\"") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: less hard-coding
val escaped = Seq("\\r", "\\n", "\\\\", "\\\"") | |
val escaped = chars.map(org.apache.commons.text.StringEscapeUtils.escapeJava) |
build |
.with_special_case('\'a\'') \ | ||
.with_special_case('\\\'a\\\''), | ||
pytest.param(StringGen('\u001a', nullable=True), marks=pytest.mark.xfail( | ||
reason='cuDF represents two-digit unicode characters in hex format such as \x1a')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a cudf issue to track this which we could link from here?
Maybe we can use something like this https://github.com/rapidsai/cudf/blob/f97e74f00b7a6bac37c9603def95a11b06cb013f/cpp/src/io/json/write_json.cu#L95-L108
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this is something that users will encounter - this is specific to escaping non-printable ASCII characters, and it would seem unusual to want to represent those in JSON. I wanted to note this issue to explain why we can't do a full StringGen in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gerashegalov I added a link to a follow-up issue #9705
build |
Closes #9514
This PR implements the
escapeJsonString
method which was previously just a placeholder.