[FEA] Improve performance of from_json #10301
Comments
Breakdown of time spent in each major stage:
The expensive call in cleanAndConcat is this:
Do we have a breakdown on cleanAndConcat? Because I see all kinds of things that we could do better. And it is really complicated too.
@revans2 I just posted that as you were making your comment.
I understand better what the issue is now. We start with a string column containing all the JSON lines (after some cleanup that we have already performed). We then call

    val joined = withResource(cudf.Scalar.fromString("\n")) { lineSep =>
      cleaned.joinStrings(lineSep, emptyRow)
    }

This is very fast. However, it does not append a final newline character to the document, and this caused errors in some tests where the final JSON line was empty or invalid, so we then append a newline:

    val concat = withResource(joined) { _ =>
      withResource(ColumnVector.fromStrings("\n")) { newline =>
        ColumnVector.stringConcatenate(Array[ColumnView](joined, newline))
      }
    }

This is the cause of the performance issue. If I remove this part, then the benchmark in this issue performs well on the GPU (slightly faster than CPU at least). I will experiment with other approaches here, and file an issue against cuDF with a simple repro if I can't find another solution.
At a minimum we can write a kernel just for this. The original joinStrings is not that complicated. We could copy it and make it append the trailing separator. Or we could try to put up a patch to cuDF that lets us select whether it happens or not.
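To make the "append the trailing separator" idea concrete, here is a minimal CPU-only sketch in plain Scala. It is not cuDF code and not the plugin's implementation; the object and method names are hypothetical. It only illustrates that writing each row followed by the separator in a single pass produces the same document as joining first and concatenating a newline afterwards, which is the behavior a modified join kernel would need.

```scala
// CPU-only illustration of the idea; all names here are hypothetical
// and nothing in this block calls cuDF.
object JoinWithTrailingSeparator {

  /** Two-step approach from the discussion: join, then concatenate a final separator. */
  def joinThenAppend(rows: Seq[String], sep: String): String =
    rows.mkString(sep) + sep

  /**
   * Single pass: compute the output size once, then write every row
   * followed by the separator, so no second concatenation is needed.
   * Note: for an empty input this returns "" while joinThenAppend
   * returns the separator; the two agree for any non-empty input.
   */
  def joinWithTrailing(rows: Seq[String], sep: String): String = {
    val totalLen = rows.foldLeft(0)((acc, r) => acc + r.length + sep.length)
    val sb = new java.lang.StringBuilder(totalLen)
    rows.foreach { r =>
      sb.append(r).append(sep)
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    // The last row is empty, mirroring the problematic test case described above.
    val rows = Seq("""{"a": 1}""", """{"a": 2}""", "")
    assert(joinWithTrailing(rows, "\n") == joinThenAppend(rows, "\n"))
    assert(joinWithTrailing(rows, "\n").endsWith("\n"))
  }
}
```

On the GPU the same shape would apply: the output offsets are computed with the separator counted once per row rather than once per gap, so the trailing newline comes out of the join itself.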
Is your feature request related to a problem? Please describe.
Using the following benchmark, I see that the performance of from_json on GPU can be 4x slower than native Spark on CPU.
Generate JSON Data
from_json benchmark
Describe the solution you'd like
Improve performance.
Describe alternatives you've considered
Additional context