
[FEA] Improve performance of from_json #10301

Closed · andygrove opened this issue Jan 26, 2024 · 7 comments · Fixed by #10306
Assignees: andygrove
Labels: performance (A performance related task/issue)

Comments

andygrove (Contributor) commented Jan 26, 2024:

Is your feature request related to a problem? Please describe.
Using the following benchmark, I see that the performance of from_json on GPU can be 4x slower than native Spark on CPU.

Generate JSON Data

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

val t1 = spark.read.parquet("/mnt/bigdata/tpcds/sf10-parquet/web_sales.parquet")
// Serialize all columns of each row into a single JSON string column
val df = t1.select(to_json(struct(t1.columns.map(col): _*)).alias("my_json"))
spark.time(df.write.mode(SaveMode.Overwrite).parquet("temp.parquet"))

from_json benchmark

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

val t1 = spark.read.parquet("/mnt/bigdata/tpcds/sf10-parquet/web_sales.parquet")
val t2 = spark.read.parquet("temp.parquet").repartition(16)
val df = t2.select(from_json(col("my_json"), t1.schema))

// Time the write with JsonToStructs running on the GPU
spark.conf.set("spark.rapids.sql.expression.JsonToStructs", true)
spark.time(df.write.mode(SaveMode.Overwrite).parquet("temp2.parquet"))

// Time the same write with JsonToStructs falling back to the CPU
spark.conf.set("spark.rapids.sql.expression.JsonToStructs", false)
spark.time(df.write.mode(SaveMode.Overwrite).parquet("temp2.parquet"))
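
(Aside, not part of the original benchmark: the plugin's spark.rapids.sql.explain setting can be used to verify that JsonToStructs was actually replaced and ran on the GPU rather than silently falling back.)

// Log any operators that could not be placed on the GPU
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")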

Describe the solution you'd like
Improve the performance of from_json on GPU so that it is at least competitive with native Spark on CPU.

Describe alternatives you've considered

Additional context

andygrove added the "feature request" (New feature or request), "? - Needs Triage" (Need team to review and classify), and "performance" (A performance related task/issue) labels on Jan 26, 2024
andygrove self-assigned this on Jan 26, 2024

andygrove (Contributor, Author) commented:

Breakdown of time spent in each major stage:

PERF: cleanAndConcat=1376 millis
PERF: copyToHost=29 millis
PERF: readJSON=44 millis
PERF: cast=32 millis
PERF: postProcessing=17 millis

andygrove (Contributor, Author) commented:

The expensive call in cleanAndConcat is this:

// Append a trailing newline by concatenating the (single-row) joined
// document with a one-row "\n" column
val concat = withResource(joined) { _ =>
  withResource(ColumnVector.fromStrings("\n")) { newline =>
    ColumnVector.stringConcatenate(Array[ColumnView](joined, newline))
  }
}

revans2 (Collaborator) commented Jan 26, 2024:


Do we have a breakdown of cleanAndConcat? I see all kinds of things in there that we could do better, and it is really complicated too.

andygrove (Contributor, Author) commented:

@revans2 I just posted that breakdown as you were writing your comment.

andygrove (Contributor, Author) commented:

PERF: cleanAndConcat: strip: 14 millis
PERF: cleanAndConcat: isNullOrEmpty: 9 millis
PERF: cleanAndConcat: literalNull: 32 millis
PERF: cleanAndConcat: checkNewline: 7 millis
PERF: cleanAndConcat: joinStrings: 1 millis
PERF: cleanAndConcat: stringConcatenate: 1275 millis

andygrove (Contributor, Author) commented Jan 27, 2024:


I now understand the issue better.

We start with a string column containing all the JSON lines (after some cleanup that we have already performed). We then call joinStrings to produce a single-row string column whose one entry is the whole document, with the original rows separated by a newline character.

// Join all rows into one document, with rows separated by newlines
val joined = withResource(cudf.Scalar.fromString("\n")) { lineSep =>
  cleaned.joinStrings(lineSep, emptyRow)
}

This is very fast.

However, this does not append a final newline character to the document. That caused failures in some tests where the final JSON line was empty or invalid (producing the error "The input data didn't parse correctly and we read a different number of rows than was expected. Expected 512, but got 511"), so we have some code to add a final newline character:

val concat = withResource(joined) { _ =>
  withResource(ColumnVector.fromStrings("\n")) { newline =>
    ColumnVector.stringConcatenate(Array[ColumnView](joined, newline))
  }
}

This concatenation is the cause of the performance issue. If I remove it, the benchmark in this issue performs well on the GPU (at least slightly faster than the CPU).

I will experiment with other approaches here, and file an issue against cuDF with a simple repro if I can't find another solution.
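
One direction that might avoid the expensive whole-document concatenation (a rough sketch only, untested, and not necessarily what the eventual fix does): append the newline to every row of cleaned before joining, so the extra copy is spread across many small rows instead of one huge one, then join with an empty separator. This reuses the cleaned and emptyRow values from the snippets above and assumes null rows have already been handled upstream.

// Sketch: give every row its own trailing newline first...
val withNewlines = withResource(cudf.Scalar.fromString("\n")) { nl =>
  withResource(ColumnVector.fromScalar(nl, cleaned.getRowCount.toInt)) { nlCol =>
    ColumnVector.stringConcatenate(Array[ColumnView](cleaned, nlCol))
  }
}
// ...then join with an empty separator; the document already ends in "\n".
val concat = withResource(withNewlines) { wn =>
  withResource(cudf.Scalar.fromString("")) { sep =>
    wn.joinStrings(sep, emptyRow)
  }
}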

revans2 (Collaborator) commented Jan 29, 2024:


At a minimum we could write a kernel just for this. The original joinStrings is not that complicated:

https://github.com/rapidsai/cudf/blob/d5db68e018d720094489325d5547a7ed82f22b0d/cpp/src/strings/combine/join.cu

We could copy it and make it append the trailing separator:

https://github.com/rapidsai/cudf/blob/d5db68e018d720094489325d5547a7ed82f22b0d/cpp/src/strings/combine/join.cu#L65

Or we could try to put up a patch to cuDF that lets the caller select whether the trailing separator is appended.
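
If cuDF exposed such an option, the plugin-side call might look something like the following. This is purely hypothetical: the extra boolean parameter does not exist in cuDF's joinStrings today.

// Hypothetical joinStrings variant that also appends the separator
// after the last row, removing the need for a second full-document copy.
val concat = withResource(cudf.Scalar.fromString("\n")) { lineSep =>
  cleaned.joinStrings(lineSep, emptyRow, /* appendTrailingSeparator = */ true)
}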

sameerz removed the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels on Feb 2, 2024