Initial entry point to data generation for scale test #9054

wjxiz1992 · 2023-08-16T10:24:38Z

As titled.

This PR provides the initial entry point to the data generation application for the scale test.
The design and user interface are described at #8813 (comment)

This is still a DRAFT, posted for early review and feedback.

One example command to test it locally:

$SPARK_HOME/bin/spark-submit \
--master spark://*:7077 \
--conf spark.driver.memory=10G \
--conf spark.executor.memory=32G \
--conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \
--conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED \
--class com.nvidia.rapids.tests.scaletest.ScaleTestDataGen \
--jars $SPARK_HOME/examples/jars/scopt_2.12-3.7.1.jar \
./target/datagen_2.12-23.10.0-SNAPSHOT-spark332.jar \
1 \
10 \
parquet \
file:/*/testdata

To give a rough idea of how much disk space the generated data takes, here are some examples.
For Scale=1, Complexity=1, Parquet format:

2.2M    a_facts
115M    b_data
28M     c_data
26M     d_data
282M    e_data
296K    f_facts
150M    g_data

For Scale=1, Complexity=10, Parquet format:

2.7M    a_facts
302M    b_data
21M     c_data
57M     d_data
295M    e_data
584K    f_facts
150M    g_data

For Scale=10, Complexity=10, Parquet format:

27M     a_facts
3.0G    b_data
295M    c_data
655M    d_data
2.9G    e_data
4.7M    f_facts
1.5G    g_data
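
As a rough check on how the data grows with the scale factor, the sizes above can be compared directly. The sketch below is not part of this PR. It assumes the listings came from `du -h` (so `K`, `M` and `G` are binary units), and the helper names `to_kib` and `growth` are made up for illustration. It computes the Scale=10 to Scale=1 size ratio for each table at Complexity=10.

```shell
# to_kib: convert a du-style size such as "2.7M" or "3.0G" to KiB.
# Assumes binary units (1M = 1024K, 1G = 1024M), as printed by `du -h`.
to_kib() {
  echo "$1" | awk '{
    unit = substr($0, length($0), 1)
    value = substr($0, 1, length($0) - 1) + 0
    if (unit == "K") mult = 1
    else if (unit == "M") mult = 1024
    else if (unit == "G") mult = 1024 * 1024
    else { print "unknown unit: " unit > "/dev/stderr"; exit 1 }
    printf "%.1f\n", value * mult
  }'
}

# growth: ratio of the second size to the first, to one decimal place.
growth() {
  awk -v small="$(to_kib "$1")" -v big="$(to_kib "$2")" \
    'BEGIN { printf "%.1f\n", big / small }'
}

# Columns: table, size at Scale=1, size at Scale=10 (both at Complexity=10),
# copied from the listings above.
while read -r table s1 s10; do
  printf '%-8s %sx\n' "$table" "$(growth "$s1" "$s10")"
done <<'EOF'
a_facts 2.7M 27M
b_data 302M 3.0G
c_data 21M 295M
d_data 57M 655M
e_data 295M 2.9G
f_facts 584K 4.7M
g_data 150M 1.5G
EOF
```

With these inputs most tables come out close to 10x, while c_data is about 14x and f_facts about 8.2x. The inputs are already rounded by `du -h`, so the ratios are approximate.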

Basic code structure
README doc for how to use it
Tests for data and scale queries

Signed-off-by: Allen Xu <[email protected]>

wjxiz1992

Resolved comments and added CorrelatedKeyGroup for key groups in tables.

revans2

Mostly looks good. Just a few nits.

wjxiz1992

Resolved more comments.

revans2

Thanks for your patience in doing the rework. I think we are really close now and it looks really good.

Thanks

wjxiz1992

Thanks for the review! Added a flag argument "--overwrite" and resolved the remaining comments.

revans2

ScaleTest.md appears to also have some TODOs in it. Are there plans to fix that?

wjxiz1992 · 2023-08-19T06:59:33Z

"ScaleTest.md appears to also have some TODOs in it. Are there plans to fix that?"

Yes. This PR only adds the data generation part; it does not include the queries that will run against the data.

I put a TODO in the Test Query Sets section to cover the upcoming work for #8814 and #8816, which will make this scale test a complete, integrated test suite.

revans2 · 2023-08-21T15:26:10Z

build

* Init entry point to data generation for scale test
  Signed-off-by: Allen Xu <[email protected]>
* add date range
  Signed-off-by: Allen Xu <[email protected]>
* add correlatedKeyGroup settings for key groups
  Signed-off-by: Allen Xu <[email protected]>
* refine NOTICE file
  Signed-off-by: Allen Xu <[email protected]>
* shorten the code when dealing with data format
  Signed-off-by: Allen Xu <[email protected]>
* resolve more comments and add doc for usage
  Signed-off-by: Allen Xu <[email protected]>
* add --overwrite argument and resolve some comments
  Signed-off-by: Allen Xu <[email protected]>
* style update,unblock CI
  Signed-off-by: Allen Xu <[email protected]>
---------
Signed-off-by: Allen Xu <[email protected]>

Init entry point to data generation for scale test

865dd18

Signed-off-by: Allen Xu <[email protected]>

wjxiz1992 self-assigned this Aug 16, 2023

wjxiz1992 requested review from jlowe, revans2, tgravescs, GaryShen2008, NvTimLiu and zhanga5 as code owners August 16, 2023 10:24

wjxiz1992 marked this pull request as draft August 16, 2023 10:25

wjxiz1992 commented Aug 16, 2023

add date range

c5ec398

Signed-off-by: Allen Xu <[email protected]>

wjxiz1992 commented Aug 16, 2023

revans2 reviewed Aug 16, 2023

GaryShen2008 reviewed Aug 17, 2023

add correlatedKeyGroup settings for key groups

038d42b

Signed-off-by: Allen Xu <[email protected]>

wjxiz1992 commented Aug 17, 2023

wjxiz1992 added 2 commits August 17, 2023 18:36

refine NOTICE file

5d9f973

Signed-off-by: Allen Xu <[email protected]>

shorten the code when dealing with data format

d94d089

Signed-off-by: Allen Xu <[email protected]>

revans2 reviewed Aug 17, 2023

sameerz added the data gen label Aug 17, 2023

resolve more comments and add doc for usage

0fc3f42

Signed-off-by: Allen Xu <[email protected]>

wjxiz1992 commented Aug 18, 2023

revans2 reviewed Aug 18, 2023

wjxiz1992 marked this pull request as ready for review August 18, 2023 14:28

add --overwrite argument and resolve some comments

28f6c94

Signed-off-by: Allen Xu <[email protected]>

wjxiz1992 commented Aug 18, 2023

wjxiz1992 requested a review from revans2 August 18, 2023 15:35

revans2 reviewed Aug 18, 2023

style update,unblock CI

39cb6be

Signed-off-by: Allen Xu <[email protected]>

revans2 approved these changes Aug 21, 2023

wjxiz1992 merged commit 9d64c89 into NVIDIA:branch-23.10 Aug 22, 2023
26 of 27 checks passed

wjxiz1992 mentioned this pull request Aug 23, 2023

Add application to run Scale Test #9089

Merged
