Initial entry point to data generation for scale test #9054

Merged
merged 8 commits into from
Aug 22, 2023

wjxiz1992 (Collaborator) commented Aug 16, 2023

As titled.

close #8813

This PR provides the initial entry point to the data generation application for the scale test.
The design and user interface are described at #8813 (comment).

This is still a DRAFT version, posted for early review and feedback.

One example command to test it locally:

$SPARK_HOME/bin/spark-submit \
--master spark://*:7077 \
--conf spark.driver.memory=10G \
--conf spark.executor.memory=32G \
--conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \
--conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED \
--class com.nvidia.rapids.tests.scaletest.ScaleTestDataGen \
--jars $SPARK_HOME/examples/jars/scopt_2.12-3.7.1.jar \
./target/datagen_2.12-23.10.0-SNAPSHOT-spark332.jar \
1 \
10 \
parquet \
file:/*/testdata

As an example, here are the actual on-disk sizes of the generated data, to give a basic impression:
For Scale=1, Complexity=1 and parquet file:

2.2M    a_facts
115M    b_data
28M     c_data
26M     d_data
282M    e_data
296K    f_facts
150M    g_data

For Scale=1, Complexity=10 and parquet file:

2.7M    a_facts
302M    b_data
21M     c_data
57M     d_data
295M    e_data
584K    f_facts
150M    g_data

For Scale=10, Complexity=10 and parquet file:

27M     a_facts
3.0G    b_data
295M    c_data
655M    d_data
2.9G    e_data
4.7M    f_facts
1.5G    g_data
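
To make the three listings above easier to compare, here is a minimal Python sketch that totals each run and computes per-table growth between two runs. It is not part of the PR. It assumes the sizes are `du -h` style values with binary units (K = 1024 bytes, and so on), and the names `to_bytes`, `total_mib`, and `growth` are hypothetical helpers introduced only for this illustration.

```python
# Sizes copied verbatim from the three listings above.
# Assumption: du -h style binary units (K = 1024, M = 1024**2, G = 1024**3).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

TABLES = ["a_facts", "b_data", "c_data", "d_data", "e_data", "f_facts", "g_data"]

# Keyed by (scale, complexity); values follow the order of TABLES.
RUNS = {
    (1, 1): ["2.2M", "115M", "28M", "26M", "282M", "296K", "150M"],
    (1, 10): ["2.7M", "302M", "21M", "57M", "295M", "584K", "150M"],
    (10, 10): ["27M", "3.0G", "295M", "655M", "2.9G", "4.7M", "1.5G"],
}


def to_bytes(size: str) -> float:
    """Convert a size such as '2.2M' or '296K' to bytes."""
    return float(size[:-1]) * UNITS[size[-1]]


def total_mib(run: tuple) -> float:
    """Total size of all tables in one run, in MiB."""
    return sum(to_bytes(s) for s in RUNS[run]) / UNITS["M"]


def growth(base: tuple, other: tuple) -> dict:
    """Per-table size ratio other/base."""
    return {
        table: to_bytes(o) / to_bytes(b)
        for table, b, o in zip(TABLES, RUNS[base], RUNS[other])
    }


if __name__ == "__main__":
    for run in RUNS:
        print(f"scale={run[0]} complexity={run[1]}: {total_mib(run):.1f} MiB")
    for table, ratio in growth((1, 10), (10, 10)).items():
        print(f"{table}: x{ratio:.2f}")
```

Under these assumptions the totals come to roughly 603 MiB, 828 MiB, and 8.4 GiB respectively, and going from Scale=1 to Scale=10 at Complexity=10 grows most tables by about 10x (b_data by about 10.2x, for example). The listed sizes are rounded, so the ratios are only approximate.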

  • Basic code structure
  • README doc for how to use it
  • Tests for data and scale queries

wjxiz1992 self-assigned this Aug 16, 2023
wjxiz1992 marked this pull request as draft August 16, 2023 10:25
Signed-off-by: Allen Xu <[email protected]>
datagen/README.md (outdated, resolved)
datagen/README.md (outdated, resolved)
datagen/pom.xml (outdated, resolved)

wjxiz1992 (Collaborator, Author) left a comment

Resolved comments and added CorrelatedKeyGroup for key groups in tables.

datagen/README.md (outdated, resolved)
datagen/README.md (outdated, resolved)
datagen/pom.xml (resolved)
datagen/pom.xml (outdated, resolved)

revans2 (Collaborator) left a comment

Mostly looks good. Just a few nits.

wjxiz1992 (Collaborator, Author) left a comment

Resolved more comments.

revans2 (Collaborator) left a comment

Thanks for your patience in doing the rework. I think we are really close now and it looks really good.

Thanks

datagen/ScaleTest.md (outdated, resolved)
datagen/ScaleTest.md (resolved)
datagen/ScaleTest.md (outdated, resolved)
wjxiz1992 marked this pull request as ready for review August 18, 2023 14:28

wjxiz1992 (Collaborator, Author) left a comment

Thanks for the review! Added a flag argument "--overwrite" and resolved the remaining comments.

datagen/ScaleTest.md (outdated, resolved)
datagen/ScaleTest.md (resolved)
datagen/ScaleTest.md (outdated, resolved)
wjxiz1992 requested a review from revans2 August 18, 2023 15:35

revans2 (Collaborator) left a comment

ScaleTest.md also appears to have some TODOs in it. Are there plans to fix that?

datagen/ScaleTest.md (resolved)

wjxiz1992 (Collaborator, Author) commented

it. Are there plans to fix that?

Yes. This PR only adds the data generation part; it does not include the queries to run against the data.

I put a TODO in the Test Query Sets section to cover the upcoming work for #8814 and #8816, so that this scale test becomes a complete, integrated test suite.

Signed-off-by: Allen Xu <[email protected]>
revans2 (Collaborator) commented Aug 21, 2023

build

wjxiz1992 merged commit 9d64c89 into NVIDIA:branch-23.10 Aug 22, 2023
26 of 27 checks passed
mythrocks pushed a commit to mythrocks/spark-rapids that referenced this pull request Aug 24, 2023
* Init entry point to data generation for scale test

Signed-off-by: Allen Xu <[email protected]>

* add date range

Signed-off-by: Allen Xu <[email protected]>

* add correlatedKeyGroup settings for key groups

Signed-off-by: Allen Xu <[email protected]>

* refine NOTICE file

Signed-off-by: Allen Xu <[email protected]>

* shorten the code when dealing with data format

Signed-off-by: Allen Xu <[email protected]>

* resolve more comments and add doc for usage

Signed-off-by: Allen Xu <[email protected]>

* add --overwrite argument and resolve some comments

Signed-off-by: Allen Xu <[email protected]>

* style update,unblock CI

Signed-off-by: Allen Xu <[email protected]>

---------

Signed-off-by: Allen Xu <[email protected]>
Successfully merging this pull request may close these issues.

[FEA] Write entry point to generate data for scale testing.
4 participants