[KYUUBI #5550] Optimizing TPC-DS dataset generation for 10x speedup
### _Why are the changes needed?_

1. This PR fixes the precision loss issue in `xx_gmt_offset`. Note that since `xx_gmt_offset` is of integer type, there is no actual loss of precision; the difference is in the displayed scale (`-5` versus `-5.00`).

   Reference output from Trino:

   ```
   trino:tiny> select cc_gmt_offset from call_center ;
    cc_gmt_offset
   ---------------
            -5.00
            -5.00
   ```

   Before this PR:

   ```scala
   scala> spark.sql("select cc_gmt_offset from tpcds.tiny.call_center").show
   +-------------+
   |cc_gmt_offset|
   +-------------+
   |           -5|
   |           -5|
   +-------------+
   ```

   After this PR:

   ```scala
   scala> spark.sql("select cc_gmt_offset from tpcds.tiny.call_center").show
   +-------------+
   |cc_gmt_offset|
   +-------------+
   |        -5.00|
   |        -5.00|
   +-------------+
   ```

2. This PR accelerates the generation of the TPC-DS dataset by optimizing the way rows are generated. Before this PR, each **Trino TableRow** was converted into a **String Row** and then into a **Spark InternalRow**. After this PR, each **Trino TableRow** is converted directly into a **Spark InternalRow**, eliminating unnecessary `toString` operations. This change significantly improves the speed of TPC-DS dataset generation.
   ```scala
   spark.table("tpcds.sf1000.catalog_sales").foreach(r => ())
   ```

   Task Duration before this PR:

   ![Screenshot 2023-10-30 4.04.12 PM](https://github.com/apache/kyuubi/assets/8537877/69bd9938-2886-4044-99b8-79ed20d4791c)

   Task Duration after this PR:

   ![Screenshot 2023-10-30 4.02.08 PM](https://github.com/apache/kyuubi/assets/8537877/ddfe01a9-081c-41b5-b82c-a0934dd8686c)

### _How was this patch tested?_

- New UT `tpcds.tiny count and checksum`
- Compare checksum values before and after this PR on the 1TB dataset

| table_name             | count      | checksum            |
|------------------------|------------|---------------------|
| call_center            | 42         | 95607401475         |
| catalog_page           | 30000      | 64470199469085      |
| catalog_returns        | 143996756  | 309202327050775220  |
| catalog_sales          | 1439980416 | 3092267266923848000 |
| customer               | 12000000   | 25769069905636795   |
| customer_address       | 6000000    | 12889423380880973   |
| customer_demographics  | 1920800    | 4124183189708148    |
| date_dim               | 73049      | 156926081012862     |
| household_demographics | 7200       | 15494873325812      |
| income_band            | 20         | 41180951007         |
| inventory              | 783000000  | 1681487454682584456 |
| item                   | 300000     | 643000708260945     |
| promotion              | 1500       | 3270935493709       |
| reason                 | 65         | 118806664977        |
| ship_mode              | 20         | 52349078860         |
| store                  | 1002       | 2096408105720       |
| store_returns          | 287999764  | 618451374856897114  |
| store_sales            | 2879987999 | 6184670571185100839 |
| time_dim               | 86400      | 186045071019485     |
| warehouse              | 20         | 31374161844         |
| web_page               | 3000       | 6502456139647       |
| web_returns            | 71997522   | 154614570845312413  |
| web_sales              | 720000376  | 1546188452223821591 |
| web_site               | 54         | 107485781738        |

### _Was this patch authored or co-authored using generative AI tooling?_

No

Closes #5562 from cfmcgrady/tpcds-perf.

Closes #5550

a789b9e [Fu Chen] maxPartitionBytes=384m
659e209 [Fu Chen] style
916f6d2 [Fu Chen] unnecessary change
75981af [Fu Chen] tpcds perf

Authored-by: Fu Chen <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit 4c915b7)
Signed-off-by: Cheng Pan <[email protected]>
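
The row-generation change described in this message can be illustrated with a minimal, dependency-free sketch. It is not the code from this patch: `SourceRow`, `TargetRow`, `viaStrings`, and `direct` are hypothetical stand-ins for the Trino and Spark row types and for the two conversion paths, and the sample values are made up. The sketch only contrasts a text round trip with a direct copy of typed values, and shows that a decimal with scale 2 prints as `-5.00` while still comparing equal to `-5`.

```scala
object RowConversionSketch {
  // Hypothetical stand-in for a generator row that exposes typed values.
  final case class SourceRow(id: Long, name: String, gmtOffset: BigDecimal)

  // Hypothetical stand-in for an engine-internal row: a flat sequence of typed values.
  final case class TargetRow(values: Vector[Any])

  // Indirect path: render every field as text, then parse it back per column type.
  def viaStrings(row: SourceRow): TargetRow = {
    val rendered: Vector[String] =
      Vector(row.id.toString, row.name, row.gmtOffset.toString)
    TargetRow(Vector[Any](rendered(0).toLong, rendered(1), BigDecimal(rendered(2))))
  }

  // Direct path: copy the typed values across without a text round trip.
  def direct(row: SourceRow): TargetRow =
    TargetRow(Vector[Any](row.id, row.name, row.gmtOffset))

  def main(args: Array[String]): Unit = {
    // Made-up sample row; the offset carries an explicit scale of 2.
    val row = SourceRow(1L, "example", BigDecimal(-5).setScale(2))

    // Both paths yield the same values; the direct one skips the string work.
    println(viaStrings(row) == direct(row))

    // The scale affects rendering, not the numeric value.
    println(direct(row).values.mkString(" | "))
    println(BigDecimal(-5) == BigDecimal(-5).setScale(2))
  }
}
```

In this toy version the saving is one `toString` and one parse per field; over billions of generated rows that per-field overhead is the kind of cost the message says was eliminated.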