Change `--string-view` to only apply to parquet formats #11663

XiangpengHao · 2024-07-25T22:37:45Z

Note targets string-view2 branch

Which issue does this PR close?

Closes #.

Rationale for this change

~~I think the current Parquet opener will read the schema upon opening the file. I guess we should use the provided table schema instead. I'm not super sure, so please correct me if I missed anything~~

I realized that the table schema and the parquet file schema can be different, and we need to transform the schema twice. Because of that, I think we should change the configuration to apply to the parquet format only, and when we have more bandwidth, we can make it work for other formats.

This is important for StringView related schema transformation. Without this, we will load the schema to Utf8, then cast to Utf8View, which is super slow.
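
For illustration, here is a minimal sketch of the Utf8 --> Utf8View schema rewrite described above. The `DataType`, `Field`, and `Schema` types are toy stand-ins rather than the arrow-rs types, and `force_string_view` is a hypothetical name; the actual code in this PR may look different.

```rust
// Toy stand-ins for Arrow's schema types (assumption: NOT the arrow-rs API).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    Utf8View,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

/// Hypothetical helper: return a copy of `schema` in which every Utf8
/// column is declared as Utf8View. All other columns are left untouched.
fn force_string_view(schema: &Schema) -> Schema {
    let fields = schema
        .fields
        .iter()
        .map(|f| match f.data_type {
            DataType::Utf8 => Field {
                name: f.name.clone(),
                data_type: DataType::Utf8View,
            },
            _ => f.clone(),
        })
        .collect();
    Schema { fields }
}

/// Small constructor to keep call sites short.
fn field(name: &str, data_type: DataType) -> Field {
    Field {
        name: name.to_string(),
        data_type,
    }
}
```

Applying the same function to both the file schema and the table schema keeps the two consistent even when they differ, so the reader can produce Utf8View directly instead of loading Utf8 and casting afterwards.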

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Looks like some tests are failing

Maybe what is needed is to do the same Utf8 --> Utf8View transformation on the file schema (rather than using the table schema)

XiangpengHao · 2024-07-26T15:46:29Z

> Maybe what is needed is to do the same Utf8 --> Utf8View transformation on the file schema (rather than using the table schema)

Absolutely! I've updated the related code so that the config only applies to parquet files (for now)
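
A rough sketch of what parquet-only gating could look like. `FileFormat`, `Options`, and `use_string_view` are hypothetical names made up for this example and are not the DataFusion config API.

```rust
// Hypothetical format tag (assumption: not a DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum FileFormat {
    Parquet,
    Csv,
    Json,
}

// Hypothetical options struct holding the string-view knob.
struct Options {
    force_string_view: bool,
}

/// The knob only takes effect for parquet; other formats ignore it for now.
fn use_string_view(opts: &Options, format: FileFormat) -> bool {
    opts.force_string_view && format == FileFormat::Parquet
}
```

With this shape, turning the option on leaves CSV and JSON reads unchanged, which matches the "parquet only (for now)" scope of the PR.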

alamb

Makes sense to me -- thank you @XiangpengHao

I filed #11682 so we don't forget to turn this on by default

… some ClickBench queries (not on by default) (#11667)

* Pin to pre-release version of arrow 52.2.0
* Update for deprecated method
* Add a config to force using string view in benchmark (#11514)
* add a knob to force string view in benchmark
* fix sql logic test
* update doc
* fix ci
* fix ci only test
* Update benchmarks/src/util/options.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/common/src/config.rs

Co-authored-by: Andrew Lamb <[email protected]>

* update tests

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Add String view helper functions (#11517)
* add functions
* add tests for hash util
* Add ArrowBytesViewMap and ArrowBytesViewSet (#11515)
* Update `string-view` branch to arrow-rs main (#10966)
* Pin to arrow main
* Fix clippy with latest arrow
* Uncomment test that needs new arrow-rs to work
* Update datafusion-cli Cargo.lock
* Update Cargo.lock
* tapelo
* merge
* update cast
* consistent dep
* fix ci
* add more tests
* make doc happy
* update new implementation
* fix bug
* avoid unused dep
* update dep
* update
* fix cargo check
* update doc
* pick up the comments change again

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Enable `GroupValueBytesView` for aggregation with StringView types (#11519)
* add functions
* Update `string-view` branch to arrow-rs main (#10966)
* Pin to arrow main
* Fix clippy with latest arrow
* Uncomment test that needs new arrow-rs to work
* Update datafusion-cli Cargo.lock
* Update Cargo.lock
* tapelo
* merge
* update cast
* consistent dep
* fix ci
* avoid unused dep
* update dep
* update
* fix cargo check
* better group value view aggregation
* update

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Initial support for regex_replace on `StringViewArray` (#11556)
* initial support for string view regex
* update tests
* Add support for Utf8View for date/temporal codepaths (#11518)
* Add StringView support for date_part and make_date funcs
* run cargo update in datafusion-cli
* cargo fmt

---------

Co-authored-by: Andrew Lamb <[email protected]>

* GC `StringViewArray` in `CoalesceBatchesStream` (#11587)
* gc string view when appropriate
* make clippy happy
* address comments
* make doc happy
* update style
* Add comments and tests for gc_string_view_batch
* better herustic
* update test
* Update datafusion/physical-plan/src/coalesce_batches.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>

* [Bug] fix bug in return type inference of `utf8_to_int_type` (#11662)
* fix bug in return type inference
* update doc
* add tests

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Fix clippy
* Increase ByteViewMap block size to 2MB (#11674)
* better default block size
* fix related test
* Change `--string-view` to only apply to parquet formats (#11663)
* use inferenced schema, don't load schema again
* move config to parquet-only
* update
* update
* better format
* format
* update
* Implement native support StringView for character length (#11676)
* native support for character length
* Update datafusion/functions/src/unicode/character_length.rs

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Remove uneeded patches
* cargo fmt

---------

Co-authored-by: Xiangpeng Hao <[email protected]>
Co-authored-by: Xiangpeng Hao <[email protected]>
Co-authored-by: Andrew Duffy <[email protected]>

github-actions bot added the core Core DataFusion crate label Jul 25, 2024

alamb reviewed Jul 26, 2024

github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Jul 26, 2024

XiangpengHao added 2 commits July 26, 2024 10:50

use inferenced schema, don't load schema again

8d960bf

move config to parquet-only

4805e12

XiangpengHao force-pushed the string-view2-schema branch from cb24fc0 to 4805e12 Compare July 26, 2024 14:52

XiangpengHao added 2 commits July 26, 2024 11:31

update

b239d51

update

1632af6

XiangpengHao changed the title ~~Minor use table schema in ParquetOpener~~ Change --string-view to only apply to parquet formatss Jul 26, 2024

XiangpengHao changed the title ~~Change --string-view to only apply to parquet formatss~~ Change --string-view to only apply to parquet formats Jul 26, 2024

XiangpengHao added 2 commits July 26, 2024 11:41

better format

bef4350

format

c56ca13

XiangpengHao requested a review from alamb July 26, 2024 15:46

XiangpengHao changed the title ~~Change --string-view to only apply to parquet formats~~ Change --string-view to only apply to parquet formats Jul 26, 2024

update

b052dd3

alamb approved these changes Jul 27, 2024

alamb merged commit 322c3d2 into apache:string-view2 Jul 27, 2024
26 checks passed

alamb mentioned this pull request Jul 27, 2024

Merge string-view2 branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) #11667

Merged
