Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support rebase checking for nested dates and timestamps #9617

Merged
merged 8 commits into from
Nov 3, 2023

Conversation

ttnghia
Copy link
Collaborator

@ttnghia ttnghia commented Nov 2, 2023

This implements recursive checking to see if the input column contains nested dates/timestamps that need to be rebased.

Some directly related tests are fixed. Some others need to be cleanup/fixed after having LEGACY rebase mode support in parquet reading.

Partially contributes to #1126.

@ttnghia ttnghia added the task Work required that improves the product but is not user facing label Nov 2, 2023
@ttnghia ttnghia requested a review from revans2 November 2, 2023 21:06
@ttnghia ttnghia self-assigned this Nov 2, 2023
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 2, 2023

build

@ttnghia ttnghia marked this pull request as ready for review November 2, 2023 21:27
@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 3, 2023

build

s"${SparkShimImpl.int96ParquetRebaseReadKey} is EXCEPTION")
}
case "CORRECTED" => // Good
case "EXCEPTION" | "CORRECTED" => // Good
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is EXCEPTION ok now? Do we have any tests that exercise this path?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EXCEPTION is handled somewhere else, after loading data. This is the planning step, thus we are OK to run on the GPU with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have test for it too. That is one of the two that I modified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused because I didn't see EXCEPTION explicitly in the tests, but I checked the docs, and it is the default setting for datetimeRebaseModeInRead, so it is covered.

@ttnghia ttnghia merged commit 401d0d8 into NVIDIA:branch-23.12 Nov 3, 2023
37 checks passed
@ttnghia ttnghia deleted the check_rebase_nested branch November 3, 2023 16:37
ttnghia added a commit to ttnghia/spark-rapids that referenced this pull request Nov 6, 2023
ttnghia added a commit to ttnghia/spark-rapids that referenced this pull request Nov 6, 2023
ttnghia added a commit that referenced this pull request Nov 6, 2023
ttnghia added a commit to ttnghia/spark-rapids that referenced this pull request Nov 6, 2023
…IA#9617)"

This reverts commit 401d0d8.

Signed-off-by: Nghia Truong <[email protected]>

# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
ttnghia added a commit that referenced this pull request Nov 17, 2023
…owed timestamp [databricks] (#9736)

* Add check for nested types

* Add check for nested types

* Recursively check for rebasing

* Extract common code

* Allow nested type in rebase check

* Enable nested timestamp in roundtrip test

* Fix another test

Signed-off-by: Nghia Truong <[email protected]>

* Enable `LEGACY` rebase in read

* Remove comment

* Change function/class signatures

* Complete modification

* Misc

Signed-off-by: Nghia Truong <[email protected]>

* Add explicit type

Signed-off-by: Nghia Truong <[email protected]>

* Rename file and add some stuff in DateTimeRebaseHelpers.scala

* Move file and rename class

* Adopt new enum type

Signed-off-by: Nghia Truong <[email protected]>

* Add name for the enum classes

* Change exception messages

* Does not yet support legacy rebase in read

Signed-off-by: Nghia Truong <[email protected]>

* Change legacy to corrected mode

Signed-off-by: Nghia Truong <[email protected]>

* Extract common code

Signed-off-by: Nghia Truong <[email protected]>

* Rename functions

Signed-off-by: Nghia Truong <[email protected]>

* Reformat

Signed-off-by: Nghia Truong <[email protected]>

* Make classes serializable

Signed-off-by: Nghia Truong <[email protected]>

* Revert "Support rebase checking for nested dates and timestamps (#9617)"

This reverts commit 401d0d8.

Signed-off-by: Nghia Truong <[email protected]>

# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

* Implement date time rebase

* Optimize rebase op

* Change comment

Signed-off-by: Nghia Truong <[email protected]>

* Move tests

* Add test for datatime rebase

Signed-off-by: Nghia Truong <[email protected]>

* Various changes

Signed-off-by: Nghia Truong <[email protected]>

* Various changes

Signed-off-by: Nghia Truong <[email protected]>

# Conflicts:
#	sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala

* Fix compile errors

Signed-off-by: Nghia Truong <[email protected]>

* Fix comments

Signed-off-by: Nghia Truong <[email protected]>

* Fix indentations

Signed-off-by: Nghia Truong <[email protected]>

* Change comments and indentations

Signed-off-by: Nghia Truong <[email protected]>

* Allow nested check for rebase

* Write different timestamp types in test

* Fix conversion if timestamp is not micros

* Rename var

* Dont have to down cast after up cast

Signed-off-by: Nghia Truong <[email protected]>

* Change comment

Signed-off-by: Nghia Truong <[email protected]>

* Still cast timestamp to the old type after rebasing

Signed-off-by: Nghia Truong <[email protected]>

* Rename test

Signed-off-by: Nghia Truong <[email protected]>

* Should not transform non-datetime types

Signed-off-by: Nghia Truong <[email protected]>

* Fix test

* Update tests

Signed-off-by: Nghia Truong <[email protected]>

* Enable int96 rebase in write

* Change tests

Signed-off-by: Nghia Truong <[email protected]>

* Complete tests

Signed-off-by: Nghia Truong <[email protected]>

* Revert unrelated changes

Signed-off-by: Nghia Truong <[email protected]>

* Change configs

Signed-off-by: Nghia Truong <[email protected]>

* Merge tests

Signed-off-by: Nghia Truong <[email protected]>

* Simplify test data

Signed-off-by: Nghia Truong <[email protected]>

* Add a new write test

Signed-off-by: Nghia Truong <[email protected]>

* Add a mixed rebase test

Signed-off-by: Nghia Truong <[email protected]>

* Change tests

Signed-off-by: Nghia Truong <[email protected]>

* Fix `seed` in tests

Signed-off-by: Nghia Truong <[email protected]>

* Rename tests

Signed-off-by: Nghia Truong <[email protected]>

* Remove seed override

* Change TimestampGen

Signed-off-by: Nghia Truong <[email protected]>

* Remove default seed

Signed-off-by: Nghia Truong <[email protected]>

* Add default seed

Signed-off-by: Nghia Truong <[email protected]>

* Remove default seed

Signed-off-by: Nghia Truong <[email protected]>

---------

Signed-off-by: Nghia Truong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants