
Spark: Correct partition transform functions to match spec.md #8192

Merged
merged 12 commits into apache:master from docs-fix-partition-transforms
Sep 25, 2023

Conversation

clettieri
Contributor

@clettieri clettieri commented Jul 31, 2023

Given the spec for the time transform functions, this PR adds the singular (instead of plural) names of the transform functions.

The spec specifies year, month, hour, day, but previously Spark only supported the plural version of these words (years, months, etc).

Hopefully this saves someone else a few cycles of debugging :)
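For quick reference, the singular/plural mapping described above can be sketched in plain Python (a minimal illustration based on this PR's description; the helper name is hypothetical and not part of any Iceberg API):

```python
# Singular transform names from the Iceberg spec, mapped to the plural
# aliases that Spark historically accepted (per this PR's description).
SPEC_TO_SPARK_ALIAS = {
    "year": "years",
    "month": "months",
    "day": "days",
    "hour": "hours",
}

# Reverse lookup: plural Spark alias -> spec-compliant singular name.
SPARK_ALIAS_TO_SPEC = {v: k for k, v in SPEC_TO_SPARK_ALIAS.items()}

def normalize_transform(name: str) -> str:
    """Return the spec's singular name for a transform, passing through
    names that have no plural alias (e.g. bucket, truncate)."""
    return SPARK_ALIAS_TO_SPEC.get(name, name)
```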

@Fokko
Copy link
Contributor

Fokko commented Jul 31, 2023

@clettieri Thanks for raising this. This is a known issue, and I also ran into it. However, maybe it is better to add day as an option to Spark, since the spec is considered the source of truth.

format/spec.md Outdated
| Transform name | Description | Source types | Result type |
|----------------|-------------|--------------|-------------|
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz` | `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz` | `int` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp`, `timestamptz` | `int` |
| **`years`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz` | `int` |
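The spec semantics quoted above (years, months, days, and hours counted from the Unix epoch) can be sketched in plain Python; this only illustrates the definitions and is not Iceberg's actual implementation:

```python
from datetime import date, datetime, timedelta

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1)

def year_transform(d: date) -> int:
    # Years from 1970: 1970 -> 0, 1969 -> -1
    return d.year - 1970

def month_transform(d: date) -> int:
    # Months from 1970-01: 1970-01 -> 0, 1971-02 -> 13
    return (d.year - 1970) * 12 + (d.month - 1)

def day_transform(d: date) -> int:
    # Days from 1970-01-01
    return (d - EPOCH_DATE).days

def hour_transform(ts: datetime) -> int:
    # Hours from 1970-01-01 00:00:00; timedelta floor division
    # floors toward negative infinity, keeping pre-epoch
    # timestamps in the correct (negative) bucket.
    return (ts - EPOCH_TS) // timedelta(hours=1)
```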
Member

This might impact Trino, Dremio, and other engines which have naming conventions that follow the spec.

https://trino.io/docs/current/connector/iceberg.html#partitioned-tables
https://docs.dremio.com/cloud/reference/sql/commands/create-table/

IMO, better to change the Spark transform name and deprecate the old ones.


Contributor Author

I can do that.

Would it be useful to keep the unintended "backward" compatibility, though? Or do you think we should just force the Spark API to match the spec?

Contributor

We need to maintain backward compatibility. I would suggest adding the singular options while also keeping the plural ones. They don't conflict, and keeping both avoids people running into errors. Also curious what others think of it.

Contributor Author

@clettieri clettieri Aug 1, 2023


I added singular options to each Spark3 version. How does this look?

I can add some tests as well if you think that would be useful.

Also, should I change the prefix of this PR to Spark: ?

Contributor

I can add some tests as well if you think that would be useful.

You can add a test in TestAlterTablePartitionFields to make sure that we don't break this in the future. It could be something similar to testSparkTableAddDropPartitions, where you use the singular transform names.

Also, should I change the prefix of this PR to Spark: ?

Yes please :)

Contributor Author

For the tests, what if I replicate these but for some or all of year, month, hour, day?

Contributor

Yes, that would be perfect

Contributor Author

Added the tests.

I ran ./gradlew :iceberg-spark:iceberg-spark-extensions-3.4_2.12:test --tests TestAlterTablePartitionFields locally and saw things pass. LMK if I can do anything else here (I'll try to catch up on the contributing guidelines tomorrow). Thanks!

@clettieri clettieri force-pushed the docs-fix-partition-transforms branch from 433c8bf to 10ddc9b Compare August 1, 2023 13:41
@github-actions github-actions bot added the spark label Aug 1, 2023
@clettieri clettieri changed the title Docs: Correct partition transform functions on spec.md Spark: Correct partition transform functions on spec.md Aug 1, 2023
@clettieri clettieri changed the title Spark: Correct partition transform functions on spec.md Spark: Correct partition transform functions to match spec.md Aug 1, 2023
Contributor

@Fokko Fokko left a comment


I also ran into this, more than once :) It would be good to get this in! Thanks for working on this @clettieri

@Fokko Fokko added this to the Iceberg 1.4.0 milestone Aug 2, 2023
@RussellSpitzer
Member

I would want @rdblue to check this. If I remember correctly, we originally did this because the base implementation happened to be wrong, so we had to add another implementation that accounted for 0 correctly.

The other issue here is that the non-plural names are Spark native functions; is that a problem?
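One plausible reading of the zero-handling concern mentioned above (an assumption on our part, not confirmed in this thread): a naive implementation using Java-style truncating integer division puts pre-epoch values in the wrong bucket, while floor division does not. A minimal sketch:

```python
MICROS_PER_HOUR = 3_600_000_000  # Iceberg timestamps are microseconds from epoch

def hour_bucket_truncating(ts_micros: int) -> int:
    # Emulates Java's `/`, which truncates toward zero:
    # wrong for negative (pre-epoch) values.
    q = abs(ts_micros) // MICROS_PER_HOUR
    return -q if ts_micros < 0 else q

def hour_bucket_floor(ts_micros: int) -> int:
    # Floor division: one microsecond before the epoch
    # correctly lands in hour -1.
    return ts_micros // MICROS_PER_HOUR
```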

@nastra nastra requested a review from rdblue September 13, 2023 05:58
PartitionSpec expected =
PartitionSpec.builderFor(table.schema()).withSpecId(1).year("ts").build();

Assert.assertEquals("Should have new spec field", expected, table.spec());
Contributor

It would be great to not add more JUnit4-style assertions (even if the underlying class already uses them). Can you please convert those to AssertJ? That will make the migration to JUnit5 easier. See https://iceberg.apache.org/contribute/#testing for some details.

Contributor Author

Will do :)

Contributor Author

@nastra updated the tests, let me know what you think

@aokolnychyi
Contributor

@clettieri, sorry it took so long to get back to this PR.

Could you please add tests that the new name is also supported in CREATE TABLE statements? I believe this change only covers ALTER TABLE. In addition, this PR should be rebased and include Spark 3.5.

@clettieri
Contributor Author

clettieri commented Sep 20, 2023

@clettieri, sorry it took so long to get back to this PR.

Could you please add tests that the new name is also supported in CREATE TABLE statements? I believe this change only covers ALTER TABLE. In addition, this PR should be rebased and include Spark 3.5.

Hey @aokolnychyi, I can rebase and include Spark 3.5 👍🏼 .

Regarding the CREATE TABLE tests, though, I don't see any for the previous transform functions and am unclear what new functionality they would test compared to what we have now. The current tests seem sufficient, IMO, to validate that the transform functions can be called. Am I missing something here? Should I create a new test class to verify that creating tables with these transform functions succeeds?

Thanks!

@aokolnychyi
Contributor

@clettieri, what about TestCreateTable? In addition to what is covered by the new tests, we should also check that the new syntax is supported by Spark without extensions.

@aokolnychyi
Contributor

Looks like there are some spotless issues that fail the build.

@clettieri
Contributor Author

Looks like there are some spotless issues that fail the build.

Yep, I see that. I'll try to get to that on Monday. I'm a bit new to the JVM/Spark world and was having some trouble switching Spark environments to run Gradle locally. :)

@aokolnychyi
Contributor

Thank you, @clettieri!

@rdblue
Contributor

rdblue commented Sep 24, 2023

I fixed a typo and opened clettieri#1 with the spotless changes.

@clettieri
Contributor Author

I fixed a typo and opened clettieri#1 with the spotless changes.

Thank you for saving me the time this morning :)

@nastra
Contributor

nastra commented Sep 25, 2023

thanks @clettieri

@nastra nastra merged commit c0bed74 into apache:master Sep 25, 2023
37 checks passed
@ajantha-bhat
Member

ajantha-bhat commented Sep 25, 2023

Thanks for fixing. I couldn't take a look at it again before.
LGTM.

But we missed updating the documentation in https://github.com/apache/iceberg/blob/master/docs/spark-ddl.md.

@nk1506 would you like to work on it?

nk1506 added a commit to nk1506/iceberg that referenced this pull request Sep 25, 2023
This particular apache#8192 has fixed the code, but it seems the documentation is not in sync. Hence the follow-up PR.