Spark: Correct partition transform functions to match spec.md #8192
Conversation
@clettieri Thanks for raising this. This is a known issue, and I also ran into it. However, maybe it is better to add
format/spec.md
| Transform name | Description | Source types | Result type |
|----------------|-------------|--------------|-------------|
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz` | `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz` | `int` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp`, `timestamptz` | `int` |
| **`years`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz` | `int` |
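For context, the spec's time transforms are ordinals counted from the Unix epoch, not calendar-field extraction. A minimal Python sketch (the function names here are hypothetical, not Iceberg's API) of what each row in the table computes:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def year_transform(ts):
    # Years since 1970: 1970 -> 0, 1969 -> -1
    return ts.year - 1970

def month_transform(ts):
    # Months since 1970-01: 1970-01 -> 0, 1969-12 -> -1
    return (ts.year - 1970) * 12 + (ts.month - 1)

def day_transform(ts):
    # Whole days since 1970-01-01; timedelta.days floors,
    # so pre-epoch timestamps land in the correct negative day
    return (ts - EPOCH).days

def hour_transform(ts):
    # Whole hours since 1970-01-01 00:00:00 (floor division)
    return int((ts - EPOCH).total_seconds() // 3600)
```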
This might impact Trino, Dremio, and other engines which have naming conventions according to the spec.
https://trino.io/docs/current/connector/iceberg.html#partitioned-tables
https://docs.dremio.com/cloud/reference/sql/commands/create-table/
IMO, better to change the spark transform name and deprecate the old ones.
I can do that.
Would it be useful to keep the unintended "backward" compatibility though? Or do you think we should just force the Spark API to match the spec?
We need to maintain backward compatibility. I would suggest adding the singular options and also keeping the plural ones. They don't conflict, and this keeps people from running into errors. Also curious what others think of it.
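The suggestion above amounts to a name lookup where singular and plural spellings resolve to the same transform. A hypothetical Python sketch of that alias approach (not Iceberg's actual Java implementation):

```python
from datetime import datetime

def _months_from_epoch(ts):
    # Months since 1970-01, matching the spec's month transform
    return (ts.year - 1970) * 12 + (ts.month - 1)

# Both spellings resolve to the same function, so existing DDL
# using the plural name keeps working alongside the spec name.
TRANSFORMS = {
    "month": _months_from_epoch,
    "months": _months_from_epoch,  # legacy plural alias kept for compatibility
}

def resolve(name):
    try:
        return TRANSFORMS[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown transform: {name}")
```

Because the aliases map to the identical function object, there is no conflict and no behavior difference between the two spellings.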
I added singular options to each Spark3 version. How does this look?
I can add some tests as well if you think that would be useful.
Also, should I change the prefix of this PR to Spark:?
I can add some tests as well if you think that would be useful.
You can add a test in TestAlterTablePartitionFields to make sure that we don't break this in the future. It could be something similar to testSparkTableAddDropPartitions, where you use the singular transform names.
Also, should I change the prefix of this PR to Spark: ?
Yes please :)
For the tests, what if I replicate these but for some or all of year, month, hour, day?
Yes, that would be perfect
Added the tests.
I ran ./gradlew :iceberg-spark:iceberg-spark-extensions-3.4_2.12:test --tests TestAlterTablePartitionFields locally and saw things pass. LMK if I can do anything else here (I'll try to catch up on the contributing guidelines tomorrow). Thanks!
Force-pushed 433c8bf to 10ddc9b
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java (review thread resolved)
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java (review thread resolved)
I also ran into this, more than once :3 Would be good to get this in! Thanks for working on this @clettieri
I would want @rdblue to check this. If I remember correctly, we originally did this because the base implementation happened to be wrong, so we had to add another implementation that accounted for 0 correctly. The other issue here is that the non-plural names are Spark native functions; is that an issue?
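One plausible reading of the "accounted for 0 correctly" remark: Spark's native year/month functions extract calendar fields, while Iceberg's transforms are ordinals from the epoch, and the two diverge, most visibly before 1970. A small Python sketch of that divergence (illustrative only, not either implementation):

```python
from datetime import datetime

ts = datetime(1969, 12, 15)

# Calendar-field extraction, like Spark's native month() function
extracted = ts.month  # 12

# Epoch-ordinal, like Iceberg's month transform: months since 1970-01.
# Pre-epoch values must come out negative, which naive extraction misses.
ordinal = (ts.year - 1970) * 12 + (ts.month - 1)  # -1
```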
PartitionSpec expected =
    PartitionSpec.builderFor(table.schema()).withSpecId(1).year("ts").build();

Assert.assertEquals("Should have new spec field", expected, table.spec());
It would be great to not add more JUnit4-style assertions (even if the underlying class already uses them). Can you please convert those to AssertJ? That will make migration to JUnit5 easier. See https://iceberg.apache.org/contribute/#testing for some details.
Will do :)
@nastra updated the tests, let me know what you think
@clettieri, sorry it took so long to get back to this PR. Could you please add tests that the new names are also supported in CREATE TABLE statements? I believe this change only covers ALTER TABLE. In addition, this PR should be rebased and include Spark 3.5.
Hey @aokolnychyi, I can rebase and include Spark 3.5 👍🏼. Regarding the CREATE TABLE tests though, I don't see any for the previous transform functions and am unclear what new functionality they would test compared to what we have now. The current tests seem sufficient IMO to validate that the transform functions can be called. Am I missing something here? Should I create a new test class to verify that creating tables with these transform functions succeeds? Thanks!
Force-pushed 569e5bf to 0b9c76e
@clettieri, what about
Looks like there are some spotless issues that fail the build.
Yep, I see that. I'll try to get to that on Monday. I'm a bit new to the JVM/Spark world and was having some trouble switching Spark environments to run Gradle locally. :)
Thank you, @clettieri!
I fixed a typo and opened clettieri#1 with the spotless changes.
Apply spotless
Thank you for saving me the time this morning :)
thanks @clettieri
Thanks for fixing. I couldn't take a look at it again before now. But we missed updating the documentation in https://github.com/apache/iceberg/blob/master/docs/spark-ddl.md. @nk1506, would you like to work on it?
This particular apache#8192 has fixed the code, but it seems the documentation is not in sync. Hence the follow-up PR.
Given the spec for the time transform functions, this PR adds the singular (instead of plural) names of the transform functions. The spec specifies year, month, hour, day, but previously Spark only supported the plural versions of these words (years, months, etc.). Hopefully this saves someone else a few cycles of debugging :)