Support Location Providers #1452
Conversation
LocationProviders
pyiceberg/table/__init__.py
Outdated
module_name, class_name = ".".join(path_parts[:-1]), path_parts[-1]
module = importlib.import_module(module_name)
class_ = getattr(module, class_name)
Hmm, wonder if we should reduce duplication between this and file IO loading.
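For context, a minimal sketch of the kind of shared dynamic-import helper the two code paths could be factored into (the helper name `_import_class` is hypothetical, not from this PR):

```python
import importlib


def _import_class(path: str) -> type:
    # "my_module.MyClass" -> the MyClass type; both FileIO loading and
    # load_location_provider could delegate to a helper like this.
    module_name, _, class_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# e.g. _import_class("collections.OrderedDict") returns the OrderedDict type
```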
pyiceberg/io/pyarrow.py
Outdated
@@ -2622,13 +2631,15 @@ def _dataframe_to_data_files(
        property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
        default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT,
    )
    location_provider = load_location_provider(table_location=table_metadata.location, table_properties=table_metadata.properties)
Don't love this. I wanted to do something like this and cache on at least the Transaction
(which this method is exclusively invoked by) but the problem I think is that properties can change on the Transaction
, potentially changing the location provider to be used. I suppose we can update that provider on a property change (or maybe any metadata change) but unsure if this complexity is even worth it.
That's an interesting edge case. It seems like an anti-pattern to change a table property and write in the same transaction, although it's currently allowed.
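To make the edge case concrete, a hedged sketch of the anti-pattern (assuming a `table` and a `df` are in scope, and using the Transaction API's set_properties and append):

```python
# Mutating a property and writing within the same transaction: currently
# allowed, but which location provider should the append use - the one
# implied by the properties before, or after, the change?
with table.transaction() as tx:
    tx.set_properties({"write.object-storage.enabled": "true"})
    tx.append(df)
```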
3555932 (fyi the Java tests don't have one)
pyiceberg/table/locations.py
Outdated
from pyiceberg.utils.properties import property_as_bool


class DefaultLocationProvider(LocationProvider):
The biggest difference vs the Java implementation is that I've not supported write.data.path here. I think it's natural for write.metadata.path to be supported alongside it, which would make this a larger and arguably location-provider-independent change. Can look into it as a follow-up.
Thanks! Would be great to have write.data.path and write.metadata.path.
Opened an issue on supporting write.data.path and write.metadata.path: #1492
Sorry guys, didn't notice this thread until now.
pyiceberg/table/__init__.py
Outdated
@@ -192,6 +195,14 @@ class TableProperties:
    WRITE_PARTITION_SUMMARY_LIMIT = "write.summary.partition-limit"
    WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT = 0

    WRITE_LOCATION_PROVIDER_IMPL = "write.location-provider.impl"
Though the docs say that the default is null, having a constant for this being None felt unnecessary.
        return (
            f"{prefix}/{hashed_path}/{data_file_name}"
            if self._include_partition_paths
            else f"{prefix}/{hashed_path}-{data_file_name}"
        )
Interesting that disabling include_partition_paths affects paths of non-partitioned data files. I've matched Java behaviour here but it does feel odd.
This is an interesting case - do we have a test to show this behavior explicitly? I think it'll be valuable to refer to it at a later time.
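A hypothetical unit test for this behaviour might look like the following (test name is made up; the constructor arguments are assumed from this PR's LocationProvider interface):

```python
from pyiceberg.table.locations import ObjectStoreLocationProvider


def test_partitioned_paths_disabled_affects_unpartitioned_files() -> None:
    provider = ObjectStoreLocationProvider(
        table_location="table_location",
        table_properties={"write.object-storage.partitioned-paths": "false"},
    )
    # No partition key is passed, yet the property still changes the
    # layout: the entropy hash is joined to the file name with "-"
    # instead of "/".
    location = provider.new_data_location("file.parquet")
    assert location.endswith("-file.parquet")
```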
        TableProperties.WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT,
    )

    def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
Tried to make this as consistent with its Java counterpart as possible so file locations are consistent too. This means hashing on both the partition key and the data file name below, and using the same hash function.
Seemed reasonable to port over the object storage stuff in this PR, given that the original issue #861 mentions it.
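For readers unfamiliar with the Java scheme, here's a minimal sketch of the hashing being ported (assuming the mmh3 Murmur3 package; the helper name is illustrative, the constants mirror the ones in this PR's diff):

```python
import mmh3  # Murmur3, the same hash family the Java implementation uses

HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3


def hash_to_entropy_path(path: str) -> str:
    # Keep the low 20 bits of the hash as a zero-padded binary string.
    hash_value = mmh3.hash(path) & ((1 << HASH_BINARY_STRING_BITS) - 1)
    binary = format(hash_value, f"0{HASH_BINARY_STRING_BITS}b")
    # The first 12 bits become 3 directories of 4 bits each; the
    # remaining 8 bits form the final segment, e.g. "0101/1010/0011/10011010".
    split = ENTROPY_DIR_LENGTH * ENTROPY_DIR_DEPTH
    dirs = [binary[i : i + ENTROPY_DIR_LENGTH] for i in range(0, split, ENTROPY_DIR_LENGTH)]
    return "/".join(dirs) + "/" + binary[split:]


# Hashing partition path + file name together keeps locations consistent
# with Java for the same inputs.
print(hash_to_entropy_path("part=0/file.parquet"))
```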
Since Iceberg is mainly focused on object stores, I'm leaning towards making the ObjectStorageLocationProvider the default. Java is a great source of inspiration, but it also carries a lot of historical decisions that are not easy to change, so we should reconsider this for PyIceberg.
Thanks for this great suggestion and context! I agree:
- I made this the default. The MANIFEST_MERGE_ENABLED_DEFAULT property already differs from Java and the docs, which reassures me. I did still add a short comment beside OBJECT_STORE_ENABLED_DEFAULT to indicate that it differs.
- I renamed DefaultLocationProvider to SimpleLocationProvider because it's no longer the default.
^ cc @kevinjqliu, how does this sound to you? I realise the concerns you raised about things silently working differently between Java and PyIceberg seem a little contradictory to the above (but I think it's fine).
Also, I've not yet changed WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT to False (Java/docs have true), even though that's more aligned with object storage - from the docs:
We have also added a new table property write.object-storage.partitioned-paths that if set to false(default=true), this will omit the partition values from the file path. Iceberg does not need these values in the file path and setting this value to false can further reduce the key size.
I'm very open to being swayed / discussing this. After reading through apache/iceberg#11112, it seems there was a strong case for still supporting partition values in paths, though I haven't been able to flesh it out fully. Perhaps it's backwards compatibility, for folks that inspect storage to see how their files are actually laid out; it does group them together nicely.
I'd be happy to change the default if there's reason for it. The readability of file paths will arguably decrease anyway with these hashes, so the above might be a non-issue.
While I'm in favor of making ObjectStorageLocationProvider the default for pyiceberg, I'd prefer to do so in a follow-up PR.
I like having this PR solely to implement the concept of LocationProvider and the ObjectStorageProvider.
While im in favor of making ObjectStorageLocationProvider the default for pyiceberg, i'd prefer to do so in a follow-up PR.
I like having this PR solely to implement the concept of LocationProvider and the ObjectStorageProvider
Makes sense! We can have the discussion regarding defaults there. I'd like to keep the SimpleLocationProvider naming change (from Default) here though, and discuss which provider should be the default in the next PR.
SGTM! 🚀
tests/table/test_locations.py
Outdated
# Field name is not encoded but partition value is - this differs from the Java implementation
# https://github.com/apache/iceberg/blob/cdf748e8e5537f13d861aa4c617a51f3e11dc97c/core/src/test/java/org/apache/iceberg/TestLocationProvider.java#L304
assert partition_segment == "part#field=example%23val"
Put up #1457 - I'll remove this special-character testing (that the Java test counterpart does) here because it'll be tested in that PR.
return f"custom_location_provider/{data_file_name}" | ||
|
||
|
||
def test_default_location_provider() -> None: |
The tests in this file are inspired by https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestLocationProvider.java.
The hash functions are the same so those constants are unchanged.
@Fokko, think this is ready for review now! I've implemented this for write codepaths.
Thanks for the PR! Generally LGTM, I left a few nit comments.
This matches the behavior of the Java implementation. However, if we're reusing the same property (write.location-provider.impl), then there's a conflict when loading in both Java and Python. I wonder if we should add a Python-specific property; otherwise the location provider will only work in one of the implementations and might error in the other.
pyiceberg/table/locations.py
Outdated
HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3
nit: move these into ObjectStoreLocationProvider
Makes sense, especially given the file has now grown. It's pretty unreadable to prefix all the constants here with ObjectStoreLocationProvider though - I'll think about this.
We had issues dealing with constants in the file itself: https://github.com/apache/iceberg-python/pull/1217/files#diff-942c2f54eac4f30f1a1e2fa18b719e17cc1cb03ad32908a402c4ba3abe9eca63L37-L38
If it's only used in ObjectStoreLocationProvider, I think it's better to be in the class. But also, this is a nit comment :P
I fully agree that it should be within the class - will find a way to do it readably 👍
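One readable way to do this, sketched without the base class for brevity (the method name follows the PR's diff; the body is illustrative):

```python
class ObjectStoreLocationProvider:
    # Scoping the constants to the class avoids module-level collisions
    # like the one linked above, while methods can still reference them
    # via `self` without long prefixes at each use site.
    HASH_BINARY_STRING_BITS = 20
    ENTROPY_DIR_LENGTH = 4
    ENTROPY_DIR_DEPTH = 3

    def _dirs_from_hash(self, binary: str) -> str:
        # Split the binary hash string into entropy directories plus a
        # remainder segment.
        split = self.ENTROPY_DIR_LENGTH * self.ENTROPY_DIR_DEPTH
        dirs = [binary[i : i + self.ENTROPY_DIR_LENGTH] for i in range(0, split, self.ENTROPY_DIR_LENGTH)]
        return "/".join(dirs) + "/" + binary[split:]
```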
Generally LGTM, added a few nit comments
pyiceberg/io/pyarrow.py
Outdated
-def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteTask]) -> Iterator[DataFile]:
+def write_file(
+    io: FileIO, location_provider: LocationProvider, table_metadata: TableMetadata, tasks: Iterator[WriteTask]
+) -> Iterator[DataFile]:
     from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE, TableProperties
We might want location_provider: LocationProvider last for backwards compatibility.
WDYT about leaving the signature as before and doing load_location_provider at the start of this function (above parquet_writer_kwargs = _get_parquet_writer_kwargs(table_metadata.properties)) instead of in _dataframe_to_data_files?
That would mean we need to run load_location_provider per data file, which can potentially get expensive.
I don't think so? At the start of the function means not in write_parquet - the location_provider loaded would just be used within that, similar to parquet_writer_kwargs.
Ah, makes sense - write_parquet is called once per _dataframe_to_data_files.
We can do that to preserve backwards compatibility.
Sounds good! (Typo correction: write_file above 😄)
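The approach agreed above, sketched (import paths assume this PR's layout; the body is abridged, not the actual implementation):

```python
from typing import Iterator

from pyiceberg.io import FileIO
from pyiceberg.manifest import DataFile
from pyiceberg.table import WriteTask
from pyiceberg.table.locations import load_location_provider
from pyiceberg.table.metadata import TableMetadata


def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteTask]) -> Iterator[DataFile]:
    # Original signature preserved for backwards compatibility. The
    # provider is loaded once per write_file call - not once per data
    # file - since write_file is invoked a single time by
    # _dataframe_to_data_files.
    location_provider = load_location_provider(
        table_location=table_metadata.location,
        table_properties=table_metadata.properties,
    )
    ...  # write each task, naming files via location_provider.new_data_location(...)
```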
tbl = _create_table(
    session_catalog=session_catalog,
    identifier=f"default.arrow_table_v{format_version}_with_null_partitioned_on_col_{part_col}",
    properties={"format-version": str(format_version), "write.object-storage.enabled": True},
)
nit: use the constant
Also mention in the test that write.object-storage.partitioned-paths defaults to True.
Now that it's the default, I removed it here (and added a comment for both defaults), so there's one integration test that checks not specifying it. I used the constants in all other places in the integration tests.
tests/table/test_locations.py
Outdated
assert len(parts) == 7
assert parts[0] == "table_location"
assert parts[1] == "data"
# Entropy directories in the middle
Since this test is called test_object_storage_injects_entropy, should we test the entropy part? Similar to:
# Entropy binary directories should have been injected
for dir_name in parts[6:10]:
    assert dir_name
    assert all(c in "01" for c in dir_name)
This test was inspired by https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestLocationProvider.java#L275. FYI, my reading of this was: it tests that there's some stuff in the middle. The later test_hash_injection / testHashInjection (here vs Java) tests that the hashes themselves are correct.
(To me, it made sense for the integration test to have the balance of checking both entropy and that they're binary-hashed, but not the hash itself, because that feels unit-test-y.)
I think it's fair for this test to check that it's binary too. That way, if e.g. the wrong hash method is used, this unit test still passes (the provider does indeed inject entropy) while the hash-injection test fails (since the hash itself is wrong).
This sounds good to me, thanks!
Co-authored-by: Kevin Liu <[email protected]>
Great point! I've made the change to WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl" (happy to take suggestions) - inspired by io-impl → py-io-impl.
pyiceberg/table/__init__.py
Outdated
    WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl"

    OBJECT_STORE_ENABLED = "write.object-storage.enabled"
    OBJECT_STORE_ENABLED_DEFAULT = True  # Differs from Java + docs
(See discussion #1452 (comment))
Generally LGTM, thanks for following up.
I think we'd also want to add docs around this feature! Maybe similar to FileIO, we can add a new section about LocationProvider.
Great point! I've made the change to WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl" (happy to take suggestions) - inspired by io-impl → py-io-impl.
Bringing this comment up, I want to see what others think of this pattern.
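For illustration, a hypothetical way a user would opt into a custom provider via this Python-specific property (the catalog name, schema, and my_module.MyLocationProvider are assumed, not from this PR):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
catalog.create_table(
    identifier="default.events",
    schema=schema,  # assumed pre-defined schema
    properties={
        # The py- prefix keeps this key Python-specific, so a Java writer
        # reading write.location-provider.impl is unaffected and neither
        # implementation errors on the other's provider class.
        "write.py-location-provider.impl": "my_module.MyLocationProvider",
    },
)
```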
@kevinjqliu good point. Can we do this in a separate PR? Preferably after the defaults discussion, so we can then say something similar to FileIO's "By default, PyIceberg will...", maybe. (BTW, all comments have now been addressed)
LGTM!
I would like to also include documentation about the LocationProvider, but we can do that as a follow-up.
I think we should document:
- LocationProvider
- SimpleLocationProvider
- ObjectStoreLocationProvider
- Loading a custom LocationProvider
And the new table properties:
- WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl"
- OBJECT_STORE_ENABLED = "write.object-storage.enabled"
- OBJECT_STORE_ENABLED_DEFAULT = False
- WRITE_OBJECT_STORE_PARTITIONED_PATHS = "write.object-storage.partitioned-paths"
- WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT = True
Thanks @smaheshwar-pltr for working on this and @Fokko for the review :)
Closes #861.
As the issue suggests, introduces a LocationProvider interface with the default and object-store-optimised implementations (the latter can be enabled via the newly-introduced table properties). This is pluggable, just like FileIO. Largely inspired by and consistent with the Java implementation.
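To sketch that pluggability (the class name and date prefix below are hypothetical; the base class and hook follow this PR):

```python
from typing import Optional

from pyiceberg.partitioning import PartitionKey
from pyiceberg.table.locations import LocationProvider


class DatedLocationProvider(LocationProvider):
    """Hypothetical custom provider nesting data files under a fixed date prefix."""

    def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
        # self.table_location and self.table_properties are populated by
        # the LocationProvider base class.
        return f"{self.table_location}/data/2025-01/{data_file_name}"
```

Such a provider would then be enabled by pointing write.py-location-provider.impl at its fully-qualified class path.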