
Support Location Providers #1452

Merged

kevinjqliu merged 25 commits into apache:main from location-providers on Jan 10, 2025
Conversation

smaheshwar-pltr
Contributor

@smaheshwar-pltr smaheshwar-pltr commented Dec 20, 2024

Closes #861.

As the issue suggests, this introduces a LocationProvider interface with default and object-store-optimised implementations (the latter can be enabled via newly-introduced table properties). It is pluggable, just like FileIO.

Largely inspired by and consistent with the Java implementation.

@smaheshwar-pltr smaheshwar-pltr changed the title from WIP: Support LocationProviders to WIP: Support Location Providers Dec 20, 2024
Comment on lines 1671 to 1673
module_name, class_name = ".".join(path_parts[:-1]), path_parts[-1]
module = importlib.import_module(module_name)
class_ = getattr(module, class_name)
Contributor Author

Hmm, wonder if we should reduce duplication between this and file IO loading.
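A minimal sketch of the kind of shared helper this could become (the _import_class name and its placement are hypothetical); both FileIO loading and location-provider loading could then delegate to it:

import importlib
from typing import Any

def _import_class(path: str) -> Any:
    # Resolve a dotted path like "my_module.MyLocationProvider" to the class object.
    module_name, _, class_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)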

@@ -2622,13 +2631,15 @@ def _dataframe_to_data_files(
property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT,
)
location_provider = load_location_provider(table_location=table_metadata.location, table_properties=table_metadata.properties)
Contributor Author

Don't love this. I wanted to do something like this and cache on at least the Transaction (which this method is exclusively invoked by), but the problem, I think, is that properties can change on the Transaction, potentially changing the location provider to be used. I suppose we could update that provider on a property change (or maybe any metadata change), but I'm unsure whether this complexity is even worth it.

Contributor

That's an interesting edge case. It seems like an anti-pattern to change a table property and write in the same transaction, although it's currently allowed.
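For concreteness, a hypothetical sketch of the caching idea mooted above, assuming the load_location_provider factory from this PR (import path assumed); the cache is dropped on a property change, which is exactly the complexity being questioned:

from pyiceberg.table.locations import load_location_provider  # assumed import path

class _TransactionWithCachedProvider:
    def __init__(self, table_metadata):
        self._table_metadata = table_metadata
        self._location_provider = None

    @property
    def location_provider(self):
        # Load lazily and reuse across writes within this transaction.
        if self._location_provider is None:
            self._location_provider = load_location_provider(
                table_location=self._table_metadata.location,
                table_properties=self._table_metadata.properties,
            )
        return self._location_provider

    def set_properties(self, updates: dict) -> None:
        self._table_metadata.properties.update(updates)
        self._location_provider = None  # a property change may select a different provider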

Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 10, 2025

3555932 (fyi the Java tests don't have one)

@smaheshwar-pltr smaheshwar-pltr changed the title from WIP: Support Location Providers to Support Location Providers Dec 20, 2024
from pyiceberg.utils.properties import property_as_bool


class DefaultLocationProvider(LocationProvider):
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

The biggest difference vs. the Java implementation is that I've not supported write.data.path here. I think it's natural for write.metadata.path to be supported alongside it, so this would be a larger and arguably location-provider-independent change. Can look into it as a follow-up.

Contributor

Thanks! Would be great to have write.data.path and write.metadata.path.

Contributor

Opened an issue on supporting write.data.path and write.metadata.path: #1492

Contributor

Sorry guys, didn't notice this thread until now.

@@ -192,6 +195,14 @@ class TableProperties:
WRITE_PARTITION_SUMMARY_LIMIT = "write.summary.partition-limit"
WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT = 0

WRITE_LOCATION_PROVIDER_IMPL = "write.location-provider.impl"
Contributor Author

Though the docs say that the default is null, having a constant for this being None felt unnecessary.

return (
f"{prefix}/{hashed_path}/{data_file_name}"
if self._include_partition_paths
else f"{prefix}/{hashed_path}-{data_file_name}"
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Interesting that disabling include_partition_paths affects paths of non-partitioned data files. I've matched Java behaviour here but it does feel odd.

Contributor

This is an interesting case. Do we have a test to show this behavior explicitly? I think it'll be valuable to refer to it at a later time.

TableProperties.WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT,
)

def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Tried to make this as consistent as possible with its Java counterpart so file locations are consistent too. This means hashing both the partition key and the data file name below, and using the same hash function.

Seemed reasonable to port over the object storage stuff in this PR, given that the original issue #861 mentions it.
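For reference, a sketch of that scheme under the PR's constants (20 hash bits, three 4-bit entropy directories, remainder appended to the last one), using the mmh3 Murmur3 binding to match Java's hash function; entropy_dirs is an illustrative name, not the PR's API:

import mmh3

HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3

def entropy_dirs(path: str) -> str:
    # Hash the partition path + file name and keep the low 20 bits as a binary string.
    hash_code = mmh3.hash(path) & ((1 << HASH_BINARY_STRING_BITS) - 1)
    binary = format(hash_code, f"0{HASH_BINARY_STRING_BITS}b")
    # The first 12 bits become three 4-bit directories; the rest is appended, as in Java.
    dirs = [
        binary[i : i + ENTROPY_DIR_LENGTH]
        for i in range(0, ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH, ENTROPY_DIR_LENGTH)
    ]
    return "/".join(dirs) + binary[ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH :]

print(entropy_dirs("part_col=1/file.parquet"))  # e.g. 0110/1010/001101101011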

Contributor

Since Iceberg is mainly focused on object stores, I'm leaning towards making the ObjectStorageLocationProvider the default. Java is a great source of inspiration, but it also holds a lot of historical decisions that are not easy to change, so we should reconsider this in PyIceberg.

Contributor Author

Thanks for this great suggestion and context! I agree:

  • I made this the default. The MANIFEST_MERGE_ENABLED_DEFAULT property already differs from Java and the docs, which reassures me. I did still add a short comment beside OBJECT_STORE_ENABLED_DEFAULT to indicate that it differs.
  • I renamed DefaultLocationProvider to SimpleLocationProvider because it's no longer the default.

Contributor Author

^ cc @kevinjqliu, how does this sound to you? I realise the concerns you raised re: things silently working differently between Java and PyIceberg seem a little at odds with the above (but I think it's fine).

Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 9, 2025

Also, I've not yet changed WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT to False (Java/docs have true) even though that's more aligned with object storage - from the docs:

We have also added a new table property write.object-storage.partitioned-paths that if set to false(default=true), this will omit the partition values from the file path. Iceberg does not need these values in the file path and setting this value to false can further reduce the key size.

I'm very open to being swayed / discussing this. After reading through apache/iceberg#11112, it seems there was a strong case for still supporting partition values in paths, though I haven't been able to flesh it out fully. Perhaps it's backwards compatibility, for folks who inspect storage to see how their files are actually laid out; it does group them together nicely.

I'd be happy to change the default if there's reason for it. The readability of file paths will arguably decrease anyway with these hashes, so the above might be a non-issue.
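For illustration, with a hypothetical table location and hash value, the two layouts under discussion look like this (the true case keeps the human-readable partition segment after the entropy directories; the false case fuses the hash and file name):

s3://bucket/db/table/data/0101/0110/100110110010/part_col=1/datafile.parquet   # partitioned-paths=true
s3://bucket/db/table/data/0101/0110/100110110010-datafile.parquet              # partitioned-paths=false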

Contributor

While I'm in favor of making ObjectStorageLocationProvider the default for PyIceberg, I'd prefer to do so in a follow-up PR.
I like having this PR solely implement the concept of LocationProvider and the ObjectStorageProvider.

Contributor Author

While I'm in favor of making ObjectStorageLocationProvider the default for PyIceberg, I'd prefer to do so in a follow-up PR.
I like having this PR solely implement the concept of LocationProvider and the ObjectStorageProvider.

Makes sense! We can have the discussion regarding defaults there. I'd like to keep the SimpleLocationProvider naming change from Default here, though, and discuss which provider should be the default in the next PR.

Contributor

SGTM! 🚀

@smaheshwar-pltr smaheshwar-pltr marked this pull request as ready for review December 20, 2024 14:09
Comment on lines 98 to 100
# Field name is not encoded but partition value is - this differs from the Java implementation
# https://github.com/apache/iceberg/blob/cdf748e8e5537f13d861aa4c617a51f3e11dc97c/core/src/test/java/org/apache/iceberg/TestLocationProvider.java#L304
assert partition_segment == "part#field=example%23val"
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Put up #1457 - I'll remove this special-character testing (that the Java test counterpart does) here because it'll be tested in that PR.

return f"custom_location_provider/{data_file_name}"


def test_default_location_provider() -> None:
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

The tests in this file are inspired by https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestLocationProvider.java.

The hash functions are the same so those constants are unchanged.

@smaheshwar-pltr
Contributor Author

@Fokko, think this is ready for review now!

I've implemented this for write codepaths - add_files seems like it should just add the files specified without transforming locations.

pyiceberg/table/__init__.py (outdated thread, resolved)
Contributor

@kevinjqliu kevinjqliu left a comment

Thanks for the PR! Generally LGTM, I left a few nit comments.

This matches the behavior of the Java implementation. However, if we're reusing the same property (write.location-provider.impl), then there's a conflict when loading in both Java and Python. I wonder if we should add a Python-specific property; otherwise, the location provider will only work in one of the implementations and might error in the other.

Comment on lines 36 to 38
HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3
Contributor

nit: move these into ObjectStoreLocationProvider

Contributor Author

Makes sense, especially given the file has now grown. It's pretty unreadable to prefix all the constants here with ObjectStoreLocationProvider though - I'll think about this.

Contributor

We had issues dealing with constants in the file itself: https://github.com/apache/iceberg-python/pull/1217/files#diff-942c2f54eac4f30f1a1e2fa18b719e17cc1cb03ad32908a402c4ba3abe9eca63L37-L38

If it's only used in ObjectStoreLocationProvider, I think it's better to be in the class.

But also, this is a nit comment :P

Contributor Author

I fully agree that it should be within the class - will find a way to do it readably 👍

@kevinjqliu kevinjqliu self-requested a review January 2, 2025 20:32
Contributor

@kevinjqliu kevinjqliu left a comment

Generally LGTM, added a few nit comments

-def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteTask]) -> Iterator[DataFile]:
+def write_file(
+    io: FileIO, location_provider: LocationProvider, table_metadata: TableMetadata, tasks: Iterator[WriteTask]
+) -> Iterator[DataFile]:
     from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE, TableProperties
Contributor

We might want location_provider: LocationProvider last for backwards compatibility.

Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 9, 2025

WDYT about leaving the signature as before and doing load_location_provider at the start of this function (above parquet_writer_kwargs = _get_parquet_writer_kwargs(table_metadata.properties)) instead of in _dataframe_to_data_files?

Contributor

That would mean we need to run load_location_provider per data file, which can potentially get expensive.

Contributor Author

I don't think so? At the start of the function means not in write_parquet - the location_provider loaded would just be used within that, similar to parquet_writer_kwargs.

Contributor

Ah, makes sense: write_parquet is called once per _dataframe_to_data_files.

We can do that to preserve backwards compatibility.

Contributor Author

Sounds good! (typo correction: write_file above 😄)
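A sketch of the agreed shape, keeping the pre-PR signature and loading the provider once at the top of write_file (load_location_provider and _get_parquet_writer_kwargs are as referenced in this thread; the body is elided):

def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteTask]) -> Iterator[DataFile]:
    # Loaded once per write_file call (i.e. once per _dataframe_to_data_files),
    # not once per data file, and the public signature stays backwards-compatible.
    location_provider = load_location_provider(
        table_location=table_metadata.location,
        table_properties=table_metadata.properties,
    )
    parquet_writer_kwargs = _get_parquet_writer_kwargs(table_metadata.properties)
    ...  # write tasks, using location_provider.new_data_location(...) for file paths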

tests/integration/test_writes/test_partitioned_writes.py (outdated thread, resolved)
tbl = _create_table(
session_catalog=session_catalog,
identifier=f"default.arrow_table_v{format_version}_with_null_partitioned_on_col_{part_col}",
properties={"format-version": str(format_version), "write.object-storage.enabled": True},
Contributor

nit: use the constant

Contributor

Also mention in the test that write.object-storage.partitioned-paths defaults to True.

Contributor Author

Now that it's the default, I removed it here (and added a comment for both defaults), so there's one integration test that checks not specifying it. I used the constants in all other places in the integration tests.

tests/integration/test_writes/test_writes.py (thread resolved)
assert len(parts) == 7
assert parts[0] == "table_location"
assert parts[1] == "data"
# Entropy directories in the middle
Contributor

Since this test is called test_object_storage_injects_entropy, should we test the entropy part? Similar to:


        # Entropy binary directories should have been injected
        for dir_name in parts[6:10]:
            assert dir_name
            assert all(c in "01" for c in dir_name)

Contributor Author

This test was inspired by https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestLocationProvider.java#L275. FYI, my reading of this was: it tests that there's some stuff in the middle. The later test_hash_injection / testHashInjection (here vs. Java) tests that the hashes themselves are correct.

(To me, it made sense for the integration test to have the balance of checking both entropy and that they're binary-hashed, but not the hash itself because that feels unit-test-y)

Contributor Author

I think it's fair for this test to check that it's binary too. That way, if e.g. the wrong hash method is used, this test still passes (the provider does indeed inject entropy) but the hash-injection unit test fails (the hashes themselves are wrong).

This sounds good to me, thanks!
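A sketch of the combined assertions, assuming a 7-part path of table_location/data/<three entropy dirs>/<partition>/<file>, so the entropy directories sit at parts[2:5]:

assert len(parts) == 7
assert parts[0] == "table_location"
assert parts[1] == "data"
# Entropy directories should be non-empty and binary-hashed
for dir_name in parts[2:5]:
    assert dir_name
    assert all(c in "01" for c in dir_name)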

@smaheshwar-pltr
Contributor Author

This matches the behavior of the Java implementation. However, if we're reusing the same property (write.location-provider.impl), then there's a conflict when loading in both Java and Python. I wonder if we should add a Python-specific property; otherwise, the location provider will only work in one of the implementations and might error in the other.

Great point! I've made the change to WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl" (happy to take suggestions), inspired by io-impl → py-io-impl.
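A hypothetical end-to-end usage of the new property (module and class names invented; import paths assumed from this PR), mirroring how py-io-impl selects a FileIO implementation:

from typing import Optional

from pyiceberg.partitioning import PartitionKey
from pyiceberg.table.locations import LocationProvider

class MyLocationProvider(LocationProvider):
    def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
        # Place all data files under a flat custom prefix.
        return f"{self.table_location}/my-data/{data_file_name}"

# Selected per table via:
#   properties={"write.py-location-provider.impl": "my_module.MyLocationProvider"}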

WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl"

OBJECT_STORE_ENABLED = "write.object-storage.enabled"
OBJECT_STORE_ENABLED_DEFAULT = True # Differs from Java + docs
Contributor Author

(See discussion #1452 (comment))

Contributor

@kevinjqliu kevinjqliu left a comment

Generally LGTM, thanks for following up.
I think we'd also want to add docs around this feature! Maybe, similar to FileIO, we can add a new section about LocationProvider.

Great point! I've made the change to WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl" (happy to take suggestions), inspired by io-impl → py-io-impl.

Bringing this comment up - I want to see what others think of this pattern.

mkdocs/docs/api.md (outdated thread, resolved)
@smaheshwar-pltr
Contributor Author

smaheshwar-pltr commented Jan 10, 2025

I think we'd also want to add docs around this feature! Maybe similar to FileIO, we can add a new section about LocationProvider

@kevinjqliu good point. Can we do this in a separate PR? Preferably after the defaults discussion, so that we can say something similar to FileIO's "By default, PyIceberg will...", maybe.

(BTW, all comments have now been addressed)

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM!

I would also like to include documentation about the LocationProvider, but we can do that as a follow-up.

I think we should document:

  • LocationProvider
      • SimpleLocationProvider
      • ObjectStoreLocationProvider
  • Loading a Custom LocationProvider

And new table properties:

WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl"

OBJECT_STORE_ENABLED = "write.object-storage.enabled"
OBJECT_STORE_ENABLED_DEFAULT = False

WRITE_OBJECT_STORE_PARTITIONED_PATHS = "write.object-storage.partitioned-paths"
WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT = True

@kevinjqliu kevinjqliu merged commit c68b9b1 into apache:main Jan 10, 2025
7 checks passed
@kevinjqliu
Contributor

Thanks @smaheshwar-pltr for working on this and @Fokko for the review :)

@smaheshwar-pltr smaheshwar-pltr deleted the location-providers branch January 10, 2025 22:54
@smaheshwar-pltr smaheshwar-pltr restored the location-providers branch January 11, 2025 15:43
@smaheshwar-pltr smaheshwar-pltr deleted the location-providers branch January 11, 2025 15:48
Successfully merging this pull request may close these issues:

Support LocationProviders like the Java Iceberg Reference Implementation (#861)