
Support Location Providers #1452

Merged

kevinjqliu merged 25 commits into apache:main from location-providers on Jan 10, 2025
Conversation

smaheshwar-pltr
Contributor

@smaheshwar-pltr smaheshwar-pltr commented Dec 20, 2024

Closes #861.

As the issue suggests, this introduces a LocationProvider interface with default and object-store-optimised implementations (the latter can be enabled via newly-introduced table properties). It is pluggable, just like FileIO.

Largely inspired by and consistent with the Java implementation.

@smaheshwar-pltr smaheshwar-pltr changed the title from WIP: Support LocationProviders to WIP: Support Location Providers Dec 20, 2024
Comment on lines 1671 to 1673
module_name, class_name = ".".join(path_parts[:-1]), path_parts[-1]
module = importlib.import_module(module_name)
class_ = getattr(module, class_name)
Contributor Author

Hmm, wonder if we should reduce duplication between this and file IO loading.
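A minimal sketch of the kind of shared helper this could become (the _import_class name and its placement are hypothetical); both FileIO loading and location-provider loading could then delegate to it:

import importlib
from typing import Any

def _import_class(path: str) -> Any:
    # Resolve a dotted path like "my_module.MyLocationProvider" to the class object.
    module_name, _, class_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)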

@@ -2622,13 +2631,15 @@ def _dataframe_to_data_files(
property_name=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
default=TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT,
)
location_provider = load_location_provider(table_location=table_metadata.location, table_properties=table_metadata.properties)
Contributor Author

Don't love this. I wanted to do something like this and cache on at least the Transaction (which this method is exclusively invoked by), but the problem, I think, is that properties can change on the Transaction, potentially changing the location provider to be used. I suppose we could update that provider on a property change (or maybe any metadata change), but I'm unsure whether this complexity is even worth it.

Contributor

That's an interesting edge case. It seems like an anti-pattern to change a table property and write in the same transaction, although it's currently allowed.
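For concreteness, a hypothetical sketch of the caching idea mooted above, assuming the load_location_provider factory from this PR (import path assumed); the cache is dropped on a property change, which is exactly the complexity being questioned:

from pyiceberg.table.locations import load_location_provider  # assumed import path

class _TransactionWithCachedProvider:
    def __init__(self, table_metadata):
        self._table_metadata = table_metadata
        self._location_provider = None

    @property
    def location_provider(self):
        # Load lazily and reuse across writes within this transaction.
        if self._location_provider is None:
            self._location_provider = load_location_provider(
                table_location=self._table_metadata.location,
                table_properties=self._table_metadata.properties,
            )
        return self._location_provider

    def set_properties(self, updates: dict) -> None:
        self._table_metadata.properties.update(updates)
        self._location_provider = None  # a property change may select a different provider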

Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 10, 2025

3555932 (fyi the Java tests don't have one)

@smaheshwar-pltr smaheshwar-pltr changed the title from WIP: Support Location Providers to Support Location Providers Dec 20, 2024
from pyiceberg.utils.properties import property_as_bool


class DefaultLocationProvider(LocationProvider):
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

The biggest difference vs. the Java implementation is that I've not supported write.data.path here. I think it's natural for write.metadata.path to be supported alongside it, so this would be a larger and arguably location-provider-independent change. Can look into it as a follow-up.

Contributor

Thanks! Would be great to have write.data.path and write.metadata.path.

Contributor

Opened an issue on supporting write.data.path and write.metadata.path: #1492

Contributor

Sorry guys, didn't notice this thread until now.

@@ -192,6 +195,14 @@ class TableProperties:
WRITE_PARTITION_SUMMARY_LIMIT = "write.summary.partition-limit"
WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT = 0

WRITE_LOCATION_PROVIDER_IMPL = "write.location-provider.impl"
Contributor Author

Though the docs say that the default is null, having a constant for this being None felt unnecessary.

return (
f"{prefix}/{hashed_path}/{data_file_name}"
if self._include_partition_paths
else f"{prefix}/{hashed_path}-{data_file_name}"
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Interesting that disabling include_partition_paths affects paths of non-partitioned data files. I've matched Java behaviour here but it does feel odd.

Contributor

This is an interesting case. Do we have a test to show this behavior explicitly? I think it'll be valuable to refer to it at a later time.

TableProperties.WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT,
)

def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Tried to make this as consistent as possible with its Java counterpart so file locations are consistent too. This means hashing both the partition key and the data file name below, and using the same hash function.

Seemed reasonable to port over the object storage stuff in this PR, given that the original issue #861 mentions it.
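For reference, a sketch of that scheme under the PR's constants (20 hash bits, three 4-bit entropy directories, remainder appended to the last one), using the mmh3 Murmur3 binding to match Java's hash function; entropy_dirs is an illustrative name, not the PR's API:

import mmh3

HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3

def entropy_dirs(path: str) -> str:
    # Hash the partition path + file name and keep the low 20 bits as a binary string.
    hash_code = mmh3.hash(path) & ((1 << HASH_BINARY_STRING_BITS) - 1)
    binary = format(hash_code, f"0{HASH_BINARY_STRING_BITS}b")
    # The first 12 bits become three 4-bit directories; the rest is appended, as in Java.
    dirs = [
        binary[i : i + ENTROPY_DIR_LENGTH]
        for i in range(0, ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH, ENTROPY_DIR_LENGTH)
    ]
    return "/".join(dirs) + binary[ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH :]

print(entropy_dirs("part_col=1/file.parquet"))  # e.g. 0110/1010/001101101011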

Contributor

Since Iceberg is mainly focused on object stores, I'm leaning towards making the ObjectStorageLocationProvider the default. Java is a great source of inspiration, but it also holds a lot of historical decisions that are not easy to change, so we should reconsider this in PyIceberg.

Contributor Author

Thanks for this great suggestion and context! I agree:

  • I made this the default. The MANIFEST_MERGE_ENABLED_DEFAULT property already differs from Java and the docs, which reassures me. I did still add a short comment beside OBJECT_STORE_ENABLED_DEFAULT to indicate that it differs.
  • I renamed DefaultLocationProvider to SimpleLocationProvider because it's no longer the default.

Contributor Author

^ cc @kevinjqliu, how does this sound to you? I realise the concerns you raised re: things silently working differently between Java and PyIceberg seem a little at odds with the above (but I think it's fine).

Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 9, 2025

Also, I've not yet changed WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT to False (Java/docs have true) even though that's more aligned with object storage - from the docs:

We have also added a new table property write.object-storage.partitioned-paths that if set to false(default=true), this will omit the partition values from the file path. Iceberg does not need these values in the file path and setting this value to false can further reduce the key size.

I'm very open to being swayed / discussing this. After reading through apache/iceberg#11112, it seems there was a strong case for still supporting partition values in paths, though I haven't been able to flesh it out fully. Perhaps it's backwards compatibility, for folks who inspect storage to see how their files are actually laid out; it does group them together nicely.

I'd be happy to change the default if there's reason for it. The readability of file paths will arguably decrease anyway with these hashes, so the above might be a non-issue.
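For illustration, with a hypothetical table location and hash value, the two layouts under discussion look like this (the true case keeps the human-readable partition segment after the entropy directories; the false case fuses the hash and file name):

s3://bucket/db/table/data/0101/0110/100110110010/part_col=1/datafile.parquet   # partitioned-paths=true
s3://bucket/db/table/data/0101/0110/100110110010-datafile.parquet              # partitioned-paths=false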

Contributor

While I'm in favor of making ObjectStorageLocationProvider the default for PyIceberg, I'd prefer to do so in a follow-up PR.
I like having this PR solely implement the concept of LocationProvider and the ObjectStorageProvider.

Contributor Author

While I'm in favor of making ObjectStorageLocationProvider the default for PyIceberg, I'd prefer to do so in a follow-up PR.
I like having this PR solely implement the concept of LocationProvider and the ObjectStorageProvider.

Makes sense! We can have the discussion regarding defaults there. I'd like to keep the SimpleLocationProvider naming change from Default here, though, and discuss which provider should be the default in the next PR.

Contributor

SGTM! 🚀

@smaheshwar-pltr smaheshwar-pltr marked this pull request as ready for review December 20, 2024 14:09
Comment on lines 98 to 100
# Field name is not encoded but partition value is - this differs from the Java implementation
# https://github.com/apache/iceberg/blob/cdf748e8e5537f13d861aa4c617a51f3e11dc97c/core/src/test/java/org/apache/iceberg/TestLocationProvider.java#L304
assert partition_segment == "part#field=example%23val"
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

Put up #1457 - I'll remove this special-character testing (that the Java test counterpart does) here because it'll be tested in that PR.

return f"custom_location_provider/{data_file_name}"


def test_default_location_provider() -> None:
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Dec 20, 2024

The tests in this file are inspired by https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestLocationProvider.java.

The hash functions are the same so those constants are unchanged.

@smaheshwar-pltr
Contributor Author

@Fokko, think this is ready for review now!

I've implemented this for write codepaths - add_files seems like it should just add the files specified without transforming locations.

pyiceberg/table/__init__.py (outdated thread, resolved)
Contributor

@kevinjqliu kevinjqliu left a comment

Thanks for the PR! Generally LGTM, I left a few nit comments.

This matches the behavior of the Java implementation. However, if we're reusing the same property (write.location-provider.impl), then there's a conflict when loading in both Java and Python. I wonder if we should add a Python-specific property; otherwise, the location provider will only work in one of the implementations and might error in the other.

Comment on lines 36 to 38
HASH_BINARY_STRING_BITS = 20
ENTROPY_DIR_LENGTH = 4
ENTROPY_DIR_DEPTH = 3
Contributor

nit: move these into ObjectStoreLocationProvider

Contributor Author

Makes sense, especially given the file has now grown. It's pretty unreadable to prefix all the constants here with ObjectStoreLocationProvider though - I'll think about this.

Contributor

We had issues dealing with constants in the file itself: https://github.com/apache/iceberg-python/pull/1217/files#diff-942c2f54eac4f30f1a1e2fa18b719e17cc1cb03ad32908a402c4ba3abe9eca63L37-L38

If it's only used in ObjectStoreLocationProvider, I think it's better to be in the class.

But also, this is a nit comment :P

Contributor Author

I fully agree that it should be within the class - will find a way to do it readably 👍

@kevinjqliu kevinjqliu self-requested a review January 2, 2025 20:32
Contributor

@kevinjqliu kevinjqliu left a comment

Generally LGTM, added a few nit comments

-def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteTask]) -> Iterator[DataFile]:
+def write_file(
+    io: FileIO, location_provider: LocationProvider, table_metadata: TableMetadata, tasks: Iterator[WriteTask]
+) -> Iterator[DataFile]:
     from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE, TableProperties
Contributor

We might want location_provider: LocationProvider last for backwards compatibility.

Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 9, 2025

WDYT about leaving the signature as before and doing load_location_provider at the start of this function (above parquet_writer_kwargs = _get_parquet_writer_kwargs(table_metadata.properties)) instead of in _dataframe_to_data_files?

Contributor

That would mean we need to run load_location_provider per data file, which can potentially get expensive.

Contributor Author

I don't think so? At the start of the function means not in write_parquet - the location_provider loaded would just be used within that, similar to parquet_writer_kwargs.

Contributor

Ah, makes sense: write_parquet is called once per _dataframe_to_data_files.

We can do that to preserve backwards compatibility.

Contributor Author

Sounds good! (typo correction: write_file above 😄)
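A sketch of the agreed shape, keeping the pre-PR signature and loading the provider once at the top of write_file (load_location_provider and _get_parquet_writer_kwargs are as referenced in this thread; the body is elided):

def write_file(io: FileIO, table_metadata: TableMetadata, tasks: Iterator[WriteTask]) -> Iterator[DataFile]:
    # Loaded once per write_file call (i.e. once per _dataframe_to_data_files),
    # not once per data file, and the public signature stays backwards-compatible.
    location_provider = load_location_provider(
        table_location=table_metadata.location,
        table_properties=table_metadata.properties,
    )
    parquet_writer_kwargs = _get_parquet_writer_kwargs(table_metadata.properties)
    ...  # write tasks, using location_provider.new_data_location(...) for file paths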

tests/integration/test_writes/test_partitioned_writes.py (outdated thread, resolved)
tbl = _create_table(
session_catalog=session_catalog,
identifier=f"default.arrow_table_v{format_version}_with_null_partitioned_on_col_{part_col}",
properties={"format-version": str(format_version), "write.object-storage.enabled": True},
Contributor

nit: use the constant

Contributor

Also mention in the test that write.object-storage.partitioned-paths defaults to True.

Contributor Author

Now that it's the default, I removed it here (and added a comment for both defaults), so there's one integration test that checks not specifying it. I used the constants in all other places in the integration tests.

tests/integration/test_writes/test_writes.py (thread resolved)
assert len(parts) == 7
assert parts[0] == "table_location"
assert parts[1] == "data"
# Entropy directories in the middle
Contributor

Since this test is called test_object_storage_injects_entropy, should we test the entropy part? Similar to:


        # Entropy binary directories should have been injected
        for dir_name in parts[6:10]:
            assert dir_name
            assert all(c in "01" for c in dir_name)

Contributor Author

This test was inspired by https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/TestLocationProvider.java#L275. FYI, my reading of this was: it tests that there's some stuff in the middle. The later test_hash_injection / testHashInjection (here vs. Java) tests that the hashes themselves are correct.

(To me, it made sense for the integration test to have the balance of checking both entropy and that they're binary-hashed, but not the hash itself because that feels unit-test-y)

Contributor Author

I think it's fair for this test to check that it's binary too. That way, if e.g. the wrong hash method is used, this test still passes (the provider does indeed inject entropy) but the hash-injection unit test fails (the hashes themselves are wrong).

This sounds good to me, thanks!
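A sketch of the combined assertions, assuming a 7-part path of table_location/data/<three entropy dirs>/<partition>/<file>, so the entropy directories sit at parts[2:5]:

assert len(parts) == 7
assert parts[0] == "table_location"
assert parts[1] == "data"
# Entropy directories should be non-empty and binary-hashed
for dir_name in parts[2:5]:
    assert dir_name
    assert all(c in "01" for c in dir_name)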

@smaheshwar-pltr
Contributor Author

This matches the behavior of the Java implementation. However, if we're reusing the same property (write.location-provider.impl), then there's a conflict when loading in both Java and Python. I wonder if we should add a Python-specific property; otherwise, the location provider will only work in one of the implementations and might error in the other.

Great point! I've made the change to WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl" (happy to take suggestions), inspired by io-impl → py-io-impl.
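A hypothetical end-to-end usage of the new property (module and class names invented; import paths assumed from this PR), mirroring how py-io-impl selects a FileIO implementation:

from typing import Optional

from pyiceberg.partitioning import PartitionKey
from pyiceberg.table.locations import LocationProvider

class MyLocationProvider(LocationProvider):
    def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
        # Place all data files under a flat custom prefix.
        return f"{self.table_location}/my-data/{data_file_name}"

# Selected per table via:
#   properties={"write.py-location-provider.impl": "my_module.MyLocationProvider"}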

WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl"

OBJECT_STORE_ENABLED = "write.object-storage.enabled"
OBJECT_STORE_ENABLED_DEFAULT = True # Differs from Java + docs
Contributor Author

(See discussion #1452 (comment))

Contributor

@kevinjqliu kevinjqliu left a comment

Generally LGTM, thanks for following up.
I think we'd also want to add docs around this feature! Maybe, similar to FileIO, we can add a new section about LocationProvider.

Great point! I've made the change to WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl" (happy to take suggestions), inspired by io-impl → py-io-impl.

Bringing this comment up - I want to see what others think of this pattern.

mkdocs/docs/api.md (outdated thread, resolved)
@smaheshwar-pltr
Contributor Author

smaheshwar-pltr commented Jan 10, 2025

I think we'd also want to add docs around this feature! Maybe similar to FileIO, we can add a new section about LocationProvider

@kevinjqliu good point. Can we do this in a separate PR? Preferably after the defaults discussion, so that we can say something similar to FileIO's "By default, PyIceberg will...", maybe.

(BTW, all comments have now been addressed)

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM!

I would also like to include documentation about the LocationProvider, but we can do that as a follow-up.

I think we should document:

  • LocationProvider
      • SimpleLocationProvider
      • ObjectStoreLocationProvider
  • Loading a Custom LocationProvider

And new table properties:

WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl"

OBJECT_STORE_ENABLED = "write.object-storage.enabled"
OBJECT_STORE_ENABLED_DEFAULT = False

WRITE_OBJECT_STORE_PARTITIONED_PATHS = "write.object-storage.partitioned-paths"
WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT = True

@kevinjqliu kevinjqliu merged commit c68b9b1 into apache:main Jan 10, 2025
7 checks passed
@kevinjqliu
Contributor

Thanks @smaheshwar-pltr for working on this and @Fokko for the review :)

@smaheshwar-pltr smaheshwar-pltr deleted the location-providers branch January 10, 2025 22:54
@smaheshwar-pltr smaheshwar-pltr restored the location-providers branch January 11, 2025 15:43
@smaheshwar-pltr smaheshwar-pltr deleted the location-providers branch January 11, 2025 15:48
Successfully merging this pull request may close these issues:

Support LocationProviders like the Java Iceberg Reference Implementation (#861)