Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

z-order fails on table that is partitioned by value with space #2834

Closed
junhl opened this issue Aug 27, 2024 · 7 comments · Fixed by #2897
Closed

z-order fails on table that is partitioned by value with space #2834

junhl opened this issue Aug 27, 2024 · 7 comments · Fixed by #2897
Assignees
Labels
binding/python Issues for the Python package bug Something isn't working

Comments

@junhl
Copy link

junhl commented Aug 27, 2024

Environment

Delta-rs version: 0.19.1

Binding: Python

Environment:

  • Cloud provider: GCP
  • OS: MacOS
  • Other:

Bug

When you try to z-order on a table that is partitioned by a field whose value contains space, it errors out.
I was able to reproduce the issue on local filesystem (MacOS) as well as Google Cloud Storage (I didn't try AWS).

What happened:
Create a delta table and partition it by a column, whose values contain space (e.g. {"country": "Costa Rica"}). The partition is successfully created with country=Costa%20Rica to mitigate space in path.

When you perform z-order on the table, the space is not mitigated (e.g. it looks for country=Costa Rica)

What you expected to happen:

z-order is performed successfully with using the mitigated path.

How to reproduce it:

from deltalake import write_deltalake, DeltaTable
import pandas as pd

df = pd.DataFrame(
    {
        "user": ["James", "Anna", "Sara", "Martin"],
        "country": ["United States", "Canada", "Costa Rica", "South Africa"],
        "age": [34, 23, 45, 26],
    }
)

table_path = "./test_table"
write_deltalake(
    table_or_uri=table_path,
    data=df,
    mode="overwrite",
    partition_by=["country"],
)

test_table = DeltaTable(table_path)

# retrieve by partition works fine
partitioned_df = test_table.to_pandas(
    partitions=[("country", "=", "United States")],
)
print(partitioned_df)

# compact works fine
test_table.optimize.compact()

# z-order does not work
test_table.optimize.z_order(columns=["user"])

# leads to FileNotFoundError: Object at location [...]/test_table/country=Costa Rica/part-00001-78751b37-6dc0-4232-8926-9ddd6f8f14f2-c000.snappy.parquet not found: No such file or directory (os error 2)
# but real path is [...]/test_table/country=Costa%20Rica/part-00001-78751b37-6dc0-4232-8926-9ddd6f8f14f2-c000.snappy.parquet

More details:

I tried few other operations available at DeltaTable level. Z-Ordering seems to be the only one that has this pathing issue. I believe the fix needs to be on Rust side.

@junhl junhl added the bug Something isn't working label Aug 27, 2024
@junhl junhl changed the title z-order z-order fails on table that is partitioned by value with space Aug 27, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Aug 30, 2024
@rtyler rtyler self-assigned this Aug 30, 2024
@rtyler
Copy link
Member

rtyler commented Aug 30, 2024

Thank you for providing some reproduction code! I was able to quickly turn that into a pytest case to verify this behavior

@sherlockbeard
Copy link
Contributor

i think hitting a encode here can solve the issue

.map(|file| format!("delta-rs:///{}", file.location))

@rtyler
Copy link
Member

rtyler commented Aug 30, 2024

You're already looking at the same code I am 😄

The locations are already URL encoded. Somewhere between datafusion and us they're getting handled funky, digging in there now 😸

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Aug 30, 2024

@rtyler It seems in build_compaction/z-order_plan both push the object_meta into partition_files, but the TryFrom code for Add actions to ObjectMeta decodes the url path. So each mergeBin contains these objectMeta's with decoded paths

Ok(ObjectMeta {
            location: file_stats.object_store_path(),
impl LogicalFile<'_> {
    /// Path to the files storage location.
    pub fn path(&self) -> Cow<'_, str> {
        percent_decode_str(self.path.value(self.index)).decode_utf8_lossy()
    }

    /// An object store [`Path`] to the file.
    ///
    /// this tries to parse the file string and if that fails, it will return the string as is.
    // TODO assert consisent handling of the paths encoding when reading log data so this logic can be removed.
    pub fn object_store_path(&self) -> Path {
        let path = self.path();
        // Try to preserve percent encoding if possible
        match Path::parse(path.as_ref()) {
            Ok(path) => path,
            Err(_) => Path::from(path.as_ref()),
        }
    }

@jgeraerts
Copy link

could this happen with any character that get's uri-encoded in a partition?

@rtyler
Copy link
Member

rtyler commented Aug 30, 2024

@jgeraerts very likely 🫠

@junhl
Copy link
Author

junhl commented Aug 30, 2024

Thanks for looking into it @rtyler, let me know if I can help. I'm decently familiar with Delta + Python, but Rust is not in my language pool yet. I'm guessing it's just a few lines missed in z-order section given how other operations using the partitions do work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants