
Partition values that have been url encoded cannot be read when using deltalake #1446

Closed
danielkr85 opened this issue Jun 7, 2023 · 4 comments · Fixed by #1613
Labels
bug Something isn't working

Comments


danielkr85 commented Jun 7, 2023

Environment

pyarrow 11.0.0
deltalake 0.9.0

Binding: Python

Environment:

  • Cloud provider: AWS
  • OS: Linux
  • Other:

Bug

What happened:

Receiving an error when attempting to read partition values that contain a colon.

What you expected to happen:

I would expect to be able to read these partitions, as I can using PySpark.

How to reproduce it:

Create a delta table partitioned by timestamp using PySpark. Example partition value: 2023-06-07 13:00:00. When data is loaded to the partition it creates a folder such as this:

'load_ts=2023-06-07 13%3A00%3A00'

Then when running the below code:

from deltalake import DeltaTable
dt = DeltaTable('/spark-warehouse/mock_table_partitioned')
dt.to_pandas()

I encounter the following error. Notice that in the error message the % itself has been URL-encoded, resulting in %25253A instead of %3A:

File /usr/local/lib/python3.9/dist-packages/deltalake/table.py:442, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    428 def to_pandas(
    429     self,
    430     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    431     columns: Optional[List[str]] = None,
    432     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    433 ) -> "pandas.DataFrame":
    434     """
    435     Build a pandas dataframe using data from the DeltaTable.
    436
   (...)
    440     :return: a pandas dataframe
    441     """
--> 442     return self.to_pyarrow_table(
    443         partitions=partitions, columns=columns, filesystem=filesystem
    444     ).to_pandas()

File /usr/local/lib/python3.9/dist-packages/deltalake/table.py:424, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    410 def to_pyarrow_table(
    411     self,
    412     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    413     columns: Optional[List[str]] = None,
    414     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    415 ) -> pyarrow.Table:
    416     """
    417     Build a PyArrow Table using data from the DeltaTable.
    418
   (...)
    422     :return: the PyArrow table
    423     """
--> 424     return self.to_pyarrow_dataset(
    425         partitions=partitions, filesystem=filesystem
    426     ).to_table(columns=columns)

File /usr/local/lib/python3.9/dist-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /usr/local/lib/python3.9/dist-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /usr/local/lib/python3.9/dist-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /usr/local/lib/python3.9/dist-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()

File /usr/local/lib/python3.9/dist-packages/deltalake/fs.py:22, in DeltaStorageHandler.open_input_file(self, path)
     15 def open_input_file(self, path: str) -> pa.PythonFile:
     16     """
     17     Open an input file for random access reading.
     18
     19     :param source: The source to open for reading.
     20     :return:  NativeFile
     21     """
---> 22     return pa.PythonFile(DeltaFileSystemHandler.open_input_file(self, path))

PyDeltaTableError: Object at location /spark-warehouse/mock_table_partitioned/load_dt=2023-05-12 00%25253A00%25253A00/part-00000-132eef45-d5e6-40e0-be85-383dd7c60b4c.c000.snappy.parquet not found: No such file or directory (os error 2)
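The `%25253A` in the error message is consistent with the on-disk `%3A` being percent-encoded two more times. A minimal sketch of that doubled escaping with the standard library's `urllib.parse`:

```python
from urllib.parse import quote

# The folder name Spark wrote to disk literally contains "%3A"
# (the percent-encoded colon from the timestamp partition value).
segment = "13%3A00%3A00"

# Encoding that literal "%" once turns it into "%25":
once = quote(segment)
print(once)   # 13%253A00%253A00

# Encoding it a second time produces the "%25253A" seen in the error message:
twice = quote(once)
print(twice)  # 13%25253A00%25253A00
```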

More details:

This happens with both local storage and AWS S3. I also do not encounter this issue when reading these partitions from within PySpark.

@danielkr85 danielkr85 added the bug Something isn't working label Jun 7, 2023
@wjones127
Collaborator

Was this table created in PySpark? Or with deltalake?

@danielkr85
Author

PySpark

@wjones127
Collaborator

Looks like we have a test case for this that we need to fix:

"multi_partitioned": "Escaped characters in data file paths aren't yet handled (#1079)",

@wjones127
Collaborator

May be related to #1079

wjones127 added a commit that referenced this issue Sep 11, 2023
# Description

In the delta log, paths are percent encoded. We decode them here:


https://github.com/delta-io/delta-rs/blob/787c13a63efa9ada96d303c10c093424215aaa80/rust/src/action/mod.rs#L435-L437

Which is good. But then we've been re-encoding them with `Path::from`. This PR changes to use `Path::parse` where possible instead. Rather than propagating errors, we just fall back to `Path::from` for now. Read more here:
https://docs.rs/object_store/0.7.0/object_store/path/struct.Path.html#encode
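In Python terms, the difference between the two behaviors can be sketched with `urllib.parse` (the file name below is a shortened, hypothetical stand-in for illustration):

```python
from urllib.parse import quote, unquote

# Path as stored in the delta log (percent-encoded);
# the part file name here is hypothetical and shortened.
logged = "load_ts=2023-06-07 13%253A00%253A00/part-00000.parquet"

# Decoding once recovers the literal on-disk path, whose folder name
# really does contain "%3A" (Spark encoded the colon when writing it):
on_disk = unquote(logged)
print(on_disk)  # load_ts=2023-06-07 13%3A00%3A00/part-00000.parquet

# "Path::from"-style handling treats this as a raw string and encodes it
# again, corrupting "%3A" into "%253A" and breaking the file lookup:
reencoded = quote(on_disk, safe="/=- .")
print("%253A" in reencoded)  # True

# "Path::parse"-style handling accepts the string as already encoded and
# leaves valid percent-escapes untouched, so the real file is found.
```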

# Related Issue(s)

* closes #1533
* closes #1446 
* closes #1079
* closes #1393

