
Partition values that have been url encoded cannot be read when using deltalake #1446

Closed
danielkr85 opened this issue Jun 7, 2023 · 4 comments · Fixed by #1613
Labels
bug Something isn't working

Comments


danielkr85 commented Jun 7, 2023

Environment

pyarrow 11.0.0
deltalake 0.9.0

Binding: Python

Environment:

  • Cloud provider: AWS
  • OS: Linux
  • Other:

Bug

What happened:

Receiving an error when attempting to read partition values that contain a colon.

What you expected to happen:

I would expect to be able to read these partitions, as I can using PySpark.

How to reproduce it:

Create a delta table partitioned by timestamp using PySpark. Example partition value: 2023-06-07 13:00:00. When data is loaded to the partition it creates a folder such as this:

'load_ts=2023-06-07 13%3A00%3A00'

Then when running the below code:

from deltalake import DeltaTable
dt = DeltaTable('/spark-warehouse/mock_table_partitioned')
dt.to_pandas()

I encounter the following error. Notice that in the error message the % itself has been URL-encoded, resulting in %25253A instead of %3A:

File /usr/local/lib/python3.9/dist-packages/deltalake/table.py:442, in DeltaTable.to_pandas(self, partitions, columns, filesystem)
    428 def to_pandas(
    429     self,
    430     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    431     columns: Optional[List[str]] = None,
    432     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    433 ) -> "pandas.DataFrame":
    434     """
    435     Build a pandas dataframe using data from the DeltaTable.
    436
   (...)
    440     :return: a pandas dataframe
    441     """
--> 442     return self.to_pyarrow_table(
    443         partitions=partitions, columns=columns, filesystem=filesystem
    444     ).to_pandas()

File /usr/local/lib/python3.9/dist-packages/deltalake/table.py:424, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem)
    410 def to_pyarrow_table(
    411     self,
    412     partitions: Optional[List[Tuple[str, str, Any]]] = None,
    413     columns: Optional[List[str]] = None,
    414     filesystem: Optional[Union[str, pa_fs.FileSystem]] = None,
    415 ) -> pyarrow.Table:
    416     """
    417     Build a PyArrow Table using data from the DeltaTable.
    418
   (...)
    422     :return: the PyArrow table
    423     """
--> 424     return self.to_pyarrow_dataset(
    425         partitions=partitions, filesystem=filesystem
    426     ).to_table(columns=columns)

File /usr/local/lib/python3.9/dist-packages/pyarrow/_dataset.pyx:369, in pyarrow._dataset.Dataset.to_table()

File /usr/local/lib/python3.9/dist-packages/pyarrow/_dataset.pyx:2818, in pyarrow._dataset.Scanner.to_table()

File /usr/local/lib/python3.9/dist-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /usr/local/lib/python3.9/dist-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()

File /usr/local/lib/python3.9/dist-packages/deltalake/fs.py:22, in DeltaStorageHandler.open_input_file(self, path)
     15 def open_input_file(self, path: str) -> pa.PythonFile:
     16     """
     17     Open an input file for random access reading.
     18
     19     :param source: The source to open for reading.
     20     :return:  NativeFile
     21     """
---> 22     return pa.PythonFile(DeltaFileSystemHandler.open_input_file(self, path))

PyDeltaTableError: Object at location /spark-warehouse/mock_table_partitioned/load_dt=2023-05-12 00%25253A00%25253A00/part-00000-132eef45-d5e6-40e0-be85-383dd7c60b4c.c000.snappy.parquet not found: No such file or directory (os error 2)
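The `%25253A` in the error message is consistent with the on-disk `%3A` being percent-encoded two more times. A minimal sketch of that doubled escaping with the standard library's `urllib.parse`:

```python
from urllib.parse import quote

# The folder name Spark wrote to disk literally contains "%3A"
# (the percent-encoded colon from the timestamp partition value).
segment = "13%3A00%3A00"

# Encoding that literal "%" once turns it into "%25":
once = quote(segment)
print(once)   # 13%253A00%253A00

# Encoding it a second time produces the "%25253A" seen in the error message:
twice = quote(once)
print(twice)  # 13%25253A00%25253A00
```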

More details:

This happens with both local storage and AWS S3. I also do not encounter this issue when reading these partitions from within PySpark.

@danielkr85 danielkr85 added the bug Something isn't working label Jun 7, 2023
@wjones127
Collaborator

Was this table created in PySpark? Or with deltalake?

@danielkr85
Author

PySpark

@wjones127
Collaborator

Looks like we have a test case for this that we need to fix:

"multi_partitioned": "Escaped characters in data file paths aren't yet handled (#1079)",

@wjones127
Collaborator

May be related to #1079

wjones127 added a commit that referenced this issue Sep 11, 2023
# Description

In the delta log, paths are percent encoded. We decode them here:


https://github.com/delta-io/delta-rs/blob/787c13a63efa9ada96d303c10c093424215aaa80/rust/src/action/mod.rs#L435-L437

Which is good. But then we've been re-encoding them with `Path::from`. This PR changes to use `Path::parse` where possible instead. Rather than propagating errors, we just fall back to `Path::from` for now. Read more here:
https://docs.rs/object_store/0.7.0/object_store/path/struct.Path.html#encode
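In Python terms, the difference between the two behaviors can be sketched with `urllib.parse` (the file name below is a shortened, hypothetical stand-in for illustration):

```python
from urllib.parse import quote, unquote

# Path as stored in the delta log (percent-encoded);
# the part file name here is hypothetical and shortened.
logged = "load_ts=2023-06-07 13%253A00%253A00/part-00000.parquet"

# Decoding once recovers the literal on-disk path, whose folder name
# really does contain "%3A" (Spark encoded the colon when writing it):
on_disk = unquote(logged)
print(on_disk)  # load_ts=2023-06-07 13%3A00%3A00/part-00000.parquet

# "Path::from"-style handling treats this as a raw string and encodes it
# again, corrupting "%3A" into "%253A" and breaking the file lookup:
reencoded = quote(on_disk, safe="/=- .")
print("%253A" in reencoded)  # True

# "Path::parse"-style handling accepts the string as already encoded and
# leaves valid percent-escapes untouched, so the real file is found.
```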

# Related Issue(s)

* closes #1533
* closes #1446 
* closes #1079
* closes #1393

