TableMerger - when_matched_delete() fails when Column names contain special characters #2438

Closed
mkp-jansen opened this issue Apr 22, 2024 · 2 comments · Fixed by #2441
Labels
bug Something isn't working

Comments

@mkp-jansen

Environment

Delta-rs version: 0.16.4

Binding: python

Environment:

  • OS: Windows 11 Business

Bug

What happened:
At our company we are currently evaluating Delta Lake for our sensor data. The delta-rs package provides pretty much all the functionality we need. However, for reasons I can't change, we often have column names containing two consecutive dashes, e.g. "y--1" ("y-1" works). We need to be able to delete data from the Delta table. When using the TableMerger, this fails as shown in the example below.

In the documentation it says the following:

Column names with special characters, such as numbers or spaces, should be encapsulated in backticks: "target.`123column`" or "target.`my column`"

However, "when_matched_delete()" has no argument for specifying columns with special characters.
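For reference, the backtick rule quoted above applies to column references inside predicate strings. A minimal sketch of building such a predicate is shown here; `quote_col` is a hypothetical helper written for illustration and is not part of deltalake. Note that the reproduction in this issue uses a predicate that references only `x`, so this quoting would not be expected to avoid the failure reported here.

```python
def quote_col(alias: str, column: str) -> str:
    """Hypothetical helper (not a deltalake API): qualify a column with its
    table alias and wrap the column name in backticks."""
    if "`" in column:
        # Escaping embedded backticks is out of scope for this sketch.
        raise ValueError("column names containing backticks are not handled here")
    return f"{alias}.`{column}`"


# Build a merge predicate that compares the dashed column on both sides.
predicate = f"{quote_col('target', 'y--1')} = {quote_col('source', 'y--1')}"
print(predicate)
```

The resulting string could then be passed as the `predicate` argument of `dt.merge(...)`, in the same way as `'target.x = source.x'` in the reproduction.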

What you expected to happen:

I guess the desired behaviour would be that you can simply delete the matching rows, even when the column names contain special characters.

I would be happy to give a fix a shot (also in Rust), but I would need some guidance along the way.

How to reproduce it:

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

data = pa.table({"x": [1, 2, 3], "y--1": [4, 5, 6]})
write_deltalake("tmp", data)
dt = DeltaTable("tmp")
new_data = pa.table({"x": [2, 3]})

(
    dt.merge(
        source=new_data,
        predicate='target.x = source.x',
        source_alias='source',
        target_alias='target')
    .when_matched_delete()
    .execute()
)
mkp-jansen added the bug label Apr 22, 2024
@mkp-jansen
Author

I forgot to add the error message:

---------------------------------------------------------------------------
DeltaError                                Traceback (most recent call last)
Cell In[6], line 16
      6 dt = DeltaTable("tmp")
      7 new_data = pa.table({"x": [2, 3]})
      9 (
     10     dt.merge(
     11         source=new_data,
     12         predicate='target.x = source.x',
     13         source_alias='source',
     14         target_alias='target')
     15     .when_matched_delete()
---> 16     .execute()
     17 )

File c:\...\.venv\Lib\site-packages\deltalake\table.py:1778, in TableMerger.execute(self)
   1772 def execute(self) -> Dict[str, Any]:
   1773     """Executes `MERGE` with the previously provided settings in Rust with Apache Datafusion query engine.
   1774 
   1775     Returns:
   1776         Dict: metrics
   1777     """
-> 1778     metrics = self.table._table.merge_execute(
   1779         source=self.source,
   1780         predicate=self.predicate,
   1781         source_alias=self.source_alias,
   1782         target_alias=self.target_alias,
   1783         safe_cast=self.safe_cast,
   1784         writer_properties=self.writer_properties._to_dict()
   1785         if self.writer_properties
   1786         else None,
   1787         custom_metadata=self.custom_metadata,
   1788         matched_update_updates=self.matched_update_updates,
   1789         matched_update_predicate=self.matched_update_predicate,
   1790         matched_delete_predicate=self.matched_delete_predicate,
   1791         matched_delete_all=self.matched_delete_all,
   1792         not_matched_insert_updates=self.not_matched_insert_updates,
   1793         not_matched_insert_predicate=self.not_matched_insert_predicate,
   1794         not_matched_by_source_update_updates=self.not_matched_by_source_update_updates,
   1795         not_matched_by_source_update_predicate=self.not_matched_by_source_update_predicate,
   1796         not_matched_by_source_delete_predicate=self.not_matched_by_source_delete_predicate,
   1797         not_matched_by_source_delete_all=self.not_matched_by_source_delete_all,
   1798     )
   1799     self.table.update_incremental()
   1800     return json.loads(metrics)

DeltaError: Generic DeltaTable error: Schema error: No field named __delta_rs_c_y. Valid fields are source.x, __delta_rs_source, target.x, target."y--1", target.__delta_rs_path, __delta_rs_target, __delta_rs_operation, __delta_rs_c_x, "__delta_rs_c_y--1", __delta_rs_delete, __delta_rs_target_insert, __delta_rs_target_update, __delta_rs_target_delete, __delta_rs_target_copy.

Blajda pushed a commit that referenced this issue Apr 23, 2024
…2441)

# Description
@Blajda I don't think `from_qualified_name_ignore_case` was needed here
since the delta_fields don't have relation information, they are just
the column names.

`from_qualified_name_ignore_case` will try to parse `__delta_rs_c_y--1`
and result in `__delta_rs_c_y`, while `from_name` just keeps the
column name as-is, which is preferred.


# Related Issue(s)
- closes #2438
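
The difference described in the commit message can be illustrated with a toy model in Python. This is not the DataFusion or delta-rs code: both functions below are hypothetical stand-ins, and the reason for the truncation (treating `--` as the start of a SQL line comment) is an assumption rather than something stated in this thread. The sketch only reproduces the observed result, where `__delta_rs_c_y--1` becomes `__delta_rs_c_y`.

```python
def from_name_toy(name: str) -> str:
    """Toy stand-in for `from_name`: keep the column name exactly as given."""
    return name


def from_qualified_name_toy(name: str) -> str:
    """Toy stand-in for a parser-based constructor.

    Assumption: everything from `--` onward is dropped, as a SQL line
    comment would be, and the last dot-separated part is the column name.
    """
    without_comment = name.split("--", 1)[0]
    return without_comment.rsplit(".", 1)[-1]


internal_column = "__delta_rs_c_y--1"
print(from_qualified_name_toy(internal_column))  # truncated name, as in the error
print(from_name_toy(internal_column))            # name preserved as-is
```

Under this model, the truncated name no longer matches the `"__delta_rs_c_y--1"` field listed in the error message, which is consistent with the "No field named __delta_rs_c_y" failure.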
@mkp-jansen
Author

Wow - that was fast! Thanks
