
PyArrow: Don't enforce the schema #902

Merged: 14 commits into apache:main on Jul 11, 2024

Conversation

@Fokko (Contributor) commented Jul 8, 2024

PyIceberg struggled with different Arrow types, such as `string` and `large_string`. They represent the same data, but are different under the hood.

My take is that we should hide this kind of detail from the user as much as possible. Previously we went down the road of passing the Iceberg schema into Arrow, but when doing this, Iceberg has to decide whether it is a large or non-large type.

This PR stops passing down the schema and lets Arrow decide, unless:

  • The type should be evolved
  • In case of re-ordering, we reorder the original types

Let me know what you think @syun64

@kevinjqliu (Contributor) left a comment:
LGTM

@@ -1170,7 +1167,7 @@ def project_table(
if len(tables) < 1:
return pa.Table.from_batches([], schema=schema_to_pyarrow(projected_schema, include_field_ids=False))

-result = pa.concat_tables(tables)
+result = pa.concat_tables(tables, promote_options="permissive")

@Fokko (author) commented Jul 8, 2024

I'm aware of the failing CI and looking into it. It looks like we can automatically cast in the RecordBatchReader.

@sungwy (Collaborator) left a comment:
Hi @Fokko - thank you very much for taking another look at this issue. I've left some comments on the problems I ran into with this approach while working on the original PR.

@@ -1895,7 +1896,7 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
case_sensitive=self.case_sensitive,
limit=self.limit,
),
-)
+).cast(target_schema=target_schema)

When I had originally worked on #786 I thought of this approach as well, but ran into issues like:

tests/integration/test_reads.py::test_pyarrow_batches_deletes[session_catalog_hive] - pyarrow.lib.ArrowTypeError: Field 0 cannot be cast from date32[day] to date32[day]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/ipc.pxi", line 800, in pyarrow.lib.RecordBatchReader.cast
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Field 0 cannot be cast from date32[day] to date32[day]

As a workaround, I opted to cast each pa.Array individually within to_requested_schema, rather than using this API.

This issue is fixed in apache/arrow#41884, but until we pick up the new release, I don't think we will be able to use this approach.

@Fokko (author) replied:

Nice, good to see that it has been fixed. I was also pushing a patch: apache/arrow#43183. RC0 of Arrow 17 has been cut, so the release should land any time now.

@@ -1884,8 +1884,9 @@ def to_arrow_batch_reader(self) -> pa.RecordBatchReader:

from pyiceberg.io.pyarrow import project_batches, schema_to_pyarrow

target_schema = schema_to_pyarrow(self.projection())
@sungwy (Collaborator) commented:

Here we are making an opinionated decision on whether to use the large or small types as the pyarrow schema when reading the Iceberg table as a RecordBatchReader. Is there a reason we don't want to do the same for the table API? I've noticed that we've changed the return type of the table API to Optional[pa.Table] in order to avoid having to use schema_to_pyarrow.

Similarly, other libraries like polars choose one type over the other (large types, in polars' case).

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import polars as pl
>>> strings = pa.array(["a", "b"])
>>> pydict = {"strings": strings}
>>> pa.Table.from_pydict(pydict)
pyarrow.Table
strings: string
----
strings: [["a","b"]]
>>> pq.write_table(pa.Table.from_pydict(pydict), "strings.parquet")
>>> pldf = pl.read_parquet("strings.parquet", use_pyarrow=True)
>>> pldf.dtypes
[String]
>>> pldf.to_arrow()
pyarrow.Table
strings: large_string
----
strings: [["a","b"]]

@Fokko (author) replied:

My preference would be to let Arrow decide. Polars is different because it is also the query engine. Casting the types recomputes the buffers, consuming additional memory/CPU, which I would rather avoid.

For the table, we first materialize all the batches in memory, so if one of them is large, it will automatically upcast, otherwise, it will keep the small types.

@sungwy (Collaborator) replied:

I'm less well versed in how Parquet data is converted into Arrow buffers, so please do check me if I'm not making much sense 🙂

But are we actually casting the types on read?

We decide whether to read with large or small types when instantiating the fragment scanner, which loads the Parquet data into the Arrow buffers. The schema_to_pyarrow() calls on pa.Table or pa.RecordBatchReader, and in to_requested_schema, that follow all represent the table schema in the same consistent (large or small) format, which shouldn't result in any additional casting or reassigning of buffers.

I think the only time we are casting the types is on write, where we may want to downcast for forward compatibility. It looks like we have to choose a schema on write anyway, because using a schema for the ParquetWriter that isn't consistent with the dataframe's schema results in an exception.

@HonahX (Contributor) replied:

> I think the only time we are casting the types is on write, where we may want to downcast it for forward compatibility.

+1. Currently, we use `large_*` types during write. I think it would be better if we wrote files based on the input pyarrow dataframe schema: if the dataframe uses `string`, we also write with `string`.

pyiceberg/io/pyarrow.py (outdated):
if field.doc:
metadata[PYARROW_FIELD_DOC_KEY] = field.doc
if self._include_field_ids:
metadata[PYARROW_PARQUET_FIELD_ID_KEY] = str(field.field_id)
@sungwy (Collaborator) replied:
Ah good catch on this one as well 👍

@sungwy (Collaborator) left a comment:
🥇 LGTM - thank you @Fokko for putting this together

@HonahX (Contributor) left a comment:
LGTM!


@Fokko Fokko merged commit 1b9b884 into apache:main Jul 11, 2024
7 checks passed
@Fokko Fokko deleted the fd-remove-pushing-down-schema branch July 11, 2024 10:45
felixscherz added a commit to felixscherz/iceberg-python that referenced this pull request Jul 17, 2024