Support merge manifests on writes (MergeAppend) #363

Merged on Jul 10, 2024 (22 commits)

Contributor

@HonahX HonahX commented Feb 4, 2024

Add MergeAppendFiles. This PR will enable the following configurations:

  • commit.manifest-merge.enabled: Controls whether to automatically merge manifests on writes.
  • commit.manifest.min-count-to-merge: Minimum number of manifests to accumulate before merging.
  • commit.manifest.target-size-bytes: Target size when merging manifest files.

Since commit.manifest-merge.enabled defaults to True, we need to make MergeAppend the default way to append data, to align with the property definition and the Java implementation.
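
As a rough illustration of how the three properties could be read from a table's string-valued properties, here is a minimal sketch. `ManifestMergeConfig` is a hypothetical helper, not a PyIceberg class, and the default values for the count and size are assumptions taken from the Iceberg Java table-property documentation.

```python
from dataclasses import dataclass
from typing import Mapping

# Property names come from the PR description above.
MANIFEST_MERGE_ENABLED = "commit.manifest-merge.enabled"
MANIFEST_MIN_COUNT_TO_MERGE = "commit.manifest.min-count-to-merge"
MANIFEST_TARGET_SIZE_BYTES = "commit.manifest.target-size-bytes"


@dataclass(frozen=True)
class ManifestMergeConfig:  # hypothetical helper, not part of PyIceberg
    merge_enabled: bool = True               # default stated in the PR description
    min_count_to_merge: int = 100            # assumed default (Iceberg Java docs)
    target_size_bytes: int = 8 * 1024 * 1024  # assumed default (Iceberg Java docs)

    @classmethod
    def from_properties(cls, properties: Mapping[str, str]) -> "ManifestMergeConfig":
        """Parse string-valued table properties, falling back to the defaults."""
        defaults = cls()
        enabled = properties.get(MANIFEST_MERGE_ENABLED)
        return cls(
            merge_enabled=defaults.merge_enabled if enabled is None else enabled.strip().lower() == "true",
            min_count_to_merge=int(properties.get(MANIFEST_MIN_COUNT_TO_MERGE, defaults.min_count_to_merge)),
            target_size_bytes=int(properties.get(MANIFEST_TARGET_SIZE_BYTES, defaults.target_size_bytes)),
        )


config = ManifestMergeConfig.from_properties({"commit.manifest.min-count-to-merge": "2"})
print(config)
```

Only the property that is explicitly set is overridden; the other two keep their defaults.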

Contributor

@Fokko Fokko left a comment

Great start @HonahX! Maybe we want to see if there is anything we can split out, such as the rolling manifest writer.

pyiceberg/table/__init__.py (outdated)
# TODO: need to re-consider the name here: manifest containing positional deletes and manifest containing deleted entries
unmerged_deletes_manifests = [manifest for manifest in existing_manifests if manifest.content == ManifestContent.DELETES]

data_manifest_merge_manager = ManifestMergeManager(
Contributor

We're changing the append operation from a fast-append to a regular append when it hits a threshold. I would be more comfortable with keeping the compaction separate. This way we know that an append/overwrite is always fast and in constant time. For example, if you have a process that appends data, you know how fast it will run (actually it is a function of the number of manifests).
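
To make the threshold behavior concrete, here is a simplified sketch of how a merge step might group manifests by size and only merge groups that reach the minimum count. The function names are hypothetical, the greedy packing is an assumption, and the actual rules in the PR or in the Java implementation may differ.

```python
from typing import List, Tuple


def pack_manifests(lengths: List[int], target_size_bytes: int) -> List[List[int]]:
    """Greedily group manifest sizes into bins of at most target_size_bytes."""
    bins: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for length in lengths:
        # Close the current bin when the next manifest would overflow it.
        if current and current_size + length > target_size_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(length)
        current_size += length
    if current:
        bins.append(current)
    return bins


def plan_merge(lengths: List[int], target_size_bytes: int, min_count_to_merge: int) -> List[Tuple[str, List[int]]]:
    """Label each bin 'merge' or 'keep'; small bins are left untouched."""
    plan = []
    for manifest_bin in pack_manifests(lengths, target_size_bytes):
        worth_merging = len(manifest_bin) > 1 and len(manifest_bin) >= min_count_to_merge
        plan.append(("merge" if worth_merging else "keep", manifest_bin))
    return plan


plan = plan_merge([5000, 5000, 5000, 15000, 2000], target_size_bytes=16000, min_count_to_merge=2)
for action, manifest_bin in plan:
    print(action, manifest_bin)
```

In this sketch the cost of a commit depends on how many manifests fall into bins that get rewritten, which is the variability the comment above is concerned about.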

Contributor Author

Thanks for the explanation! Totally agree! I was thinking it might be a good time to bring FastAppend and MergeAppend to pyiceberg, making them inherit from a _SnapshotProducer

@Fokko Fokko added this to the PyIceberg 0.7.0 release milestone Feb 7, 2024
@@ -944,7 +949,8 @@ def append(self, df: pa.Table) -> None:
if len(self.spec().fields) > 0:
raise ValueError("Cannot write to partitioned tables")

merge = _MergingSnapshotProducer(operation=Operation.APPEND, table=self)
# TODO: need to consider how to support both _MergeAppend and _FastAppend
Contributor

Do we really want to support both? This part of the Java code has been a major source of (hard to debug) problems. Splitting out the commit and compaction path completely would simplify that quite a bit.

Contributor Author

I think it is a good idea to have a separate API in UpdateSnapshot in #446 that only compacts manifests. However, I believe retaining MergeAppend is also necessary because of the commit.manifest-merge.enabled setting. When enabled (which is the default), this setting leads users to expect automatic merging of manifests when they append/overwrite data, rather than having to compact manifests through another API. What do you think?

@HonahX HonahX changed the title Support merge manifests on writes Support merge manifests on writes (MergeAppend) Feb 23, 2024
@HonahX HonahX marked this pull request as ready for review February 26, 2024 10:51
Contributor

@Fokko Fokko left a comment

Hey @HonahX thanks for working on this and sorry for the late reply. I wanted to take the time to test this properly.

It looks like either the snapshot inheritance is not working properly, or something is off with the writer. I converted the Avro manifest files to JSON using avro-tools, and noticed the following:

{
    "status": 1,
    "snapshot_id": {
        "long": 6972473597951752000
    },
    "data_sequence_number": {
        "long": -1
    },
    "file_sequence_number": {
        "long": -1
    },
...
}
{
    "status": 0,
    "snapshot_id": {
        "long": 3438738529910612500
    },
    "data_sequence_number": {
        "long": -1
    },
    "file_sequence_number": {
        "long": -1
    },
...
}
{
    "status": 0,
    "snapshot_id": {
        "long": 1638533332780464400
    },
    "data_sequence_number": {
        "long": 1
    },
    "file_sequence_number": {
        "long": 1
    },
....
}

It looks like the snapshot inheritance is not working properly when rewriting the manifests.

@@ -355,6 +355,44 @@ def test_data_files(spark: SparkSession, session_catalog: Catalog, arrow_table_w
assert [row.deleted_data_files_count for row in rows] == [0, 0, 1, 0, 0]


@pytest.mark.integration
Contributor

Can you parameterize the test for both V1 and V2 tables?

Contributor

We want to assert the manifest-entries as well (only for the merge-appended one).

Collaborator

@sungwy sungwy left a comment

Thank you very much for adding this @HonahX . Just one small nit, and otherwise looks good to me!

@@ -1091,7 +1111,7 @@ def append(self, df: pa.Table) -> None:
         _check_schema(self.schema(), other_schema=df.schema)

         with self.transaction() as txn:
-            with txn.update_snapshot().fast_append() as update_snapshot:
+            with txn.update_snapshot().merge_append() as update_snapshot:
Collaborator

Could we update the new add_files method to also use merge_append?

That seems to be the default choice of snapshot producer in Java

Contributor

@syun64 Could you elaborate on the motivation to pick merge-append over fast-append? For Java, it is for historical reasons, since fast-append was added later. Fast-append creates more metadata, but it also has advantages:

  • It takes less time to commit, since it doesn't rewrite any existing manifests. This reduces the chance of a conflict.
  • The time it takes to commit is more predictable and fairly constant relative to the number of data files that are written.
  • When you static-overwrite partitions, as you do in your typical ETL, it speeds up the deletes, since a whole manifest produced by the previous fast-append can simply be dropped.

The main downside is that full-table scans need to evaluate more metadata.
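
The trade-off can be illustrated with a toy simulation. It is a simplified model, not the PR's algorithm: every append adds one data file in one new manifest, size-based binning is ignored, and `simulate` is a hypothetical name.

```python
from typing import List, Tuple


def simulate(num_appends: int, min_count_to_merge: int, merge: bool) -> Tuple[List[int], List[int]]:
    """Return (entries per live manifest, manifest entries written per commit).

    With merge=True, all manifests are rewritten into a single one as soon as
    their count reaches min_count_to_merge.
    """
    manifests: List[int] = []           # number of entries in each live manifest
    written_per_commit: List[int] = []
    for _ in range(num_appends):
        manifests.insert(0, 1)          # newest manifest first, holding the new file
        written = 1
        if merge and len(manifests) >= min_count_to_merge:
            written = sum(manifests)    # every entry is rewritten into the merged manifest
            manifests = [written]
        written_per_commit.append(written)
    return manifests, written_per_commit


fast = simulate(5, min_count_to_merge=2, merge=False)
merged = simulate(5, min_count_to_merge=2, merge=True)
print("fast-append :", fast)
print("merge-append:", merged)
```

Under these assumptions, fast-append writes a constant amount of metadata per commit but leaves one manifest per append for scans to read, while merge-append keeps a single manifest at the price of rewriting a growing number of entries on each commit.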

Collaborator

That's a good argument @Fokko . Especially in a world where we are potentially moving the work of doing table scans into the Rest Catalog, compacting manifests on write isn't important for this function that already looks to prioritize commit speed over anything else.

I think it makes sense to leave the function to use fast_append and let the users rely on other means of optimizing their table scans.

@HonahX HonahX force-pushed the manifest_compaction branch from 57eba6a to bf63c03 Compare June 3, 2024 07:29
Contributor Author

HonahX commented Jun 3, 2024

Sorry for the long wait. I've fixed the sequence number inheritance issue. Previously, some manifest entries incorrectly persisted the -1 sequence number inherited from a newly constructed ManifestFile. I added a wrapper in ManifestWriter to ensure the sequence number is None when unassigned.

I will add tests and update the doc soon.
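
The idea behind the fix can be sketched as follows. This is not the code from the PR: `Entry`, `prepare_for_write`, and `inherit_on_read` are hypothetical, reduced stand-ins. The sketch assumes the rule from the Iceberg spec that a null sequence number on an added entry is inherited from the manifest on read.

```python
from dataclasses import dataclass, replace
from typing import Optional

UNASSIGNED_SEQ = -1  # sentinel that leaked into the manifest dumps earlier in this thread
ADDED = 1            # manifest entry status code from the Iceberg spec


@dataclass(frozen=True)
class Entry:  # hypothetical, heavily reduced manifest entry
    status: int
    data_sequence_number: Optional[int]
    file_sequence_number: Optional[int]


def _none_if_unassigned(seq: Optional[int]) -> Optional[int]:
    return None if seq == UNASSIGNED_SEQ else seq


def prepare_for_write(entry: Entry) -> Entry:
    """Writer-side wrapper: never persist the -1 sentinel, write null instead."""
    return replace(
        entry,
        data_sequence_number=_none_if_unassigned(entry.data_sequence_number),
        file_sequence_number=_none_if_unassigned(entry.file_sequence_number),
    )


def inherit_on_read(entry: Entry, manifest_sequence_number: int) -> Entry:
    """Reader side: added entries with null sequence numbers inherit the manifest's."""
    if entry.status != ADDED:
        return entry
    data_seq = manifest_sequence_number if entry.data_sequence_number is None else entry.data_sequence_number
    file_seq = manifest_sequence_number if entry.file_sequence_number is None else entry.file_sequence_number
    return replace(entry, data_sequence_number=data_seq, file_sequence_number=file_seq)


written = prepare_for_write(Entry(status=ADDED, data_sequence_number=-1, file_sequence_number=-1))
read_back = inherit_on_read(written, manifest_sequence_number=5)
print(written, read_back)
```

If the -1 were persisted instead of null, the reader would have nothing to inherit and the entry would keep an invalid sequence number, which is the symptom shown in the dumps above.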

@HonahX HonahX requested review from Fokko and sungwy June 4, 2024 06:53
Contributor Author

@HonahX HonahX left a comment

Tests and doc are pushed! @Fokko @syun64 Could you please review this again when you have a chance?

Collaborator

@sungwy sungwy left a comment

Just a few nits, otherwise looks good @HonahX

Collaborator

@sungwy sungwy left a comment

This looks good to me @HonahX 👍

Contributor

Fokko commented Jun 30, 2024

I'm seeing some odd behavior:

from pyiceberg.catalog.sql import SqlCatalog
from datetime import datetime, timezone, date
import uuid
import pyarrow as pa

pa_schema = pa.schema([
    ("bool", pa.bool_()),
    ("string", pa.large_string()),
    ("string_long", pa.large_string()),
    ("int", pa.int32()),
    ("long", pa.int64()),
    ("float", pa.float32()),
    ("double", pa.float64()),
    # Not supported by Spark
    # ("time", pa.time64('us')),
    ("timestamp", pa.timestamp(unit="us")),
    ("timestamptz", pa.timestamp(unit="us", tz="UTC")),
    ("date", pa.date32()),
    # Not supported by Spark
    # ("time", pa.time64("us")),
    # Not natively supported by Arrow
    # ("uuid", pa.fixed(16)),
    ("binary", pa.large_binary()),
    ("fixed", pa.binary(16)),
])


TEST_DATA_WITH_NULL = {
    "bool": [False, None, True],
    "string": ["a", None, "z"],
    # Go over the 16 bytes to kick in truncation
    "string_long": ["a" * 22, None, "z" * 22],
    "int": [1, None, 9],
    "long": [1, None, 9],
    "float": [0.0, None, 0.9],
    "double": [0.0, None, 0.9],
    # 'time': [1_000_000, None, 3_000_000],  # Example times: 1s, none, and 3s past midnight #Spark does not support time fields
    "timestamp": [datetime(2023, 1, 1, 19, 25, 00), None, datetime(2023, 3, 1, 19, 25, 00)],
    "timestamptz": [
        datetime(2023, 1, 1, 19, 25, 00, tzinfo=timezone.utc),
        None,
        datetime(2023, 3, 1, 19, 25, 00, tzinfo=timezone.utc),
    ],
    "date": [date(2023, 1, 1), None, date(2023, 3, 1)],
    # Not supported by Spark
    # 'time': [time(1, 22, 0), None, time(19, 25, 0)],
    # Not natively supported by Arrow
    # 'uuid': [uuid.UUID('00000000-0000-0000-0000-000000000000').bytes, None, uuid.UUID('11111111-1111-1111-1111-111111111111').bytes],
    "binary": [b"\01", None, b"\22"],
    "fixed": [
        uuid.UUID("00000000-0000-0000-0000-000000000000").bytes,
        None,
        uuid.UUID("11111111-1111-1111-1111-111111111111").bytes,
    ],
}

catalog = SqlCatalog("test_sql_catalog", uri="sqlite:///:memory:", warehouse="/tmp/")

pa_table = pa.Table.from_pydict(TEST_DATA_WITH_NULL, schema=pa_schema)

catalog.create_namespace(('some',))

tbl = catalog.create_table(identifier="some.table", schema=pa_schema, properties={
    "commit.manifest.min-count-to-merge": "2"
})

for num in range(5):
    print(f"Appended: {num}")
    tbl.merge_append(pa_table)

It tries to read a corrupt file (or a bug in our reader):

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
Cell In[2], line 71
     69 for num in range(5):
     70     print(f"Appended: {num}")
---> 71     tbl.merge_append(pa_table)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:1424, in Table.merge_append(self, df, snapshot_properties)
   1411 """
   1412 Shorthand API for appending a PyArrow table to a table transaction and merging manifests on write.
   1413 
   (...)
   1421     snapshot_properties: Custom properties to be added to the snapshot summary
   1422 """
   1423 with self.transaction() as tx:
-> 1424     tx.merge_append(df=df, snapshot_properties=snapshot_properties)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:472, in Transaction.merge_append(self, df, snapshot_properties)
    468 data_files = _dataframe_to_data_files(
    469     table_metadata=self._table.metadata, write_uuid=update_snapshot.commit_uuid, df=df, io=self._table.io
    470 )
    471 for data_file in data_files:
--> 472     update_snapshot.append_data_file(data_file)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:1899, in UpdateTableMetadata.__exit__(self, _, value, traceback)
   1897 def __exit__(self, _: Any, value: Any, traceback: Any) -> None:
   1898     """Close and commit the change."""
-> 1899     self.commit()

File ~/work/iceberg-python/pyiceberg/table/__init__.py:1895, in UpdateTableMetadata.commit(self)
   1894 def commit(self) -> None:
-> 1895     self._transaction._apply(*self._commit())

File ~/work/iceberg-python/pyiceberg/table/__init__.py:2966, in _SnapshotProducer._commit(self)
   2965 def _commit(self) -> UpdatesAndRequirements:
-> 2966     new_manifests = self._manifests()
   2967     next_sequence_number = self._transaction.table_metadata.next_sequence_number()
   2969     summary = self._summary(self.snapshot_properties)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:2935, in _SnapshotProducer._manifests(self)
   2932 delete_manifests = executor.submit(_write_delete_manifest)
   2933 existing_manifests = executor.submit(self._existing_manifests)
-> 2935 return self._process_manifests(added_manifests.result() + delete_manifests.result() + existing_manifests.result())

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3111, in MergeAppendFiles._process_manifests(self, manifests)
   3102 unmerged_deletes_manifests = [manifest for manifest in manifests if manifest.content == ManifestContent.DELETES]
   3104 data_manifest_merge_manager = _ManifestMergeManager(
   3105     target_size_bytes=self._target_size_bytes,
   3106     min_count_to_merge=self._min_count_to_merge,
   3107     merge_enabled=self._merge_enabled,
   3108     snapshot_producer=self,
   3109 )
-> 3111 return data_manifest_merge_manager.merge_manifests(unmerged_data_manifests) + unmerged_deletes_manifests

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3987, in _ManifestMergeManager.merge_manifests(self, manifests)
   3985 merged_manifests = []
   3986 for spec_id in reversed(groups.keys()):
-> 3987     merged_manifests.extend(self._merge_group(first_manifest, spec_id, groups[spec_id]))
   3989 return merged_manifests

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3974, in _ManifestMergeManager._merge_group(self, first_manifest, spec_id, manifests)
   3963     return output_manifests
   3965 # executor = ExecutorFactory.get_or_create()
   3966 # futures = [executor.submit(merge_bin, b) for b in bins]
   3967 
   (...)
   3971 # for future in concurrent.futures.as_completed(futures):
   3972 #     completed_futures.add(future)
-> 3974 bin_results: List[List[ManifestFile]] = [merge_bin(b) for b in bins]
   3976 return [manifest for bin_result in bin_results for manifest in bin_result]

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3974, in <listcomp>(.0)
   3963     return output_manifests
   3965 # executor = ExecutorFactory.get_or_create()
   3966 # futures = [executor.submit(merge_bin, b) for b in bins]
   3967 
   (...)
   3971 # for future in concurrent.futures.as_completed(futures):
   3972 #     completed_futures.add(future)
-> 3974 bin_results: List[List[ManifestFile]] = [merge_bin(b) for b in bins]
   3976 return [manifest for bin_result in bin_results for manifest in bin_result]

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3961, in _ManifestMergeManager._merge_group.<locals>.merge_bin(manifest_bin)
   3959     output_manifests.extend(manifest_bin)
   3960 else:
-> 3961     output_manifests.append(self._create_manifest(spec_id, manifest_bin))
   3963 return output_manifests

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3934, in _ManifestMergeManager._create_manifest(self, spec_id, manifest_bin)
   3932 with self._snapshot_producer.new_manifest_writer(spec=self._snapshot_producer.spec(spec_id)) as writer:
   3933     for manifest in manifest_bin:
-> 3934         for entry in self._snapshot_producer.fetch_manifest_entry(manifest=manifest, discard_deleted=False):
   3935             if entry.status == ManifestEntryStatus.DELETED and entry.snapshot_id == self._snapshot_producer.snapshot_id:
   3936                 #  only files deleted by this snapshot should be added to the new manifest
   3937                 writer.delete(entry)

File ~/work/iceberg-python/pyiceberg/table/__init__.py:3034, in _SnapshotProducer.fetch_manifest_entry(self, manifest, discard_deleted)
   3033 def fetch_manifest_entry(self, manifest: ManifestFile, discard_deleted: bool = True) -> List[ManifestEntry]:
-> 3034     return manifest.fetch_manifest_entry(io=self._io, discard_deleted=discard_deleted)

File ~/work/iceberg-python/pyiceberg/manifest.py:611, in ManifestFile.fetch_manifest_entry(self, io, discard_deleted)
    609 print(f"MANIFEST: {self.manifest_path}")
    610 input_file = io.new_input(self.manifest_path)
--> 611 with AvroFile[ManifestEntry](
    612     input_file,
    613     MANIFEST_ENTRY_SCHEMAS[DEFAULT_READ_VERSION],
    614     read_types={-1: ManifestEntry, 2: DataFile},
    615     read_enums={0: ManifestEntryStatus, 101: FileFormat, 134: DataFileContent},
    616 ) as reader:
    617     return [
    618         _inherit_from_manifest(entry, self)
    619         for entry in reader
    620         if not discard_deleted or entry.status != ManifestEntryStatus.DELETED
    621     ]

File ~/work/iceberg-python/pyiceberg/avro/file.py:172, in AvroFile.__enter__(self)
    170 with self.input_file.open() as f:
    171     self.decoder = new_decoder(f.read())
--> 172 self.header = self._read_header()
    173 self.schema = self.header.get_schema()
    174 if not self.read_schema:

File ~/work/iceberg-python/pyiceberg/avro/file.py:220, in AvroFile._read_header(self)
    219 def _read_header(self) -> AvroFileHeader:
--> 220     return construct_reader(META_SCHEMA, {-1: AvroFileHeader}).read(self.decoder)

File ~/work/iceberg-python/pyiceberg/avro/reader.py:333, in StructReader.read(self, decoder)
    331 for pos, field_reader in self._field_reader_functions:
    332     if pos is not None:
--> 333         struct[pos] = field_reader(decoder)  # later: pass reuse in here
    334     else:
    335         field_reader(decoder)

File ~/work/iceberg-python/pyiceberg/avro/reader.py:469, in MapReader.read(self, decoder)
    467         block_count = decoder.read_int()
    468 else:
--> 469     block_count = decoder.read_int()
    470     while block_count != 0:
    471         if block_count < 0:

File ~/work/iceberg-python/pyiceberg/avro/decoder_fast.pyx:85, in pyiceberg.avro.decoder_fast.CythonBinaryDecoder.read_int()

File ~/work/iceberg-python/pyiceberg/avro/decoder_fast.pyx:92, in pyiceberg.avro.decoder_fast.CythonBinaryDecoder.read_int()

EOFError: EOF: read 1 bytes

It tries to read this file, which turns out to be empty?

avro-tools tojson /tmp/some.db/table/metadata/94206240-2ae8-47e7-bffe-fd4a1b35d91d-m0.avro
24/06/30 21:44:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

avro-tools getmeta /tmp/some.db/table/metadata/94206240-2ae8-47e7-bffe-fd4a1b35d91d-m0.avro
24/06/30 21:45:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
schema	{"type":"struct","fields":[{"id":1,"name":"bool","type":"boolean","required":false},{"id":2,"name":"string","type":"string","required":false},{"id":3,"name":"string_long","type":"string","required":false},{"id":4,"name":"int","type":"int","required":false},{"id":5,"name":"long","type":"long","required":false},{"id":6,"name":"float","type":"float","required":false},{"id":7,"name":"double","type":"double","required":false},{"id":8,"name":"timestamp","type":"timestamp","required":false},{"id":9,"name":"timestamptz","type":"timestamptz","required":false},{"id":10,"name":"date","type":"date","required":false},{"id":11,"name":"binary","type":"binary","required":false},{"id":12,"name":"fixed","type":"fixed[16]","required":false}],"schema-id":0,"identifier-field-ids":[]}
partition-spec	{"spec-id":0,"fields":[]}
partition-spec-id	0
format-version	2
content	data
avro.schema	{"type": "record", "fields": [{"name": "status", "field-id": 0, "type": "int"}, {"name": "snapshot_id", "field-id": 1, "type": ["null", "long"], "default": null}, {"name": "data_sequence_number", "field-id": 3, "type": ["null", "long"], "default": null}, {"name": "file_sequence_number", "field-id": 4, "type": ["null", "long"], "default": null}, {"name": "data_file", "field-id": 2, "type": {"type": "record", "fields": [{"name": "content", "field-id": 134, "type": "int", "doc": "File format name: avro, orc, or parquet"}, {"name": "file_path", "field-id": 100, "type": "string", "doc": "Location URI with FS scheme"}, {"name": "file_format", "field-id": 101, "type": "string", "doc": "File format name: avro, orc, or parquet"}, {"name": "partition", "field-id": 102, "type": {"type": "record", "fields": [], "name": "r102"}, "doc": "Partition data tuple, schema based on the partition spec"}, {"name": "record_count", "field-id": 103, "type": "long", "doc": "Number of records in the file"}, {"name": "file_size_in_bytes", "field-id": 104, "type": "long", "doc": "Total file size in bytes"}, {"name": "column_sizes", "field-id": 108, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k117_v118", "fields": [{"name": "key", "type": "int", "field-id": 117}, {"name": "value", "type": "long", "field-id": 118}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to total size on disk"}, {"name": "value_counts", "field-id": 109, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k119_v120", "fields": [{"name": "key", "type": "int", "field-id": 119}, {"name": "value", "type": "long", "field-id": 120}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to total count, including null and NaN"}, {"name": "null_value_counts", "field-id": 110, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k121_v122", "fields": [{"name": "key", "type": "int", "field-id": 121}, {"name": "value", 
"type": "long", "field-id": 122}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to null value count"}, {"name": "nan_value_counts", "field-id": 137, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k138_v139", "fields": [{"name": "key", "type": "int", "field-id": 138}, {"name": "value", "type": "long", "field-id": 139}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to number of NaN values in the column"}, {"name": "lower_bounds", "field-id": 125, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k126_v127", "fields": [{"name": "key", "type": "int", "field-id": 126}, {"name": "value", "type": "bytes", "field-id": 127}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to lower bound"}, {"name": "upper_bounds", "field-id": 128, "type": ["null", {"type": "array", "items": {"type": "record", "name": "k129_v130", "fields": [{"name": "key", "type": "int", "field-id": 129}, {"name": "value", "type": "bytes", "field-id": 130}]}, "logicalType": "map"}], "default": null, "doc": "Map of column id to upper bound"}, {"name": "key_metadata", "field-id": 131, "type": ["null", "bytes"], "default": null, "doc": "Encryption key metadata blob"}, {"name": "split_offsets", "field-id": 132, "type": ["null", {"type": "array", "element-id": 133, "items": "long"}], "default": null, "doc": "Splittable offsets"}, {"name": "equality_ids", "field-id": 135, "type": ["null", {"type": "array", "element-id": 136, "items": "long"}], "default": null, "doc": "Field ids used to determine row equality in equality delete files."}, {"name": "sort_order_id", "field-id": 140, "type": ["null", "int"], "default": null, "doc": "ID representing sort order for this file"}], "name": "r2"}}], "name": "manifest_entry"}
avro.codec	null

Looks like we're writing empty files: #876

Contributor

@Fokko Fokko left a comment

Looking good @HonahX ! 🙌

@@ -273,6 +273,10 @@ tbl.append(df)

# or

tbl.merge_append(df)
Contributor

I'm reluctant to expose this to the public API for a couple of reasons:

  • Unsure if folks know what the impact is between choosing fast- or merge appends.
  • It might also be that we do appends as part of the operation (upserts as an obvious one).
  • Another method to the public API :)

How about having something similar as in Java, to control this using a table property: https://iceberg.apache.org/docs/1.5.2/configuration/#table-behavior-properties

Contributor Author

Sounds great! I am also +1 on letting it be controlled by the config. I made merge_append a separate API to mirror the Java implementation, which has the newAppend and newFastAppend APIs. But it seems better to just make commit.manifest-merge.enabled default to False on the Python side.

I will still keep FastAppend and MergeAppend as separate classes, and keep merge_append in the UpdateSnapshot class for clarity, although the current MergeAppend is purely FastAppend + manifest merge.

Just curious: why doesn't newAppend on the Java side return a FastAppend implementation when commit.manifest-merge.enabled is False? Is it due to some backward compatibility issue?
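
The "FastAppend + manifest merge" relationship described in this comment could be modeled roughly like this. The classes are hypothetical, reduced stand-ins that work on manifest names instead of real manifest files. The `"false"` default follows the proposal in this thread, and the default of 100 for the minimum count is an assumption.

```python
from typing import List, Mapping


class FastAppend:  # hypothetical, reduced stand-in for the class in this PR
    def __init__(self, existing_manifests: List[str], new_manifest: str) -> None:
        self._existing_manifests = existing_manifests
        self._new_manifest = new_manifest

    def _process_manifests(self, manifests: List[str]) -> List[str]:
        """Hook for subclasses; fast-append commits the manifests unchanged."""
        return manifests

    def manifests(self) -> List[str]:
        return self._process_manifests([self._new_manifest] + self._existing_manifests)


class MergeAppend(FastAppend):
    def __init__(self, existing_manifests: List[str], new_manifest: str, properties: Mapping[str, str]) -> None:
        super().__init__(existing_manifests, new_manifest)
        self._merge_enabled = properties.get("commit.manifest-merge.enabled", "false").lower() == "true"
        self._min_count_to_merge = int(properties.get("commit.manifest.min-count-to-merge", "100"))

    def _process_manifests(self, manifests: List[str]) -> List[str]:
        if not self._merge_enabled or len(manifests) < self._min_count_to_merge:
            return manifests  # behaves exactly like FastAppend
        # Placeholder for rewriting all entries into one new manifest.
        return ["merged(" + "+".join(manifests) + ")"]


props = {"commit.manifest-merge.enabled": "true", "commit.manifest.min-count-to-merge": "2"}
print(FastAppend(["m1", "m0"], "m2").manifests())
print(MergeAppend(["m1", "m0"], "m2", props).manifests())
print(MergeAppend(["m1", "m0"], "m2", {}).manifests())
```

With this shape, a single write path can always construct the merging producer and let the table property decide whether any merging happens.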

Contributor

Thanks! I think the use-case of the Java library is slightly different, since that's mostly used in query engines.

Is it due to some backward compatibility issue?

I think it is for historical reasons, since the fast-append was added later on :)

Contributor

btw, I like how you split it out in classes, it is much cleaner now 👍

Comment on lines 3004 to 3088
output_file_location = _new_manifest_path(
location=self._transaction.table_metadata.location, num=0, commit_uuid=self.commit_uuid
)
with write_manifest(
format_version=self._transaction.table_metadata.format_version,
spec=self._transaction.table_metadata.spec(),
schema=self._transaction.table_metadata.schema(),
- output_file=self._io.new_output(output_file_location),
+ output_file=self.new_manifest_output(),
Contributor Author

@HonahX HonahX Jul 1, 2024

@Fokko Thanks for the detailed code example and stacktrace! With the help of them and #876, I found the root cause of the bug: the collision of the names of manifest files within a commit. I've modified the code to avoid that.

It is hard to find because, when the file is in object storage and FileIO opens a new OutputFile at the same location, the existing file is still readable until the OutputFile is committed. So for integration tests that use MinIO, everything works fine. We won't find any issue until we roll back to some previous snapshot.

For the in-memory SqlCatalog test, since the file is on the local filesystem, the existing file becomes empty/corrupted immediately after we open a new OutputFile at the same location. This behavior causes the ManifestMergeManager to write empty files, and the issue emerges.
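
Both halves of this explanation can be reproduced with the standard library alone. The first part shows that re-opening a local path for writing truncates it immediately. The second part is a hypothetical `ManifestPathGenerator` (not the PR's code) that hands out a fresh manifest number per commit, following the `<commit_uuid>-m<num>.avro` naming visible in the paths above.

```python
import itertools
import os
import tempfile
import uuid

# 1. On a local filesystem, opening an existing path with "wb" empties it right away,
#    long before the second writer has produced any bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "commit-m0.avro")
    with open(path, "wb") as first_writer:
        first_writer.write(b"first manifest")
    second_writer = open(path, "wb")  # same name reused within the commit
    size_while_open = os.path.getsize(path)
    second_writer.close()
print("size seen by a reader while the second writer is open:", size_while_open)


# 2. Avoiding the collision: one counter per commit, so every manifest gets its own name.
class ManifestPathGenerator:  # hypothetical helper
    def __init__(self, location: str, commit_uuid: uuid.UUID) -> None:
        self._location = location
        self._commit_uuid = commit_uuid
        self._counter = itertools.count(0)

    def new_path(self) -> str:
        return f"{self._location}/metadata/{self._commit_uuid}-m{next(self._counter)}.avro"


generator = ManifestPathGenerator("/tmp/some.db/table", uuid.uuid4())
paths = [generator.new_path() for _ in range(3)]
print(paths)
```

A merge step that reads an earlier manifest of the same commit while writing a new one to the same name would hit exactly the empty file from the first part, which matches the EOFError in the traceback.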

I've included a temporary test in test_sql.py to ensure correctness of the current change. I will try to formalize that tomorrow.

Contributor

Thanks for digging into this and fixing it 🙌

Contributor

Fokko commented Jul 4, 2024

Doing some testing with avro-tools, asserting the state after 5 append operations with "commit.manifest.min-count-to-merge": "2"

V1 Table

Manifest-list

5th manifest-list

{
    "manifest_path": "/tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro",
    "manifest_length": 6878,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 5,
    "min_sequence_number": 1,
    "added_snapshot_id": 6508090689697406000,
    "added_files_count": 1,
    "existing_files_count": 4,
    "deleted_files_count": 0,
    "added_rows_count": 3,
    "existing_rows_count": 12,
    "deleted_rows_count": 0,
    "partitions": {
        "array": []
    },
    "key_metadata": null
}

4th manifest-list

{
    "manifest_path": "/tmp/some.db/table/metadata/88807344-0e23-413c-827e-2a9ec63c6233-m1.avro",
    "manifest_length": 6436,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 4,
    "min_sequence_number": 1,
    "added_snapshot_id": 3455109142449701000,
    "added_files_count": 1,
    "existing_files_count": 3,
    "deleted_files_count": 0,
    "added_rows_count": 3,
    "existing_rows_count": 9,
    "deleted_rows_count": 0,
    "partitions": {
        "array": []
    },
    "key_metadata": null
}

Manifests

We have 5 manifest entries as expected:

avro-tools tojson /tmp/some.db/table/metadata/80ba9f84-99af-4af1-b8f5-4caa254645c2-m1.avro | wc -l 
       5

Last one:

{
    "status": 1,
    "snapshot_id": {
        "long": 6508090689697406000
    },
    "data_sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/table/data/00000-0-80ba9f84-99af-4af1-b8f5-4caa254645c2.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

First one:

{
    "status": 0,
    "snapshot_id": {
        "long": 6508090689697406000
    },
    "data_sequence_number": {
        "long": 1
    },
    "file_sequence_number": {
        "long": 1
    },
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/table/data/00000-0-bbd4029c-510a-48e6-a905-ab5b69a832e8.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

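This looks good, except for one thing: the snapshot_id of the existing entry is off. From the spec:

Snapshot id where the file was added, or deleted if status is 2. Inherited when null.

For the first entry this should be the ID of the snapshot created by the first append operation, not the ID of the snapshot that wrote the merged manifest.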
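
To make the expectation concrete, here is a minimal sketch of how a merge could carry an entry over while keeping the snapshot ID where the file was added. The Entry and carry_over names are hypothetical and are not PyIceberg's actual classes; only the status codes (0 existing, 1 added, 2 deleted) come from the spec.

```python
from dataclasses import dataclass, replace
from enum import IntEnum
from typing import Optional


class Status(IntEnum):
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a manifest entry (illustration only)."""

    status: Status
    snapshot_id: Optional[int]  # None means "inherit from the manifest"
    file_path: str


def carry_over(entry: Entry, source_manifest_snapshot_id: int) -> Entry:
    """Rewrite a live entry into a merged manifest as EXISTING.

    The snapshot ID is resolved against the manifest the entry came from,
    so the merged manifest still records where the file was added.
    """
    if entry.status == Status.DELETED:
        raise ValueError("this sketch only handles live (added/existing) entries")
    original_id = entry.snapshot_id if entry.snapshot_id is not None else source_manifest_snapshot_id
    return replace(entry, status=Status.EXISTING, snapshot_id=original_id)


# An entry added by snapshot 111 whose snapshot_id was left null (inherited).
first = Entry(Status.ADDED, None, "data/00000-0-first.parquet")
merged = carry_over(first, source_manifest_snapshot_id=111)
print(merged.status.name, merged.snapshot_id)
```

With this rule the merged entry keeps snapshot ID 111 no matter which later snapshot writes the merged manifest.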

V2 Table

Manifest list

5th manifest-list

{
    "manifest_path": "/tmp/some.db/tablev2/metadata/93717a88-1cea-4e3d-a69a-00ce3d087822-m1.avro",
    "manifest_length": 6883,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 5,
    "min_sequence_number": 1,
    "added_snapshot_id": 898025966831056900,
    "added_files_count": 1,
    "existing_files_count": 4,
    "deleted_files_count": 0,
    "added_rows_count": 3,
    "existing_rows_count": 12,
    "deleted_rows_count": 0,
    "partitions": {
        "array": []
    },
    "key_metadata": null
}

4th manifest-list

{
    "manifest_path": "/tmp/some.db/tablev2/metadata/5c64a07c-4b8a-4be1-a751-d4fd339560e2-m0.avro",
    "manifest_length": 5127,
    "partition_spec_id": 0,
    "content": 0,
    "sequence_number": 1,
    "min_sequence_number": 1,
    "added_snapshot_id": 1343032504684197000,
    "added_files_count": 1,
    "existing_files_count": 0,
    "deleted_files_count": 0,
    "added_rows_count": 3,
    "existing_rows_count": 0,
    "deleted_rows_count": 0,
    "partitions": {
        "array": []
    },
    "key_metadata": null
}

Manifests

last manifest file in manifest-list

{
    "status": 1,
    "snapshot_id": {
        "long": 898025966831056900
    },
    "data_sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/tablev2/data/00000-0-93717a88-1cea-4e3d-a69a-00ce3d087822.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

First manifest in manifest-list

{
    "status": 0,
    "snapshot_id": {
        "long": 898025966831056900
    },
    "data_sequence_number": {
        "long": 1
    },
    "file_sequence_number": {
        "long": 1
    },
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/tablev2/data/00000-0-5c64a07c-4b8a-4be1-a751-d4fd339560e2.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

Except for the snapshot-id and #893 this looks great! 🥳

Fokko commented Jul 4, 2024

Another test with commit.manifest.min-count-to-merge set to 100, and doing 500 append operations:

avro-tools tojson /tmp/some.db/woooo/metadata/snap-3952911087333379496-0-27b9a632-7ee0-4246-aaf2-fc6d8cb1dce5.avro        
{"manifest_path":"/tmp/some.db/woooo/metadata/27b9a632-7ee0-4246-aaf2-fc6d8cb1dce5-m0.avro","manifest_length":5125,"partition_spec_id":0,"content":0,"sequence_number":500,"min_sequence_number":500,"added_snapshot_id":3952911087333379496,"added_files_count":1,"existing_files_count":0,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}
{"manifest_path":"/tmp/some.db/woooo/metadata/dac5af38-f01b-4a59-9e4c-b14a26706e75-m0.avro","manifest_length":5126,"partition_spec_id":0,"content":0,"sequence_number":499,"min_sequence_number":499,"added_snapshot_id":8943105647176444976,"added_files_count":1,"existing_files_count":0,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}
{"manifest_path":"/tmp/some.db/woooo/metadata/ed164af5-dda7-4e3e-9b67-fcb2fd78771b-m0.avro","manifest_length":5125,"partition_spec_id":0,"content":0,"sequence_number":498,"min_sequence_number":498,"added_snapshot_id":723002263384967579,"added_files_count":1,"existing_files_count":0,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}
{"manifest_path":"/tmp/some.db/woooo/metadata/e2d3e14e-8caf-4ca0-9515-c0a19c2a5658-m0.avro","manifest_length":5126,"partition_spec_id":0,"content":0,"sequence_number":497,"min_sequence_number":497,"added_snapshot_id":6977509396340474362,"added_files_count":1,"existing_files_count":0,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}
{"manifest_path":"/tmp/some.db/woooo/metadata/3cc77cfe-b68c-4071-9f70-41cc3933f0af-m1.avro","manifest_length":222800,"partition_spec_id":0,"content":0,"sequence_number":496,"min_sequence_number":1,"added_snapshot_id":7132518699806947299,"added_files_count":1,"existing_files_count":495,"deleted_files_count":0,"added_rows_count":3,"existing_rows_count":1485,"deleted_rows_count":0,"partitions":{"array":[]},"key_metadata":null}

I don't think it merges the manifests as it should:

➜  iceberg-python git:(manifest_compaction) avro-tools tojson /tmp/some.db/woooo/metadata/3cc77cfe-b68c-4071-9f70-41cc3933f0af-m1.avro | wc -l                 
24/07/04 21:04:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
     496
➜  iceberg-python git:(manifest_compaction) avro-tools tojson /tmp/some.db/woooo/metadata/27b9a632-7ee0-4246-aaf2-fc6d8cb1dce5-m0.avro | wc -l
24/07/04 21:04:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
       1

I would expect the manifest-entries to be distributed more evenly over the manifests to ensure maximum parallelization.

HonahX added 3 commits July 9, 2024 22:39
# Conflicts:
#	pyiceberg/table/__init__.py
#	tests/integration/test_writes/test_writes.py

HonahX commented Jul 10, 2024

Another test with commit.manifest.min-count-to-merge set to 100, and doing 500 append operations:

I think the observed behavior aligns with Java's merge_append. Each append adds one manifest. At the 100th append, when the number of manifests reaches 100, the merge manager merges all of them into a new manifest file because they all fall in the same "bin". This happens every time the number of manifests reaches 100, which leaves us with one large manifest and 4 small ones.
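
A rough simulation of that reasoning, as a sketch only: it assumes every manifest fits into a single bin and that a merge fires as soon as the manifest count reaches min-count-to-merge. The Manifest class and simulate_appends function are hypothetical names for illustration, not the PyIceberg or Java implementation.

```python
from dataclasses import dataclass


@dataclass
class Manifest:
    added: int     # files added by the snapshot that wrote this manifest
    existing: int  # files carried over from earlier snapshots


def simulate_appends(num_appends: int, min_count_to_merge: int, files_per_append: int = 1) -> list[Manifest]:
    """Return the manifests of the final snapshot, newest first."""
    manifests: list[Manifest] = []
    for _ in range(num_appends):
        # Every append writes one new manifest containing only added files.
        manifests.insert(0, Manifest(added=files_per_append, existing=0))
        if len(manifests) >= min_count_to_merge:
            # Everything is in one bin: fold the older manifests into the new one.
            carried = sum(m.added + m.existing for m in manifests[1:])
            manifests = [Manifest(added=files_per_append, existing=carried)]
    return manifests


result = simulate_appends(num_appends=500, min_count_to_merge=100)
for m in result:
    print(f"added: {m.added}, existing: {m.existing}")
```

Under these assumptions the merges land on appends 100, 199, 298, 397 and 496, so the final snapshot has four single-file manifests plus one manifest with 1 added and 495 existing files, which matches the manifest list above.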

I used Spark to do a similar thing and got a similar result:

@pytest.mark.integration
def test_spark_ref_behavior(spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table) -> None:
    identifier = "default.test_spark_ref_behavior"
    tbl = _create_table(session_catalog, identifier,
                        {"commit.manifest-merge.enabled": "true", "commit.manifest.min-count-to-merge": "10", "format-version": 2}, [])
    spark_df = spark.createDataFrame(arrow_table_with_null.to_pandas())

    for i in range(50):
        spark_df.writeTo(f"integration.{identifier}").append()
    tbl = session_catalog.load_table(identifier)
    tbl_a_manifests = tbl.current_snapshot().manifests(tbl.io)
    for manifest in tbl_a_manifests:
        print(
            f"Manifest: added: {manifest.added_files_count}, existing: {manifest.existing_files_count}, deleted: {manifest.deleted_files_count}")
=====
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 0, deleted: 0
Manifest: added: 3, existing: 135, deleted: 0

To distribute manifest entries more evenly, I think we need to adjust the commit.manifest.target-size-bytes accordingly since this property controls the size of the bin.
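
As a sketch of why the target size matters, the snippet below packs manifests into bins by byte length with a simple greedy pass. pack_by_size is a hypothetical helper and not the ListPacker added in this PR; the 8 MiB figure is what I believe the default for commit.manifest.target-size-bytes to be, so treat it as an assumption. The 5125-byte length is taken from the small manifests above.

```python
def pack_by_size(manifest_lengths: list[int], target_size_bytes: int) -> list[list[int]]:
    """Greedily group manifest lengths so each bin stays within the target size."""
    bins: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for length in manifest_lengths:
        # Close the current bin when the next manifest would overflow it.
        if current and current_size + length > target_size_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(length)
        current_size += length
    if current:
        bins.append(current)
    return bins


small_manifests = [5125] * 100  # 100 manifests of roughly 5 KB each

one_bin = pack_by_size(small_manifests, target_size_bytes=8 * 1024 * 1024)
four_bins = pack_by_size(small_manifests, target_size_bytes=128 * 1024)
print(len(one_bin), len(four_bins))
```

With the large target all 100 manifests share one bin and get merged into a single file; with a 128 KiB target they split into four bins of 25, which is the kind of spread that would keep entries more evenly distributed.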

I think this also reveals the value of the fast_append + compaction model, which makes things more explicit.

assert tbl_a_data_file["file_path"].startswith("s3://warehouse/default/merge_manifest_a/data/")
if tbl_a_data_file["file_path"] == first_data_file_path:
    # verify that the snapshot id recorded should be the one where the file was added
    assert tbl_a_entries["snapshot_id"][i] == first_snapshot_id

@HonahX HonahX Jul 10, 2024

Added a test to verify the snapshot_id issue.

Fokko commented Jul 10, 2024

To distribute manifest entries more evenly, I think we need to adjust the commit.manifest.target-size-bytes accordingly since this property controls the size of the bin.

Thanks, that actually makes a lot of sense 👍

@Fokko Fokko merged commit 77a07c9 into apache:main Jul 10, 2024
7 checks passed

Fokko commented Jul 10, 2024

Whoo 🥳 Thanks @HonahX for working on this, and thanks @syun64 for the review 🙌

felixscherz added a commit to felixscherz/iceberg-python that referenced this pull request Jul 17, 2024
commit 1ed3abd
Author: Sung Yun <[email protected]>
Date:   Wed Jul 17 02:04:52 2024 -0400

    Allow writing `pa.Table` that are either a subset of table schema or in arbitrary order, and support type promotion on write (apache#921)

    * merge

    * thanks @HonahX :)

    Co-authored-by: Honah J. <[email protected]>

    * support promote

    * revert promote

    * use a visitor

    * support promotion on write

    * fix

    * Thank you @Fokko !

    Co-authored-by: Fokko Driesprong <[email protected]>

    * revert

    * add-files promotiontest

    * support promote for add_files

    * add tests for uuid

    * add_files subset schema test

    ---------

    Co-authored-by: Honah J. <[email protected]>
    Co-authored-by: Fokko Driesprong <[email protected]>

commit 0f2e19e
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Jul 15 23:25:08 2024 -0700

    Bump zstandard from 0.22.0 to 0.23.0 (apache#934)

    Bumps [zstandard](https://github.com/indygreg/python-zstandard) from 0.22.0 to 0.23.0.
    - [Release notes](https://github.com/indygreg/python-zstandard/releases)
    - [Changelog](https://github.com/indygreg/python-zstandard/blob/main/docs/news.rst)
    - [Commits](indygreg/python-zstandard@0.22.0...0.23.0)

    ---
    updated-dependencies:
    - dependency-name: zstandard
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit ec73d97
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Jul 15 23:24:47 2024 -0700

    Bump griffe from 0.47.0 to 0.48.0 (apache#933)

    Bumps [griffe](https://github.com/mkdocstrings/griffe) from 0.47.0 to 0.48.0.
    - [Release notes](https://github.com/mkdocstrings/griffe/releases)
    - [Changelog](https://github.com/mkdocstrings/griffe/blob/main/CHANGELOG.md)
    - [Commits](mkdocstrings/griffe@0.47.0...0.48.0)

    ---
    updated-dependencies:
    - dependency-name: griffe
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit d05a423
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Jul 15 23:24:16 2024 -0700

    Bump mkdocs-material from 9.5.28 to 9.5.29 (apache#932)

    Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.28 to 9.5.29.
    - [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
    - [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
    - [Commits](squidfunk/mkdocs-material@9.5.28...9.5.29)

    ---
    updated-dependencies:
    - dependency-name: mkdocs-material
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit e27cd90
Author: Yair Halevi (Spock) <[email protected]>
Date:   Sun Jul 14 22:11:04 2024 +0300

    Allow empty `names` in mapped field of Name Mapping (apache#927)

    * Remove check_at_least_one field validator

    Iceberg spec permits an empty list of names in the default name mapping. check_at_least_one is therefore unnecessary.

    * Remove irrelevant test case

    * Fixing pydantic model

    No longer requiring minimum length of names list to be 1.

    * Added test case for empty names in name mapping

    * Fixed formatting error

commit 3f44dfe
Author: Soumya Ghosh <[email protected]>
Date:   Sun Jul 14 00:35:38 2024 +0530

    Lowercase bool values in table properties (apache#924)

commit b11cdb5
Author: Sung Yun <[email protected]>
Date:   Fri Jul 12 16:45:04 2024 -0400

    Deprecate to_requested_schema (apache#918)

    * deprecate to_requested_schema

    * prep for release

commit a3dd531
Author: Honah J <[email protected]>
Date:   Fri Jul 12 13:14:40 2024 -0700

    Glue endpoint config variable, continue apache#530 (apache#920)

    Co-authored-by: Seb Pretzer <[email protected]>

commit 32e8f88
Author: Sung Yun <[email protected]>
Date:   Fri Jul 12 15:26:00 2024 -0400

    support PyArrow timestamptz with Etc/UTC (apache#910)

    Co-authored-by: Fokko Driesprong <[email protected]>

commit f6d56e9
Author: Sung Yun <[email protected]>
Date:   Fri Jul 12 05:31:06 2024 -0400

    fix invalidation logic (apache#911)

commit 6488ad8
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Jul 11 22:56:48 2024 -0700

    Bump coverage from 7.5.4 to 7.6.0 (apache#917)

    Bumps [coverage](https://github.com/nedbat/coveragepy) from 7.5.4 to 7.6.0.
    - [Release notes](https://github.com/nedbat/coveragepy/releases)
    - [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
    - [Commits](nedbat/coveragepy@7.5.4...7.6.0)

    ---
    updated-dependencies:
    - dependency-name: coverage
      dependency-type: direct:development
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit dceedfa
Author: Sung Yun <[email protected]>
Date:   Thu Jul 11 20:32:14 2024 -0400

    Check if schema is compatible in `add_files` API (apache#907)

    Co-authored-by: Fokko Driesprong <[email protected]>

commit aceed2a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Jul 11 15:52:06 2024 +0200

    Bump mypy-boto3-glue from 1.34.136 to 1.34.143 (apache#912)

    Bumps [mypy-boto3-glue](https://github.com/youtype/mypy_boto3_builder) from 1.34.136 to 1.34.143.
    - [Release notes](https://github.com/youtype/mypy_boto3_builder/releases)
    - [Commits](https://github.com/youtype/mypy_boto3_builder/commits)

    ---
    updated-dependencies:
    - dependency-name: mypy-boto3-glue
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 1b9b884
Author: Fokko Driesprong <[email protected]>
Date:   Thu Jul 11 12:45:20 2024 +0200

    PyArrow: Don't enforce the schema when reading/writing (apache#902)

    * PyArrow: Don't enforce the schema

    PyIceberg struggled with the different type of arrow, such as
    the `string` and `large_string`. They represent the same, but are
    different under the hood.

    My take is that we should hide these kind of details from the user
    as much as possible. Now we went down the road of passing in the
    Iceberg schema into Arrow, but when doing this, Iceberg has to
    decide if it is a large or non-large type.

    This PR removes passing down the schema in order to let Arrow decide
    unless:

     - The type should be evolved
     - In case of re-ordering, we reorder the original types

    * WIP

    * Reuse Table schema

    * Make linter happy

    * Squash some bugs

    * Thanks Sung!

    Co-authored-by: Sung Yun <[email protected]>

    * Moar code moar bugs

    * Remove the variables wrt file sizes

    * Linting

    * Go with large ones for now

    * Missed one there!

    ---------

    Co-authored-by: Sung Yun <[email protected]>

commit 8f47dfd
Author: Soumya Ghosh <[email protected]>
Date:   Thu Jul 11 11:52:55 2024 +0530

    Move determine_partitions and helper methods to io.pyarrow (apache#906)

commit 5aa451d
Author: Soumya Ghosh <[email protected]>
Date:   Thu Jul 11 07:57:05 2024 +0530

    Rename data_sequence_number to sequence_number in ManifestEntry (apache#900)

commit 77a07c9
Author: Honah J <[email protected]>
Date:   Wed Jul 10 03:56:13 2024 -0700

    Support MergeAppend operations (apache#363)

    * add ListPacker + tests

    * add merge append

    * add merge_append

    * fix snapshot inheritance

    * test manifest file and entries

    * add doc

    * fix lint

    * change test name

    * address review comments

    * rename _MergingSnapshotProducer to _SnapshotProducer

    * fix a serious bug

    * update the doc

    * remove merge_append as public API

    * make default to false

    * add test description

    * fix merge conflict

    * fix snapshot_id issue

commit 66b92ff
Author: Fokko Driesprong <[email protected]>
Date:   Wed Jul 10 10:09:20 2024 +0200

    GCS: Fix incorrect token description (apache#909)

commit c25e080
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Jul 9 20:50:29 2024 -0700

    Bump zipp from 3.17.0 to 3.19.1 (apache#905)

    Bumps [zipp](https://github.com/jaraco/zipp) from 3.17.0 to 3.19.1.
    - [Release notes](https://github.com/jaraco/zipp/releases)
    - [Changelog](https://github.com/jaraco/zipp/blob/main/NEWS.rst)
    - [Commits](jaraco/zipp@v3.17.0...v3.19.1)

    ---
    updated-dependencies:
    - dependency-name: zipp
      dependency-type: indirect
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 301e336
Author: Sung Yun <[email protected]>
Date:   Tue Jul 9 23:35:11 2024 -0400

    Cast 's', 'ms' and 'ns' PyArrow timestamp to 'us' precision on write (apache#848)

commit 3f574d3
Author: Fokko Driesprong <[email protected]>
Date:   Tue Jul 9 11:36:43 2024 +0200

    Support partial deletes (apache#569)

    * Add option to delete datafiles

    This is done through the Iceberg metadata, resulting
    in efficient deletes if the data is partitioned correctly

    * Pull in main

    * WIP

    * Change DataScan to accept Metadata and io

    For the partial deletes I want to do a scan on in
    memory metadata. Changing this API allows this.

    * fix name-mapping issue

    * WIP

    * WIP

    * Moar tests

    * Oops

    * Cleanup

    * WIP

    * WIP

    * Fix summary generation

    * Last few bits

    * Fix the requirement

    * Make ruff happy

    * Comments, thanks Kevin!

    * Comments

    * Append rather than truncate

    * Fix merge conflicts

    * Make the tests pass

    * Add another test

    * Conflicts

    * Add docs (apache#33)

    * docs

    * docs

    * Add a partitioned overwrite test

    * Fix comment

    * Skip empty manifests

    ---------

    Co-authored-by: HonahX <[email protected]>
    Co-authored-by: Sung Yun <[email protected]>

commit cdc3e54
Author: Fokko Driesprong <[email protected]>
Date:   Tue Jul 9 08:28:27 2024 +0200

    Disallow writing empty Manifest files (apache#876)

    * Disallow writing empty Avro files/blocks

    Raising an exception when doing this might look extreme, but
    there is no real good reason to allow this.

    * Relax the constraints a bit

commit b68e109
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Jul 8 22:16:23 2024 -0700

    Bump fastavro from 1.9.4 to 1.9.5 (apache#904)

    Bumps [fastavro](https://github.com/fastavro/fastavro) from 1.9.4 to 1.9.5.
    - [Release notes](https://github.com/fastavro/fastavro/releases)
    - [Changelog](https://github.com/fastavro/fastavro/blob/master/ChangeLog)
    - [Commits](fastavro/fastavro@1.9.4...1.9.5)

    ---
    updated-dependencies:
    - dependency-name: fastavro
      dependency-type: direct:development
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 90547bb
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Jul 8 22:15:39 2024 -0700

    Bump moto from 5.0.10 to 5.0.11 (apache#903)

    Bumps [moto](https://github.com/getmoto/moto) from 5.0.10 to 5.0.11.
    - [Release notes](https://github.com/getmoto/moto/releases)
    - [Changelog](https://github.com/getmoto/moto/blob/master/CHANGELOG.md)
    - [Commits](getmoto/moto@5.0.10...5.0.11)

    ---
    updated-dependencies:
    - dependency-name: moto
      dependency-type: direct:development
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 7dff359
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Jul 7 07:50:19 2024 +0200

    Bump tenacity from 8.4.2 to 8.5.0 (apache#898)

commit 4aa469e
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sat Jul 6 22:30:59 2024 +0200

    Bump certifi from 2024.2.2 to 2024.7.4 (apache#899)

    Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4.
    - [Commits](certifi/python-certifi@2024.02.02...2024.07.04)

    ---
    updated-dependencies:
    - dependency-name: certifi
      dependency-type: indirect
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit aa7ad78
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sat Jul 6 20:37:51 2024 +0200

    Bump deptry from 0.16.1 to 0.16.2 (apache#897)

    Bumps [deptry](https://github.com/fpgmaas/deptry) from 0.16.1 to 0.16.2.
    - [Release notes](https://github.com/fpgmaas/deptry/releases)
    - [Changelog](https://github.com/fpgmaas/deptry/blob/main/CHANGELOG.md)
    - [Commits](fpgmaas/deptry@0.16.1...0.16.2)

    ---
    updated-dependencies:
    - dependency-name: deptry
      dependency-type: direct:development
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>