[suggestion] Write path optimization #926

Open
kevinjqliu opened this issue Jul 13, 2024 · 0 comments

Feature Request / Improvement

Let's investigate the levels of abstraction on the write path.

Currently, we do schema-compatibility checks, schema coercion, bin-packing, transformation, etc. at different levels of the stack. It would be good to review these and see which functions can be pushed up the stack.

For example, here's what the overwrite path looks like:

overwrite
	_dataframe_to_data_files
		write_file
			write_parquet 

(copied over from #910 (review))

Another example #786 (comment)
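
To make the suggestion concrete, here is a minimal sketch of what hoisting the schema work to the top of the stack could look like. Every name below is hypothetical, not pyiceberg's actual API; the point is only that the layers below the top would be left with splitting and serialization.

import pyarrow as pa

# Hypothetical sketch, not pyiceberg's API: do the schema check/coercion once
# at the top of the write path, so the layers below never touch schemas.
def overwrite_sketch(df: pa.Table, table_schema: pa.Schema) -> None:
    df = df.cast(table_schema)  # check + coerce once, up front
    for batch in df.to_batches(max_chunksize=10_000):
        write_batch_sketch(batch)

def write_batch_sketch(batch: pa.RecordBatch) -> None:
    # stand-in for write_file/write_parquet with no schema logic left in it
    ...

overwrite_sketch(pa.table({"x": [1, 2, 3]}), pa.schema([pa.field("x", pa.int64())]))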

More info

overwrite checks schema compatibility:

_check_schema_compatible(
    self._table.schema(), other_schema=df.schema, downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us
)
self.delete(delete_filter=overwrite_filter, snapshot_properties=snapshot_properties)
with self.update_snapshot(snapshot_properties=snapshot_properties).fast_append() as update_snapshot:
    # skip writing data files if the dataframe is empty
    if df.shape[0] > 0:
        data_files = _dataframe_to_data_files(
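
For illustration, the core of such a compatibility check might look like the sketch below. This is a simplification under assumed semantics; the real _check_schema_compatible also handles type promotion and the ns-to-us timestamp downcast, and compares against the Iceberg schema rather than a plain arrow schema.

import pyarrow as pa

# Simplified sketch of the idea behind _check_schema_compatible: every field in
# the incoming dataframe must exist in the table schema with a matching type.
def check_schema_compatible_sketch(table_schema: pa.Schema, other: pa.Schema) -> None:
    for field in other:
        idx = table_schema.get_field_index(field.name)
        if idx == -1:
            raise ValueError(f"Field '{field.name}' is not in the table schema")
        if table_schema.field(idx).type != field.type:
            raise ValueError(f"Type mismatch for '{field.name}'")

check_schema_compatible_sketch(
    pa.schema([pa.field("x", pa.int64())]),
    pa.schema([pa.field("x", pa.int64())]),
)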

_dataframe_to_data_files bin-packs the pyarrow Table:

tasks=iter([
    WriteTask(write_uuid=write_uuid, task_id=next(counter), record_batches=batches, schema=table_metadata.schema())
    for batches in bin_pack_arrow_table(df, target_file_size)
]),
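
bin_pack_arrow_table groups the table's record batches so that each group (one WriteTask, i.e. one data file) lands near the target file size. A greedy sketch of that idea, not the actual implementation:

from typing import Iterator, List

import pyarrow as pa

# Greedy sketch of the bin-packing idea: accumulate record batches until the
# next one would push the bin past the target size, then start a new bin.
def bin_pack_sketch(tbl: pa.Table, target_size: int) -> Iterator[List[pa.RecordBatch]]:
    bin_: List[pa.RecordBatch] = []
    bin_bytes = 0
    for batch in tbl.to_batches():
        if bin_ and bin_bytes + batch.nbytes > target_size:
            yield bin_
            bin_, bin_bytes = [], 0
        bin_.append(batch)
        bin_bytes += batch.nbytes
    if bin_:
        yield bin_

for batches in bin_pack_sketch(pa.table({"x": list(range(100_000))}), target_size=1 << 18):
    print(sum(b.nbytes for b in batches))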

write_parquet transforms the table schema:

table_schema = task.schema
# if schema needs to be transformed, use the transformed schema and adjust the arrow table accordingly
# otherwise use the original schema
if (sanitized_schema := sanitize_column_names(table_schema)) != table_schema:
    file_schema = sanitized_schema
else:
    file_schema = table_schema

and then converts each record batch to the chosen file schema:
batches = [
    _to_requested_schema(
        requested_schema=file_schema,
        file_schema=table_schema,
        batch=batch,
        downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us,
        include_field_ids=True,
    )
    for batch in task.record_batches
]
arrow_table = pa.Table.from_batches(batches)
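
Put together, the transformation step amounts to: pick a file schema (sanitized names if needed), convert every batch to it, and reassemble a table. A self-contained sketch of that flow; the sanitization rule here is made up (the real sanitize_column_names produces Avro-compatible names), and the real _to_requested_schema also handles field IDs and timestamp downcasting:

import re

import pyarrow as pa

# Illustrative stand-in for sanitize_column_names; the replacement rule is invented.
def sanitize_schema_sketch(schema: pa.Schema) -> pa.Schema:
    return pa.schema([f.with_name(re.sub(r"[^A-Za-z0-9_]", "_", f.name)) for f in schema])

# Illustrative stand-in for _to_requested_schema: cast each column to the
# requested field type and rebuild the batch under the requested schema.
def to_requested_schema_sketch(batch: pa.RecordBatch, requested: pa.Schema) -> pa.RecordBatch:
    arrays = [batch.column(i).cast(field.type) for i, field in enumerate(requested)]
    return pa.RecordBatch.from_arrays(arrays, schema=requested)

source = pa.table({"my col!": pa.array([1, 2, 3], type=pa.int32())})
table_schema = source.schema
if (sanitized_schema := sanitize_schema_sketch(table_schema)) != table_schema:
    file_schema = sanitized_schema
else:
    file_schema = table_schema

batches = [to_requested_schema_sketch(b, file_schema) for b in source.to_batches()]
arrow_table = pa.Table.from_batches(batches)
print(arrow_table.schema)  # "my col!" is written as "my_col_", type unchanged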
