Write performance degrades with multiple writers #2683

Closed · Aiden-Frost opened this issue Jul 17, 2024 · 2 comments

Labels
bug Something isn't working

Comments

@Aiden-Frost

Environment

Delta-rs version: 0.18.2

Binding: rust

Environment:
  • OS: macOS 14.5
  • Cloud provider: AWS

Issue

Currently, writing a dataset (10,000 rows, 50 columns) to a Delta Lake table in S3 takes approximately 7 seconds. As the number of writers increases, the write time also increases (with 8 writers it is around 25 seconds). For reference, writing only Parquet files with pyarrow takes around 2.5 seconds, irrespective of the number of workers. Is this write latency expected, or are there optimizations that can be done to improve write performance?
Note: the number of writers was increased using Python multiprocessing.

How to reproduce it:

import pandas as pd
import pyarrow as pa
from deltalake import write_deltalake

import time
import random

# Create a dataframe with 50 columns and 10,000 rows
num_columns = 50
num_rows = 10000

data = {
    f'col{i+1}': [f'row{random.randint(1, 2)}' for j in range(num_rows)] for i in range(num_columns)
}

df = pd.DataFrame(data)

table = pa.Table.from_pandas(df)

partition_cols = ["col1", "col2", "col3", "col4"]

# Use DynamoDB as the locking provider so concurrent S3 writers are safe
storage_options = {
    'AWS_S3_LOCKING_PROVIDER': 'dynamodb',
    'DELTA_DYNAMO_TABLE_NAME': '_delta_log',
}

# Time a single append to the Delta table in S3
start = time.time()
write_deltalake("s3a://test_deltalake_1", table, partition_by=partition_cols,
                mode="append", engine="rust", storage_options=storage_options)
end = time.time()

print("Time taken: ", end - start)

# For comparison: write plain Parquet files (no Delta transaction log)
import pyarrow.parquet as pq
from s3fs import S3FileSystem

start = time.time()

fs = S3FileSystem()

pq.write_to_dataset(
    table, root_path="s3a://test_parquet_1", partition_cols=partition_cols,
    filesystem=fs
)

end = time.time()

print("Time taken: ", end - start)
@Aiden-Frost added the bug (Something isn't working) label on Jul 17, 2024
@ion-elgreco
Collaborator

There is write contention, so conflict resolution comes into play; that is likely the source of the additional latency. In the future, delta-rs might adopt managed commits, which delegate that work to an external system and should potentially improve this.

In the meantime, if you can, write to isolated partitions with an explicit partition filter.
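
A minimal sketch of what that could look like (the per-worker routing and the col1 values are assumptions, not from this thread): each writer appends only rows for its own partition value, so concurrent commits touch disjoint partitions.

import pyarrow.compute as pc
from deltalake import write_deltalake

def append_own_partition(worker_id, table, storage_options):
    # Hypothetical routing: keep only rows whose partition column matches
    # this worker, so each process commits to an isolated partition.
    subset = table.filter(pc.equal(table["col1"], f"row{worker_id}"))
    write_deltalake(
        "s3a://test_deltalake_1",
        subset,
        partition_by=["col1", "col2", "col3", "col4"],
        mode="append",
        engine="rust",
        storage_options=storage_options,
    )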

@rtyler
Member

rtyler commented Aug 10, 2024

I'll echo what @ion-elgreco said, but also add that the best way to avoid contention in S3 is to make the data file writes as large as makes sense. Lots of small Parquet file writes with lots of small transactions will lead to increased contention.
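
A minimal sketch of that batching idea (the flush helper and batch list are assumptions, not from this thread): accumulate small tables and commit them as one larger write instead of one transaction per batch.

import pyarrow as pa
from deltalake import write_deltalake

def flush(batches, storage_options):
    # Combine many small tables into one larger append: fewer, bigger
    # Parquet files and a single commit instead of one per batch.
    if not batches:
        return
    combined = pa.concat_tables(batches)
    write_deltalake(
        "s3a://test_deltalake_1",
        combined,
        partition_by=["col1", "col2", "col3", "col4"],
        mode="append",
        engine="rust",
        storage_options=storage_options,
    )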

@rtyler closed this as completed on Aug 10, 2024