Currently, writing a dataset (10,000 rows, 50 columns) to a Delta Lake table in S3 takes approximately 7 seconds. As the number of concurrent writers increases, the time taken per write also increases (with 8 writers it is around 25 seconds). For reference, writing plain parquet files with pyarrow takes around 2.5 seconds regardless of the number of workers. Is this expected write latency, or are there optimizations that can improve the write performance?
Note: The number of writers was increased using Python multiprocessing; a sketch of the multi-writer variant follows the single-writer repro below.
How to reproduce it:
import pandas as pd
import pyarrow as pa
from deltalake import write_deltalake
import time
import random

# Create a dataframe with 50 columns and 10,000 rows; each column holds only
# two distinct values, so col1-col4 work as low-cardinality partition columns.
num_columns = 50
num_rows = 10000
data = {
    f'col{i+1}': [f'row{random.randint(1, 2)}' for j in range(num_rows)]
    for i in range(num_columns)
}
df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)
partition_cols = ["col1", "col2", "col3", "col4"]

# DynamoDB is used as the locking provider for safe concurrent S3 writes.
storage_options = {
    'AWS_S3_LOCKING_PROVIDER': 'dynamodb',
    'DELTA_DYNAMO_TABLE_NAME': '_delta_log',
}

# Time a single append to the Delta table.
start = time.time()
write_deltalake(
    "s3a://test_deltalake_1", table, partition_by=partition_cols,
    mode="append", engine="rust", storage_options=storage_options,
)
end = time.time()
print("Time taken: ", end - start)

# For comparison: write the same table as plain parquet files via s3fs.
import pyarrow.parquet as pq
from s3fs import S3FileSystem

start = time.time()
fs = S3FileSystem()
pq.write_to_dataset(
    root_path="s3a://test_parquet_1", table=table,
    partition_cols=partition_cols, filesystem=fs,
)
end = time.time()
print("Time taken: ", end - start)
There is write contention, so conflict resolution comes into play; that is likely the source of the added latency. In the future delta-rs might adopt managed commits, which delegate that work to an external system and should improve this.
In the meantime, if you can, write to isolated partitions with an explicit partition filter.
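Concretely, that could look something like the sketch below, which reuses the names from the repro above. The per-writer partition assignment and the helper are assumptions for illustration: each writer appends only the rows belonging to its own partition value, so concurrent commits touch disjoint partitions and conflict resolution has less to reconcile.

import pyarrow.compute as pc

def append_partition_slice(table, value):
    # Hypothetical helper: keep only the rows where col1 == value, so this
    # writer's commit touches a single partition subtree.
    slice_ = table.filter(pc.equal(table["col1"], value))
    write_deltalake(
        "s3a://test_deltalake_1", slice_, partition_by=partition_cols,
        mode="append", engine="rust", storage_options=storage_options,
    )

# e.g. writer 1 handles "row1", writer 2 handles "row2"
append_partition_slice(table, "row1")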
I'll echo what @ion-elgreco said, but also add that the best way to avoid contention in S3 is to make the data file writes as large as makes sense. Lots of small parquet files written in lots of small transactions will lead to increased contention.
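Applied to the repro above, that advice amounts to batching: rather than committing each small table as its own transaction, accumulate several payloads and commit them once. A hedged sketch, again reusing names from the repro; the three-way batch is purely illustrative:

# Fewer, larger commits: concatenate small payloads and append once,
# instead of one transaction (and one set of small files) per payload.
batches = [table, table, table]  # illustrative stand-in for small payloads
combined = pa.concat_tables(batches)
write_deltalake(
    "s3a://test_deltalake_1", combined, partition_by=partition_cols,
    mode="append", engine="rust", storage_options=storage_options,
)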
Environment
Delta-rs version: 0.18.2
Binding: rust
Environment: macOS 14.5