Performance issues when loading data from S3 #36587

Open
shaktishp opened this issue Jul 10, 2023 · 15 comments

@shaktishp

Describe the usage question you have. Please include as many useful details as possible.

We are doing a small POC comparing performance when loading Parquet files directly from S3 versus from the local file system.

Jupyter notebook code snippet (the notebook and S3 bucket are in the same region):

import pyarrow.dataset as ds
import time

s3Dataset = ds.dataset('location')  # S3 or local file system
scanner = s3Dataset.scanner()
batches = scanner.to_batches()
st = time.time()

for batch in batches:
    batch.num_rows

et = time.time()

elapsed = et - st

code ends here

Total files: 50, total size: ~500 MB.

Time taken to read from S3: 970 sec
Time taken to read from the local file system: 5 sec

Any insight into why S3 is taking so much time?

Are there any settings we are missing when reading files from S3 that would give decent performance?

Any help appreciated.

Component(s)

FlightRPC, Parquet, Python, Other

@mapleFU
Member

mapleFU commented Jul 10, 2023

Generally, reading from S3 will usually be much slower than reading from a local file, but this slowdown seems far too large. Would you mind doing a profile to find which part is spending the most time blocking?

Also, to_batches can take a filter and some read-ahead arguments, and you can configure the scan options and threads; I would try enlarging them.
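For example, something along these lines (a sketch with illustrative values; batch_readahead and fragment_readahead are accepted by Dataset.scanner() / to_batches() in recent pyarrow versions):

import pyarrow.dataset as ds

dataset = ds.dataset('location')  # S3 URI or local path, as in the snippet above

# Larger read-ahead lets more files/batches be fetched concurrently,
# which helps hide per-request S3 latency.
scanner = dataset.scanner(
    use_threads=True,       # scan fragments in parallel (the default)
    batch_readahead=32,     # batches to prefetch within a fragment (default 16)
    fragment_readahead=8,   # files to prefetch concurrently (default 4)
)

for batch in scanner.to_batches():
    batch.num_rows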

@shaktishp
Author

Will do the profiling and share the details with you.

@westonpace
Member

S3 performance will depend on your connection to the S3 servers. Are you running this test on a local device, or on an EC2 server?

If it's a local device, what kind of connection do you have to the internet? 970 seconds for 500 MB of data is very slow. That is only about 0.5 MB/s, so any decent connection should be faster than this. If you set the environment variable AWS_EC2_METADATA_DISABLED (e.g. export AWS_EC2_METADATA_DISABLED=true), does it have any effect on performance?
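For reference, a minimal sketch of setting that variable from Python (assuming it is set before the first S3 access so the AWS SDK sees it; exporting it in the shell before starting the notebook should be equivalent):

import os

# Disable the EC2 metadata lookup before pyarrow touches S3.
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

import pyarrow.dataset as ds
dataset = ds.dataset('location')  # S3 URI as in the original snippet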

@shaktishp
Author

@westonpace The code is running in a Kubernetes pod.

@westonpace
Member

Are you able to try setting the environment variable I suggested?

@shaktishp
Author

I tried setting the env variable AWS_EC2_METADATA_DISABLED=true, but I was not able to connect to S3.

@westonpace
Member

I am a little confused then.

When any operation is run by the S3 filesystem, the AWS SDK will attempt to determine credentials for that action. Typically this is done by looking in the user's config file (e.g. ~/.aws/config). If this configuration file is not found, then it will attempt to contact a special IP address that EC2 machines have configured, which tells the EC2 machine what its configuration is.

This attempt to contact that special IP address can be very slow, depending on the network configuration of the machine (sometimes it will spend minutes waiting for a timeout). Setting the variable AWS_EC2_METADATA_DISABLED will disable the check, but that should only affect your connection if you are on an EC2 machine to begin with. So I do not understand how setting that variable to true can cause connection issues to S3.

Can you add these lines to the top of your script (these lines must come before you import any other pyarrow module)? This will add additional debugging information that might help us understand what is happening:

import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)
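The same logging can also be enabled through the public pyarrow.fs API (a sketch, assuming a pyarrow version where these names are exposed; it must likewise run before any S3 access):

from pyarrow.fs import initialize_s3, S3LogLevel

# Turn on verbose S3 trace logging for the whole process.
initialize_s3(S3LogLevel.Trace)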

@shaktishp
Author

Thanks for that info. Let me try that.

@shaktishp
Author

I am trying this in my Jupyter notebook. I can see the logs printed now, but nothing stands out to me. Is there anything I should look out for? Also, one thing I noticed is that my code snippet never finishes.

Also, I tried using AWS_EC2_METADATA_DISABLED=true and I got access denied.

@westonpace
Member

Is there anything I should look out for? Also, one thing I noticed is that my code snippet never finishes.

I'm interested in the timing. I was looking to see whether the 970s was spent waiting for responses from S3, spent trying to resolve configuration, or spent in some other way. It should be possible to determine this by correlating the timestamps in the trace logging.

You could redirect the log output to a file to make it easier to inspect.

Also, I tried using AWS_EC2_METADATA_DISABLED=true and I got access denied.

You mentioned that your code is running in a Kubernetes pod. Is this pod on an EC2 instance? Is this perhaps EKS?

@shaktishp
Author

So the time is spent while looping through the record batches. I will try to redirect the logs.

The code is running on EKS.

@mapleFU
Member

mapleFU commented Jul 19, 2023

#36765 Would this be the same issue? @westonpace

@westonpace
Member

#36765 Would this be the same issue? @westonpace

I would not expect #36765 to cause this large of a delay.

@Akshay-A-Kulkarni

Akshay-A-Kulkarni commented Jul 19, 2023

@shaktishp could you try running this?

ds.dataset(
    'location', 
    format=ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(
            pre_buffer=True
        )
    )
)

This may speed up your S3 read time.
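The same option can also be passed at scan time via fragment_scan_options (a sketch along the same lines; per the pyarrow documentation, pre_buffer coalesces the byte ranges a Parquet read needs and issues them in parallel, which reduces the number of small, high-latency S3 requests):

import pyarrow.dataset as ds

dataset = ds.dataset('location')
scanner = dataset.scanner(
    fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)
for batch in scanner.to_batches():
    batch.num_rows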

@xshirax

xshirax commented Aug 13, 2024

Hi @Akshay-A-Kulkarni
I stumbled upon this issue because I was seeing major read-performance problems for files in S3 when using dataset.to_batches.
Your suggestion in the previous comment helped A LOT.
Would you care to explain what it does exactly and why it affects the read time so much?
Thanks!
