Performance issues when loading data from S3 #36587

Open
shaktishp opened this issue Jul 10, 2023 · 15 comments

@shaktishp

Describe the usage question you have. Please include as many useful details as possible.

We are doing a small POC comparing performance when loading Parquet files directly from S3 versus from the local file system.

Jupyter notebook code snippet (the notebook and S3 bucket are in the same region):

import pyarrow.dataset as ds
import time

s3Dataset = ds.dataset('location')  # S3 or local file system
scanner = s3Dataset.scanner()
batches = scanner.to_batches()
st = time.time()

for batch in batches:
    batch.num_rows

et = time.time()

elapsed = et - st

code ends here

Total files: 50, total size: ~500 MB.

Time taken to read from S3: 970 sec
Time taken to read from the local file system: 5 sec

Any insight into why S3 is taking so much time?

Are there any settings we are missing when reading files from S3 that would give decent performance?

Any help appreciated.

Component(s)

FlightRPC, Parquet, Python, Other

@mapleFU
Member

mapleFU commented Jul 10, 2023

Generally, reading from S3 will usually be much slower than reading from a local file, but this slowdown seems far too large. Would you mind doing a profile to find which part is spending the most time blocking?

Also, to_batches can take a filter and some read-ahead arguments, and you can configure the scan options and threads; I would try enlarging them.
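For example, something along these lines (a sketch with illustrative values; batch_readahead and fragment_readahead are accepted by Dataset.scanner() / to_batches() in recent pyarrow versions):

import pyarrow.dataset as ds

dataset = ds.dataset('location')  # S3 URI or local path, as in the snippet above

# Larger read-ahead lets more files/batches be fetched concurrently,
# which helps hide per-request S3 latency.
scanner = dataset.scanner(
    use_threads=True,       # scan fragments in parallel (the default)
    batch_readahead=32,     # batches to prefetch within a fragment (default 16)
    fragment_readahead=8,   # files to prefetch concurrently (default 4)
)

for batch in scanner.to_batches():
    batch.num_rows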

@shaktishp
Author

Will do the profiling and share the details with you.

@westonpace
Member

S3 performance will depend on your connection to the S3 servers. Are you running this test on a local device, or on an EC2 server?

If it's a local device, what kind of connection do you have to the internet? 970 seconds for 500 MB of data is very slow. That is only about 0.5 MB/s, so any decent connection should be faster than this. If you set the environment variable AWS_EC2_METADATA_DISABLED (e.g. export AWS_EC2_METADATA_DISABLED=true), does it have any effect on performance?
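For reference, a minimal sketch of setting that variable from Python (assuming it is set before the first S3 access so the AWS SDK sees it; exporting it in the shell before starting the notebook should be equivalent):

import os

# Disable the EC2 metadata lookup before pyarrow touches S3.
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

import pyarrow.dataset as ds
dataset = ds.dataset('location')  # S3 URI as in the original snippet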

@shaktishp
Author

@westonpace The code is running in a Kubernetes pod.

@westonpace
Member

Are you able to try setting the environment variable I suggested?

@shaktishp
Author

I tried setting the env variable AWS_EC2_METADATA_DISABLED=true, but I was not able to connect to S3.

@westonpace
Member

I am a little confused then.

When any operation is run by the S3 filesystem, the AWS SDK will attempt to determine credentials for that action. Typically this is done by looking in the user's config file (e.g. ~/.aws/config). If this configuration file is not found, then it will attempt to contact a special IP address that EC2 machines have configured, which tells the EC2 machine what its configuration is.

This attempt to contact that special IP address can be very slow, depending on the network configuration of the machine (sometimes it will spend minutes waiting for a timeout). Setting the variable AWS_EC2_METADATA_DISABLED will disable the check, but that should only affect your connection if you are on an EC2 machine to begin with. So I do not understand how setting that variable to true can cause connection issues to S3.

Can you add these lines to the top of your script (these lines must come before you import any other pyarrow module)? This will add additional debugging information that might help us understand what is happening:

import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)
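The same logging can also be enabled through the public pyarrow.fs API (a sketch, assuming a pyarrow version where these names are exposed; it must likewise run before any S3 access):

from pyarrow.fs import initialize_s3, S3LogLevel

# Turn on verbose S3 trace logging for the whole process.
initialize_s3(S3LogLevel.Trace)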

@shaktishp
Author

Thanks for that info. Let me try that.

@shaktishp
Author

I am trying this in my Jupyter notebook. I can see the logs printed now, but nothing stands out to me. Is there anything I should look out for? Also, one thing I noticed is that my code snippet never finishes.

Also, I tried using AWS_EC2_METADATA_DISABLED=true and I got access denied.

@westonpace
Member

Is there anything I should look out for? Also, one thing I noticed is that my code snippet never finishes.

I'm interested in the timing. I was looking to see whether the 970s was spent waiting for responses from S3, spent trying to resolve configuration, or spent in some other way. It should be possible to determine this by correlating the timestamps in the trace logging.

You could redirect the log output to a file to make it easier to inspect.

Also, I tried using AWS_EC2_METADATA_DISABLED=true and I got access denied.

You mentioned that your code is running in a Kubernetes pod. Is this pod on an EC2 instance? Is this perhaps EKS?

@shaktishp
Author

So the time is spent while looping through the record batches. I will try to redirect the logs.

The code is running on EKS.

@mapleFU
Member

mapleFU commented Jul 19, 2023

#36765 Would this be the same issue? @westonpace

@westonpace
Member

#36765 Would this be the same issue? @westonpace

I would not expect #36765 to cause this large of a delay.

@Akshay-A-Kulkarni

Akshay-A-Kulkarni commented Jul 19, 2023

@shaktishp could you try running this?

ds.dataset(
    'location', 
    format=ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(
            pre_buffer=True
        )
    )
)

This may speed up your S3 read time.
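The same option can also be passed at scan time via fragment_scan_options (a sketch along the same lines; per the pyarrow documentation, pre_buffer coalesces the byte ranges a Parquet read needs and issues them in parallel, which reduces the number of small, high-latency S3 requests):

import pyarrow.dataset as ds

dataset = ds.dataset('location')
scanner = dataset.scanner(
    fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)
for batch in scanner.to_batches():
    batch.num_rows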

@xshirax

xshirax commented Aug 13, 2024

Hi @Akshay-A-Kulkarni
I stumbled upon this issue because I was seeing major read-performance problems for files in S3 when using dataset.to_batches.
Your suggestion in the previous comment helped A LOT.
Would you care to explain what it does exactly and why it affects the read time so much?
Thanks!
