Performance issues when loading data from S3 #36587
Comments
Generally, reading from S3 will usually be much slower than reading from a local file. However, this looks like far too large a slowdown. Would you mind doing a profile to find which part is spending the most time blocking?
Will do the profiling and share the details with you.
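As an illustration only (this snippet is a sketch and was not part of the original exchange; the S3 location is a placeholder), cProfile can be wrapped around the batch loop to show where the time goes:

import cProfile
import pstats
import pyarrow.dataset as ds

dataset = ds.dataset('s3://bucket/prefix')  # placeholder location

profiler = cProfile.Profile()
profiler.enable()
for batch in dataset.scanner().to_batches():
    batch.num_rows
profiler.disable()

# Print the 20 most time-consuming calls, sorted by cumulative time.
# Time spent blocking inside Arrow's native code will mostly show up
# under the batch iterator's next() call.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)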
S3 performance will depend on your connection to the S3 servers. Are you running this test on a local device, or on an EC2 server? If it's a local device, what kind of connection do you have to the internet? 970 seconds for 500 MB of data is very slow; that works out to roughly 4 Mbps, so any decent connection should be faster than this. Can you try setting the environment variable AWS_EC2_METADATA_DISABLED=true and see whether the timing changes?
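For reference, a minimal sketch of setting this variable before PyArrow's S3 subsystem is first used (the variable name comes from the thread; the placement is an assumption):

import os

# Must be set before the S3 filesystem is first used, otherwise the AWS SDK
# may already have tried to contact the EC2 instance metadata endpoint.
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

import pyarrow.dataset as ds  # import pyarrow modules only after setting it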
@westonpace The code is running in a Kubernetes pod.
Are you able to try setting the environment variable I suggested?
I tried setting the env variable AWS_EC2_METADATA_DISABLED=true but I was not able to connect to S3.
I am a little confused then. When any operation is run by the S3 filesystem, the AWS SDK will attempt to determine credentials for that action. Typically this is done by looking in the user's config file (e.g. ~/.aws/config). If this configuration file is not found, it will attempt to contact a special IP address that EC2 machines have configured, which tells the EC2 machine what its configuration is. This attempt to contact that special IP address can be very slow, depending on the network configuration of the machine (sometimes it will spend minutes waiting for a timeout). Setting the AWS_EC2_METADATA_DISABLED variable skips that metadata lookup. Can you add these lines to the top of your script (they must come before you import any other pyarrow module)? This will add additional debugging information that might help us understand what is happening:
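The original snippet was not preserved in this thread; the following is a sketch of enabling trace-level S3 logging, assuming pyarrow.fs.initialize_s3 and pyarrow.fs.S3LogLevel are available in your PyArrow version:

import pyarrow.fs

# Initialize the S3 subsystem with trace-level logging before any other
# pyarrow module touches S3, so every S3 request/response is logged.
pyarrow.fs.initialize_s3(pyarrow.fs.S3LogLevel.Trace)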
Thanks for that info. Let me try that.
I am trying this in my Jupyter notebook. I can see the logs printed now, but there is nothing I can highlight. Is there anything I should look out for? One thing I noticed is that my code snippet never finishes. Also, when I tried AWS_EC2_METADATA_DISABLED=true I got access denied.
I'm interested in the timing. I was looking to see whether the 970s was spent waiting for S3 to respond to requests, spent trying to resolve configuration, or spent in some other way. It should be possible to determine this by correlating the timestamps in the trace logging. You could redirect the log output to a file; one approach is sketched below.
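The approach originally linked here was not preserved; as one possibility (an assumption, not the linked method), the output can be redirected at the file-descriptor level so that trace output from the native C++/AWS SDK layer is captured as well:

import os
import sys

# Redirect stdout and stderr at the OS level so that output emitted by
# native (C++) code is captured too, not just Python-level prints.
# In a notebook it may be simpler to run the script from a terminal and
# redirect with the shell instead.
log_fd = os.open("s3_trace.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.dup2(log_fd, sys.stdout.fileno())
os.dup2(log_fd, sys.stderr.fileno())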
You mentioned that your code is running in a Kubernetes pod. Is this pod on an EC2 instance? Is this perhaps EKS?
So the time is taken when looping through the record batches. I will try to redirect the logs. The code is running on EKS.
@westonpace Would #36765 be the same issue?
I would not expect #36765 to cause this large of a delay.
@shaktishp could you try running this?

import pyarrow.dataset as ds

dataset = ds.dataset(
    'location',
    format=ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(
            pre_buffer=True
        )
    )
)

This may speed up your S3 read time.
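For comparison, a sketch of timing the batch loop over this pre-buffered dataset, mirroring the snippet from the original question (the S3 location is a placeholder, not from the thread):

import time
import pyarrow.dataset as ds

dataset = ds.dataset(
    's3://bucket/prefix',  # placeholder location
    format=ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
    )
)

start = time.time()
total_rows = 0
for batch in dataset.scanner().to_batches():
    total_rows += batch.num_rows
print(f"rows={total_rows} elapsed={time.time() - start:.1f}s")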
Hi @Akshay-A-Kulkarni
Describe the usage question you have. Please include as many useful details as possible.
We are doing a small POC in which we compare the performance of loading Parquet files directly from S3 versus from the local file system.
Jupyter notebook code snippet (the notebook and the S3 bucket are in the same region):
import pyarrow.dataset as ds
import time

s3Dataset = ds.dataset('location')  # s3 or local file system
scanner = s3Dataset.scanner()
batches = scanner.to_batches()
st = time.time()
for batch in batches:
    batch.num_rows
et = time.time()
elapsed = et - st
code ends here
Total files: 50, total size: 500 MB
Time taken to read from S3: 970 sec
Time taken to read from the local file system: 5 sec
Any insight into why S3 is taking so much time?
Are there any settings we are missing when reading files from S3 that would give decent performance?
Any help is appreciated.
Component(s)
FlightRPC, Parquet, Python, Other