Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python]The initialization of the S3FileSystem takes a long time. How to reduce the time? #37136

Closed
infzo opened this issue Aug 12, 2023 · 14 comments
Labels
Component: Python Type: usage Issue is a user question

Comments

@infzo
Copy link

infzo commented Aug 12, 2023

Describe the usage question you have. Please include as many useful details as possible.

The initialization of the S3FileSystem takes a long time. How to reduce the time?

from pyarrow import fs

import time

start = time.time()
obs_fs = fs.S3FileSystem(access_key="ak", secret_key="sk", endpoint_override="ep")
print(f'S3FileSystem cost {time.time() - start:.6f} s')

start = time.time()
obs_fs = fs.S3FileSystem(access_key="ak", secret_key="sk", endpoint_override="ep")
print(f'S3FileSystem cost {time.time() - start:.6f} s')

start = time.time()
obs_fs = fs.S3FileSystem(access_key="ak", secret_key="sk", endpoint_override="ep")
print(f'S3FileSystem cost {time.time() - start:.6f} s')


(base) [root@]# python main.py
S3FileSystem cost 6.009823 s
S3FileSystem cost 2.005619 s
S3FileSystem cost 2.005571 s
···

### Component(s)

Python
@infzo infzo added the Type: usage Issue is a user question label Aug 12, 2023
@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

Would you mind set log level and see the logging of s3? I don't know how time spends here, need more log to see why initialize is slow

@infzo
Copy link
Author

infzo commented Aug 12, 2023

Would you mind set log level and see the logging of s3? I don't know how time spends here, need more log to see why initialize is slow

I tried it, and it seems that the retry connection timed out: 169.254.169.254:80.

start = time.time()
pyarrow.fs.initialize_s3(fs.S3LogLevel.Debug)
obs_fs = fs.S3FileSystem(access_key="ak", secret_key="sk", endpoint_override="ep")
pyarrow.fs.finalize_s3()
print(f'S3FileSystem cost {time.time() - start:.6f} s')



[DEBUG] 2023-08-12 07:59:36.598 task-scheduler [140450938611456] id=0x7fbd40000b80: Scheduling epoll_event_loop_unsubscribe_cleanup task for immediate execution
[DEBUG] 2023-08-12 07:59:36.598 task-scheduler [140444465874688] id=0x7fbd40000b80: Running epoll_event_loop_unsubscribe_cleanup task with <Canceled> status
[DEBUG] 2023-08-12 07:59:36.623 HttpClientFactory [140463560197952] Initializing Http Static State
[DEBUG] 2023-08-12 07:59:36.623 HttpClientFactory [140463560197952] Initializing Curl Http Client
[DEBUG] 2023-08-12 07:59:36.624 ClientConfiguration [140463560197952] ClientConfiguration will use SDK Auto Resolved profile: [default] if not specified by users.
[WARN] 2023-08-12 07:59:36.624 ClientConfiguration [140463560197952] Retry Strategy will use the default max attempts.
[DEBUG] 2023-08-12 07:59:36.624 CurlHandleContainer [140463560197952] Attempting to acquire curl connection.
[DEBUG] 2023-08-12 07:59:36.624 CurlHandleContainer [140463560197952] No current connections available in pool. Attempting to create new connections.
[DEBUG] 2023-08-12 07:59:36.624 CurlHandleContainer [140463560197952] attempting to grow pool size by 2
[INFO] 2023-08-12 07:59:36.624 CurlHandleContainer [140463560197952] Pool grown by 2
[INFO] 2023-08-12 07:59:36.624 CurlHandleContainer [140463560197952] Connection has been released. Continuing.
[DEBUG] 2023-08-12 07:59:36.624 CurlHandleContainer [140463560197952] Returning connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:36.624 CurlHttpClient [140463560197952] Obtained connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:36.624 CURL [140463560197952] (Text)   Trying 169.254.169.254:80...

[DEBUG] 2023-08-12 07:59:37.625 CURL [140463560197952] (Text) ipv4 connect timeout after 1000ms, move on!

[DEBUG] 2023-08-12 07:59:37.625 CURL [140463560197952] (Text) Failed to connect to 169.254.169.254 port 80 after 1001 ms: Timeout was reached

[DEBUG] 2023-08-12 07:59:37.625 CURL [140463560197952] (Text) Closing connection 0

[ERROR] 2023-08-12 07:59:37.625 CurlHttpClient [140463560197952] Curl returned error code 28 - Timeout was reached
[DEBUG] 2023-08-12 07:59:37.625 CurlHandleContainer [140463560197952] Destroy curl handle: 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:37.625 CurlHandleContainer [140463560197952] Created replacement handle and released to pool: 0x55ffc94fe330
[ERROR] 2023-08-12 07:59:37.625 EC2MetadataClient [140463560197952] Http request to retrieve credentials failed
[WARN] 2023-08-12 07:59:37.625 EC2MetadataClient [140463560197952] Request failed, now waiting 0 ms before attempting again.
[DEBUG] 2023-08-12 07:59:37.625 CurlHandleContainer [140463560197952] Attempting to acquire curl connection.
[INFO] 2023-08-12 07:59:37.625 CurlHandleContainer [140463560197952] Connection has been released. Continuing.
[DEBUG] 2023-08-12 07:59:37.625 CurlHandleContainer [140463560197952] Returning connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:37.625 CurlHttpClient [140463560197952] Obtained connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:37.625 CURL [140463560197952] (Text)   Trying 169.254.169.254:80...

[DEBUG] 2023-08-12 07:59:38.626 CURL [140463560197952] (Text) ipv4 connect timeout after 1000ms, move on!

[DEBUG] 2023-08-12 07:59:38.626 CURL [140463560197952] (Text) Failed to connect to 169.254.169.254 port 80 after 1001 ms: Timeout was reached

[DEBUG] 2023-08-12 07:59:38.626 CURL [140463560197952] (Text) Closing connection 0

[ERROR] 2023-08-12 07:59:38.626 CurlHttpClient [140463560197952] Curl returned error code 28 - Timeout was reached
[DEBUG] 2023-08-12 07:59:38.627 CurlHandleContainer [140463560197952] Destroy curl handle: 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:38.627 CurlHandleContainer [140463560197952] Created replacement handle and released to pool: 0x55ffc94fe330
[ERROR] 2023-08-12 07:59:38.627 EC2MetadataClient [140463560197952] Http request to retrieve credentials failed
[ERROR] 2023-08-12 07:59:38.627 EC2MetadataClient [140463560197952] Can not retrieve resource from http://169.254.169.254/latest/api/token
[DEBUG] 2023-08-12 07:59:38.627 CurlHandleContainer [140463560197952] Attempting to acquire curl connection.
[INFO] 2023-08-12 07:59:38.627 CurlHandleContainer [140463560197952] Connection has been released. Continuing.
[DEBUG] 2023-08-12 07:59:38.627 CurlHandleContainer [140463560197952] Returning connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:38.627 CurlHttpClient [140463560197952] Obtained connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:38.627 CURL [140463560197952] (Text)   Trying 169.254.169.254:80...

[DEBUG] 2023-08-12 07:59:39.628 CURL [140463560197952] (Text) ipv4 connect timeout after 1000ms, move on!

[DEBUG] 2023-08-12 07:59:39.628 CURL [140463560197952] (Text) Failed to connect to 169.254.169.254 port 80 after 1001 ms: Timeout was reached

[DEBUG] 2023-08-12 07:59:39.628 CURL [140463560197952] (Text) Closing connection 0

[ERROR] 2023-08-12 07:59:39.628 CurlHttpClient [140463560197952] Curl returned error code 28 - Timeout was reached
[DEBUG] 2023-08-12 07:59:39.628 CurlHandleContainer [140463560197952] Destroy curl handle: 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:39.628 CurlHandleContainer [140463560197952] Created replacement handle and released to pool: 0x55ffc94fe330
[ERROR] 2023-08-12 07:59:39.628 EC2MetadataClient [140463560197952] Http request to retrieve credentials failed
[WARN] 2023-08-12 07:59:39.628 EC2MetadataClient [140463560197952] Request failed, now waiting 0 ms before attempting again.
[DEBUG] 2023-08-12 07:59:39.628 CurlHandleContainer [140463560197952] Attempting to acquire curl connection.
[INFO] 2023-08-12 07:59:39.628 CurlHandleContainer [140463560197952] Connection has been released. Continuing.
[DEBUG] 2023-08-12 07:59:39.628 CurlHandleContainer [140463560197952] Returning connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:39.628 CurlHttpClient [140463560197952] Obtained connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:39.628 CURL [140463560197952] (Text)   Trying 169.254.169.254:80...

[DEBUG] 2023-08-12 07:59:40.629 CURL [140463560197952] (Text) ipv4 connect timeout after 1000ms, move on!

[DEBUG] 2023-08-12 07:59:40.629 CURL [140463560197952] (Text) Failed to connect to 169.254.169.254 port 80 after 1001 ms: Timeout was reached

[DEBUG] 2023-08-12 07:59:40.629 CURL [140463560197952] (Text) Closing connection 0

[ERROR] 2023-08-12 07:59:40.629 CurlHttpClient [140463560197952] Curl returned error code 28 - Timeout was reached
[DEBUG] 2023-08-12 07:59:40.629 CurlHandleContainer [140463560197952] Destroy curl handle: 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:40.629 CurlHandleContainer [140463560197952] Created replacement handle and released to pool: 0x55ffc94fe330
[ERROR] 2023-08-12 07:59:40.629 EC2MetadataClient [140463560197952] Http request to retrieve credentials failed
[ERROR] 2023-08-12 07:59:40.629 EC2MetadataClient [140463560197952] Can not retrieve resource from http://169.254.169.254/latest/meta-data/iam/security-credentials
[DEBUG] 2023-08-12 07:59:40.629 CurlHandleContainer [140463560197952] Attempting to acquire curl connection.
[INFO] 2023-08-12 07:59:40.629 CurlHandleContainer [140463560197952] Connection has been released. Continuing.
[DEBUG] 2023-08-12 07:59:40.629 CurlHandleContainer [140463560197952] Returning connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:40.629 CurlHttpClient [140463560197952] Obtained connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:40.629 CURL [140463560197952] (Text)   Trying 169.254.169.254:80...

[DEBUG] 2023-08-12 07:59:41.631 CURL [140463560197952] (Text) ipv4 connect timeout after 1000ms, move on!

[DEBUG] 2023-08-12 07:59:41.631 CURL [140463560197952] (Text) Failed to connect to 169.254.169.254 port 80 after 1001 ms: Timeout was reached

[DEBUG] 2023-08-12 07:59:41.631 CURL [140463560197952] (Text) Closing connection 0

[ERROR] 2023-08-12 07:59:41.631 CurlHttpClient [140463560197952] Curl returned error code 28 - Timeout was reached
[DEBUG] 2023-08-12 07:59:41.631 CurlHandleContainer [140463560197952] Destroy curl handle: 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:41.631 CurlHandleContainer [140463560197952] Created replacement handle and released to pool: 0x55ffc94fe330
[ERROR] 2023-08-12 07:59:41.631 EC2MetadataClient [140463560197952] Http request to retrieve credentials failed
[WARN] 2023-08-12 07:59:41.631 EC2MetadataClient [140463560197952] Request failed, now waiting 0 ms before attempting again.
[DEBUG] 2023-08-12 07:59:41.631 CurlHandleContainer [140463560197952] Attempting to acquire curl connection.
[INFO] 2023-08-12 07:59:41.631 CurlHandleContainer [140463560197952] Connection has been released. Continuing.
[DEBUG] 2023-08-12 07:59:41.631 CurlHandleContainer [140463560197952] Returning connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:41.631 CurlHttpClient [140463560197952] Obtained connection handle 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:41.631 CURL [140463560197952] (Text)   Trying 169.254.169.254:80...

[DEBUG] 2023-08-12 07:59:42.632 CURL [140463560197952] (Text) ipv4 connect timeout after 1000ms, move on!

[DEBUG] 2023-08-12 07:59:42.632 CURL [140463560197952] (Text) Failed to connect to 169.254.169.254 port 80 after 1001 ms: Timeout was reached

[DEBUG] 2023-08-12 07:59:42.632 CURL [140463560197952] (Text) Closing connection 0

[ERROR] 2023-08-12 07:59:42.632 CurlHttpClient [140463560197952] Curl returned error code 28 - Timeout was reached
[DEBUG] 2023-08-12 07:59:42.632 CurlHandleContainer [140463560197952] Destroy curl handle: 0x55ffc94fe330
[DEBUG] 2023-08-12 07:59:42.632 CurlHandleContainer [140463560197952] Created replacement handle and released to pool: 0x55ffc94fe330
[ERROR] 2023-08-12 07:59:42.632 EC2MetadataClient [140463560197952] Http request to retrieve credentials failed
[ERROR] 2023-08-12 07:59:42.632 EC2MetadataClient [140463560197952] Can not retrieve resource from http://169.254.169.254/latest/meta-data/placement/availability-zone
[INFO] 2023-08-12 07:59:42.632 EC2MetadataClient [140463560197952] Unable to pull region from instance metadata service
[INFO] 2023-08-12 07:59:42.632 CurlHandleContainer [140463560197952] Initializing CurlHandleContainer with size 25

S3FileSystem cost 6.046947 s

@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

Seems that "connect timeout" is related to network environment...Maybe you can enlarge timeout here...

@infzo
Copy link
Author

infzo commented Aug 12, 2023

Seems that "connect timeout" is related to network environment...Maybe you can enlarge timeout here...

In the S3FileSystem interface, there is no timeout option. How to do this?

@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

/// Options for the S3FileSystem implementation.
struct ARROW_EXPORT S3Options {
  /// \brief AWS region to connect to.
  ///
  /// If unset, the AWS SDK will choose a default value.  The exact algorithm
  /// depends on the SDK version.  Before 1.8, the default is hardcoded
  /// to "us-east-1".  Since 1.8, several heuristics are used to determine
  /// the region (environment variables, configuration profile, EC2 metadata
  /// server).
  std::string region;

  /// \brief Socket connection timeout, in seconds
  ///
  /// If negative, the AWS SDK default value is used (typically 1 second).
  double connect_timeout = -1;

I'm not familiar with Python code, maybe you could search S3Options. Maybe you can take a look at: https://github.com/apache/arrow/blob/main/python/pyarrow/_s3fs.pyx#L202

By the way, maybe you can switch to a better network env...

@infzo
Copy link
Author

infzo commented Aug 12, 2023

/// Options for the S3FileSystem implementation.
struct ARROW_EXPORT S3Options {
  /// \brief AWS region to connect to.
  ///
  /// If unset, the AWS SDK will choose a default value.  The exact algorithm
  /// depends on the SDK version.  Before 1.8, the default is hardcoded
  /// to "us-east-1".  Since 1.8, several heuristics are used to determine
  /// the region (environment variables, configuration profile, EC2 metadata
  /// server).
  std::string region;

  /// \brief Socket connection timeout, in seconds
  ///
  /// If negative, the AWS SDK default value is used (typically 1 second).
  double connect_timeout = -1;

I'm not familiar with Python code, maybe you could search S3Options. Maybe you can take a look at: https://github.com/apache/arrow/blob/main/python/pyarrow/_s3fs.pyx#L202

By the way, maybe you can switch to a better network env...

It's my oversight, I'm using arrow 9.0.0, and there's no timeout option yet.

@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

No idea how to solve it in 9.0.0, maybe you can upgrade or patch the timeout option to your code :-)

@infzo
Copy link
Author

infzo commented Aug 12, 2023

/// Options for the S3FileSystem implementation.
struct ARROW_EXPORT S3Options {
  /// \brief AWS region to connect to.
  ///
  /// If unset, the AWS SDK will choose a default value.  The exact algorithm
  /// depends on the SDK version.  Before 1.8, the default is hardcoded
  /// to "us-east-1".  Since 1.8, several heuristics are used to determine
  /// the region (environment variables, configuration profile, EC2 metadata
  /// server).
  std::string region;

  /// \brief Socket connection timeout, in seconds
  ///
  /// If negative, the AWS SDK default value is used (typically 1 second).
  double connect_timeout = -1;

I'm not familiar with Python code, maybe you could search S3Options. Maybe you can take a look at: https://github.com/apache/arrow/blob/main/python/pyarrow/_s3fs.pyx#L202

By the way, maybe you can switch to a better network env...

In addition, I guess it's not the network environment, because reading obs' parquet is successful. There is a high probability that the OBS service provider is not AWS.

@infzo
Copy link
Author

infzo commented Aug 12, 2023

No idea how to solve it in 9.0.0, maybe you can upgrade or patch the timeout option to your code :-)

Thank you! At least I found out why. I'll see if there's another way to fix it.

@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

Maybe I've heard of this problem before, would you also like to take a look at: #36587 (comment) ?

You can also check the object store provides configurations for help.

@infzo
Copy link
Author

infzo commented Aug 12, 2023

Maybe I've heard of this problem before, would you also like to take a look at: #36587 (comment) ?

You can also check the object store provides configurations for help.

It's amazing that this environment variable works.

export AWS_EC2_METADATA_DISABLED=true

S3FileSystem cost 0.002158 s
S3FileSystem cost 0.002582 s

Thank you very much!

@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

Close as solved

@mapleFU mapleFU closed this as completed Aug 12, 2023
@mapleFU
Copy link
Member

mapleFU commented Aug 12, 2023

 DefaultAWSCredentialsProviderChain does not query (inaccessible) EC2 metadata endpoint.

Seems it would query endpoint for that . Also see aws/aws-cli#5623

@infzo
Copy link
Author

infzo commented Aug 12, 2023

 DefaultAWSCredentialsProviderChain does not query (inaccessible) EC2 metadata endpoint.

Seems it would query endpoint for that . Also see aws/aws-cli#5623

ok! It seems that explicitly specifying the region can also circumvent the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Python Type: usage Issue is a user question
Projects
None yet
Development

No branches or pull requests

2 participants