Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support additional configuration parameters for S3 transfer #4359

Closed
sserdyukov opened this issue Aug 7, 2020 · 5 comments · Fixed by #5378
Closed

Support additional configuration parameters for S3 transfer #4359

sserdyukov opened this issue Aug 7, 2020 · 5 comments · Fixed by #5378
Assignees
Labels
feature request Requesting a new feature help wanted p2-medium Medium priority, should be done, but less important

Comments

@sserdyukov
Copy link

sserdyukov commented Aug 7, 2020

Problem
When using S3 as remote for DVC, boto3 reads also the configuration file ~/.aws/config typically used by the original AWS CLI tool. Not all configuration parameters from this file have an effect in DVC at the moment. For instance, the last line multipart_threshold = 500MB will be skipped by DVC while the previous line signature_version = s3 will be applied when pushing data into S3:

[default]
output = json
s3 =
  signature_version = s3
  multipart_threshold = 500MB

Suggestion
Suggestion is to support optional yet useful parameters from ~/.aws/config as per AWS CLI S3 documentation to control S3 transfers for performance reasons or to unlock the advanced configuration capabilities.

These parameters should be passed to TransferConfig:

  • max_concurrent_requests - The maximum number of concurrent requests.
  • max_queue_size - The maximum number of tasks in the task queue.
  • multipart_threshold - The size threshold the CLI uses for multipart transfers of individual files.
  • multipart_chunksize - When using multipart transfers, this is the chunk size that the CLI uses for multipart transfers of individual files.
  • max_bandwidth - The maximum bandwidth that will be consumed for uploading and downloading data to and from Amazon S3.

If it is reasonable and doable - support also the following set of parameters:

  • use_accelerate_endpoint - Use the Amazon S3 Accelerate endpoint for all s3 and s3api commands. You must first enable S3 Accelerate on your bucket before attempting to use the endpoint. This is mutually exclusive with the use_dualstack_endpoint option.
  • use_dualstack_endpoint - Use the Amazon S3 dual IPv4 / IPv6 endpoint for all s3 and s3api commands. This is mutually exclusive with the use_accelerate_endpoint option.
  • addressing_style - Specifies which addressing style to use. This controls if the bucket name is in the hostname or part of the URL. Value values are: path, virtual, and auto. The default value is auto.
  • payload_signing_enabled - Refers to whether or not to SHA256 sign sigv4 payloads. By default, this is disabled for streaming uploads (UploadPart and PutObject) when using https.

These values must be set under the top level s3 key in the AWS Config File, which has a default location of ~/.aws/config. Below is an example configuration:

[profile development]
aws_access_key_id=foo
aws_secret_access_key=bar
s3 =
  max_concurrent_requests = 20
  max_queue_size = 10000
  multipart_threshold = 64MB
  multipart_chunksize = 16MB
  max_bandwidth = 50MB/s
  use_accelerate_endpoint = true
  addressing_style = path
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Aug 7, 2020
@efiop efiop added feature request Requesting a new feature help wanted p3-nice-to-have It should be done this or next sprint labels Aug 8, 2020
@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p3-nice-to-have It should be done this or next sprint labels Aug 28, 2020
@efiop
Copy link
Contributor

efiop commented Aug 28, 2020

For the record: seems like those options are handled by aws-cli specifically, and not by the boto3 itself: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html . This will require parsing the config ourselves and applying the options where needed, inspired by aws-cli(need to look into aws-cli source to make it match). All the code related to s3 and s3 configs is in https://github.com/iterative/dvc/blob/master/dvc/tree/s3.py.

@efiop efiop removed the triage Needs to be triaged label Aug 28, 2020
@isidentical
Copy link
Contributor

Another case where having this would help: https://discord.com/channels/485586884165107732/485596304961962003/804679936886571019 (when the individual files are well over 10GB, and boto3 tries to chunk them in little parts)

@isidentical
Copy link
Contributor

@efiop I think at least we can support a subset of these variables, the stuff that goes into a TransferConfig (multipart_chunk_size, multipart_threshold etc) which is the real deal since it is causing people trouble when trying to upload huge files. What do you think?

@efiop efiop assigned efiop and isidentical and unassigned efiop Jan 29, 2021
@BMarcin
Copy link

BMarcin commented Jan 29, 2021

According to the issue from discord. I'm using S3 compatibile API, so could You also include those parameters (multipart_chunk_size, multipart_threshold) in .dvc/config? In that way .dvc/config would override the values saved in ~/.aws/config.

@isidentical
Copy link
Contributor

According to the issue from discord. I'm using S3 compatibile API, so could You also include those parameters (multipart_chunk_size, multipart_threshold) in .dvc/config? In that way .dvc/config would override the values saved in ~/.aws/config.

@efiop what do you about this? Should we only support ~/.aws/config to specific configurations for s3 or have the exact same configuration options in both ~/.dvc/config and ~/.aws/config?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requesting a new feature help wanted p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants