Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sftp_to_s3 stream file option #17609

Merged
merged 7 commits into from
Sep 8, 2021
Merged

sftp_to_s3 stream file option #17609

merged 7 commits into from
Sep 8, 2021

Conversation

john-jac
Copy link
Contributor

Adds the option to stream the file directly from sftp client to s3 rather than first copy to a local temporary file. This is required whenever the size of the file exceeds the temporary storage of the worker.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

Adds the option to stream the file directly from sftp client to s3 rather than first copy to a local temporary file.  This is required whenever the size of the file exceeds the temporary storage of the worker.
@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Aug 13, 2021
fixed self.use_temp_file reference error
@potiuk
Copy link
Member

potiuk commented Aug 14, 2021

Two things:

  1. Static checks are failing (i heartily recommend installing pre-commit to catch those kind of checks at commit time (saves a lot on iterating with the change)

  2. Unit tests should be added. We are very keen on getting all the new functionality covered by unit tests - any change that adds some functionality should have accompanying unit-tests to avoid regressions.

@JavierLopezT
Copy link
Contributor

Is there any advantage on saving the file locally in a temporary manner? I am wondering if it makes sense to just change the way it uploads the file to S3 without giving the option to store the temporary file in local system

@potiuk
Copy link
Member

potiuk commented Aug 15, 2021

Is there any advantage on saving the file locally in a temporary manner? I am wondering if it makes sense to just change the way it uploads the file to S3 without giving the option to store the temporary file in local system

I think the main reason are implementation details of the upload_fileobj. It's not really obvious how the data is buffered while upload_fileobj runs so there might be significant memory usage during this operation. But the main reason is that from what I see the description of upload_fileobj, whenever possible it will use multiple threads and upload s3 object in parallel (which - I know for a fact) can speed up the s3 upload immensely (this is how S3 upload is designed). However (my guess but quite likely), this cannot be done if the "fileobj" does not provide "seek()" functionality. Looking how sftp get is implemented, it's fileobj does not allow seek, it can only read the file sequentially (this is how sftp protocol works I believe). It could only provide "seek" if it loaded the file entirely in memory first (but this would not be good for huge files).

So if you have a fast (local network) sftp connection, downloading the file first and then uploading the local file might significantly speed up the transfer, as upload_fileobj will be able to utilise multiple threads to upload. That's moslty educated guess, but I think it's very likely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants