sftp_to_s3 stream file option #17609

john-jac · 2021-08-13T20:20:58Z

Adds an option to stream the file directly from the SFTP client to S3 instead of first copying it to a local temporary file. This is required whenever the file is larger than the worker's temporary storage.
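As a rough illustration of the two transfer paths described above (not the PR's actual diff), the sketch below assumes an SFTP client exposing `get(remotepath, localpath)` and `open(filename, mode)` and an S3 client exposing `upload_file` and `upload_fileobj`. The function names and the in-memory `FakeSFTP` / `FakeS3` stand-ins are hypothetical and exist only so the example runs without a network.

```python
import io
import tempfile


def transfer_via_temp_file(sftp_client, s3_client, sftp_path, bucket, key):
    """Download to a local temporary file first, then upload that file."""
    with tempfile.NamedTemporaryFile() as tmp:
        # Needs at least as much local disk space as the remote file is large.
        sftp_client.get(sftp_path, tmp.name)
        s3_client.upload_file(tmp.name, bucket, key)


def transfer_streaming(sftp_client, s3_client, sftp_path, bucket, key):
    """Pass the remote file object straight to the uploader; no local copy."""
    with sftp_client.open(sftp_path, "rb") as remote_file:
        s3_client.upload_fileobj(remote_file, bucket, key)


class FakeSFTP:
    """Hypothetical in-memory stand-in for an SFTP client."""

    def __init__(self, files):
        self.files = files

    def get(self, remotepath, localpath):
        with open(localpath, "wb") as out:
            out.write(self.files[remotepath])

    def open(self, filename, mode="r"):
        return io.BytesIO(self.files[filename])


class FakeS3:
    """Hypothetical in-memory stand-in for an S3 client."""

    def __init__(self):
        self.objects = {}

    def upload_file(self, Filename, Bucket, Key):
        with open(Filename, "rb") as src:
            self.objects[(Bucket, Key)] = src.read()

    def upload_fileobj(self, Fileobj, Bucket, Key):
        self.objects[(Bucket, Key)] = Fileobj.read()


sftp = FakeSFTP({"/data/report.csv": b"a,b\n1,2\n"})
s3 = FakeS3()
transfer_via_temp_file(sftp, s3, "/data/report.csv", "my-bucket", "tmp/report.csv")
transfer_streaming(sftp, s3, "/data/report.csv", "my-bucket", "stream/report.csv")
```

Both paths store the same bytes; the difference is only whether the worker's local disk is used in between.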

fixed self.use_temp_file reference error

potiuk · 2021-08-14T16:03:34Z

Two things:

Static checks are failing. I heartily recommend installing pre-commit to catch these kinds of problems at commit time; it saves a lot of iterating on the change.
Unit tests should be added. We are very keen on getting all new functionality covered by unit tests: any change that adds functionality should have accompanying unit tests to avoid regressions.
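A unit test for the streaming path could look roughly like the sketch below. It is only an assumption about the shape of such a test, not the test added in this PR: `stream_to_s3` is a hypothetical stand-in for the operator's streaming branch, and both clients are `MagicMock` objects.

```python
from unittest import mock


def stream_to_s3(sftp_client, s3_client, sftp_path, bucket, key):
    """Hypothetical stand-in for the operator's streaming branch."""
    with sftp_client.open(sftp_path, "rb") as remote_file:
        s3_client.upload_fileobj(remote_file, bucket, key)


def check_streaming_skips_local_download():
    sftp_client = mock.MagicMock()
    s3_client = mock.MagicMock()

    stream_to_s3(sftp_client, s3_client, "/remote/file.csv", "bucket", "key.csv")

    # The object yielded by the `with` block must be what gets uploaded.
    remote_file = sftp_client.open.return_value.__enter__.return_value
    s3_client.upload_fileobj.assert_called_once_with(remote_file, "bucket", "key.csv")

    # The temp-file path must not be touched when streaming.
    sftp_client.get.assert_not_called()
    s3_client.upload_file.assert_not_called()
    return True


result = check_streaming_skips_local_download()
```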

JavierLopezT · 2021-08-15T11:15:46Z

Is there any advantage to saving the file locally as a temporary file? I am wondering whether it makes sense to just change the way the file is uploaded to S3, without giving the option to store a temporary file on the local system.

potiuk · 2021-08-15T14:00:31Z

> Is there any advantage to saving the file locally as a temporary file? I am wondering whether it makes sense to just change the way the file is uploaded to S3, without giving the option to store a temporary file on the local system.

I think the main reasons are implementation details of upload_fileobj. It is not obvious how the data is buffered while upload_fileobj runs, so memory usage during the operation might be significant. More importantly, from what I see in the description of upload_fileobj, whenever possible it uses multiple threads and uploads the S3 object in parallel, which (I know for a fact) can speed up the S3 upload immensely; this is how S3 upload is designed. However (my guess, but quite likely), this cannot be done if the fileobj does not provide seek(). Looking at how sftp get is implemented, its fileobj does not allow seeking; it can only read the file sequentially (this is how the SFTP protocol works, I believe). It could only provide seek() if it first loaded the entire file into memory, which would not be good for huge files.

So if you have a fast (local network) SFTP connection, downloading the file first and then uploading the local file might significantly speed up the transfer, as upload_fileobj will be able to use multiple threads for the upload. That's mostly an educated guess, but I think it's very likely.
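To make the seek() concern concrete, here is a small standard-library sketch of what "sequential only" means for a file object. It does not show how upload_fileobj or any SFTP client actually behaves; `SequentialOnly` and `iter_parts` are hypothetical names used purely for illustration. A reader that cannot seek can still be consumed in fixed-size parts, but only one after another, and each part has to be held in memory while it is processed.

```python
import io


class SequentialOnly(io.RawIOBase):
    """Read-only stream over some bytes that refuses to seek (illustrative)."""

    def __init__(self, payload):
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        # RawIOBase.read() is built on top of readinto().
        return self._inner.readinto(buffer)


def iter_parts(fileobj, part_size):
    """Yield consecutive chunks of at most part_size bytes using only read()."""
    while True:
        chunk = fileobj.read(part_size)
        if not chunk:
            break
        yield chunk


stream = SequentialOnly(b"x" * 25)
part_sizes = [len(part) for part in iter_parts(stream, part_size=10)]

# Jumping back to the start is not possible on this stream.
try:
    stream.seek(0)
    seek_failed = False
except io.UnsupportedOperation:
    seek_failed = True
```

With a seekable local file, by contrast, independent workers could each open the file and jump to their own offset, which is the kind of parallelism the comment above is speculating about.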

sftp_to_s3 stream file option

ab7e6f6

boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Aug 13, 2021

Update sftp_to_s3.py

0c950f6

fixed self.use_temp_file reference error

subkanthi approved these changes Aug 15, 2021

john-jac added 5 commits September 7, 2021 09:55

Update sftp_to_s3.py

4c8e2c3

Add unit test with use_temp_file = False

43ecd3c

Merge branch 'apache:main' into patch-1

c9dd85b

Update sftp_to_s3.py

58437c7

Update test_sftp_to_s3.py

5d34638

potiuk approved these changes Sep 8, 2021

potiuk merged commit 3fe948a into apache:main Sep 8, 2021

potiuk mentioned this pull request Sep 30, 2021

Status of testing Providers that were prepared on September 30, 2021 #18638

Closed

56 tasks

potiuk mentioned this pull request Oct 8, 2021

Status of testing Amzon Provider 2.3.0rc2 that was prepared on October 08, 2021 #18835

Closed

24 tasks
