Create S3 client for smart_open from session #4886
Merged
Problem
When I kicked off the Rekognition DAG in production, we saw the following exception:
Upon investigating the internals of the S3Hook, it looked like the get_client_type call was explicitly setting the endpoint_url for the s3 client to s3.amazonaws.com, and something (either smart_open or boto3 itself) was then tacking on another s3 after the bucket name: https://github.com/apache/airflow/blob/1da4b146e954f78280dbe7bbbef452d58c8f728c/airflow/providers/amazon/aws/hooks/base_aws.py#L704

We didn't identify this sooner because we weren't using the production URL, and we didn't have any trouble interacting with MinIO on this front.
Description
This PR fixes this issue by using the S3Hook::get_session method and creating the s3 client from that instead, without explicitly setting any extra defaults. I tested this locally with the production S3 file: I was able to reproduce the original error, then verify that this change fixed the issue.

In local testing I also ran into an issue where there appeared to be multiple labels for a single ID.
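To illustrate the replace-on-conflict behaviour, here's a minimal stdlib sketch using sqlite3's INSERT OR REPLACE (the DAG itself goes through the insert SQL's replace=True flag; the table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (image_id TEXT PRIMARY KEY, label TEXT)")

# A plain INSERT would raise an IntegrityError on the duplicate image_id;
# INSERT OR REPLACE swaps the conflicting row out wholesale instead.
conn.execute("INSERT OR REPLACE INTO labels VALUES (?, ?)", ("img-1", "Cat"))
conn.execute("INSERT OR REPLACE INTO labels VALUES (?, ?)", ("img-1", "Dog"))

print(conn.execute("SELECT label FROM labels WHERE image_id = 'img-1'").fetchone())
# → ('Dog',)
```

The second insert silently wins, which matches the "replaced wholesale rather than raising an exception" behaviour described below.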
I've changed the insert SQL to use replace=True so that these conflicts are just replaced wholesale rather than raising an exception.

Testing Instructions
Maintainers only.

1. Set up a connection in your .env which points to production (I had mine set to AIRFLOW_CONN_AWS_PROD).
2. Don't set a value in .env for AIRFLOW_VAR_REKOGNITION_DATASET_PREFIX (so the prod default is used).
3. Use "aws_prod" rather than AWS_CONN_ID.
4. Run the add_rekognition_labels DAG - if you're on main, this should fail with the above exception; if you're on this branch, it should pass and processing should continue! (You can also add an if total_processed > 2000: break line in the while loop to prevent processing from running away.)

Checklist
- My pull request has a descriptive title (not a vague title like Update index.md).
- My pull request targets the default branch of the repository (main) or a parent feature branch.
- I ran the DAG documentation generator (ov just catalog/generate-docs for catalog PRs) or the media properties generator (ov just catalog/generate-docs media-props for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin