Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAG for generating and inserting Rekognition labels #4836

Merged
merged 29 commits into from
Sep 4, 2024

Conversation

AetherUnbound
Copy link
Collaborator

Fixes

Fixes #4645 by @AetherUnbound

Description

This PR adds a DAG which will parse and insert Rekognition labels into the catalog database. See the issue above for motivation and steps, and the generated DAG documentation for a description of the DAG. In short, this DAG iterates over the contents of the labels file in chunks, and inserts the records into a temporary table. The position in the file is tracked, and parsing can be resumed if any failures occur. Once the records are in the temporary table, a batched_update run is triggered to upsert the tags from the temporary table into the base image table.

In order to make this possible, I've made a few changes to other parts of the project:

  • This new DAG is going under a new header, data_augmentation, where I imagine other augmentation DAGs will go in the future (e.g. the dead link pipeline)
  • A small change to the load_to_s3_entrypoint.sh to ensure that the test files get loaded in properly once Minio is initialized
  • The addition of a new dependency, smart_open, to facilitate seamless chunked reading from S3
  • The addition of a subset of labels (200 records) which can be used for testing and local runs
  • Adding an additional_where parameter to the batched_update DAG in order to facilitate reading from a separate table as part of the insert
  • Skip spelling checks on our test S3 data

Warning

Due to the addition of a new dependency, the catalog should be deployed immediately after this DAG is merged.

image

Testing Instructions

  1. Run ov just down -v to clear everything
  2. Add the following values to your catalog/.env file:
# S3 prefix location for the Rekognition dataset
AIRFLOW_VAR_REKOGNITION_DATASET_PREFIX=image_analysis_labels.jsonl
# In-memory buffer size for processing Rekognition data
AIRFLOW_VAR_REKOGNITION_MEMORY_BUFFER_SIZE=25
# File buffer size for reading Rekognition data from S3
AIRFLOW_VAR_REKOGNITION_FILE_BUFFER_SIZE=16384
  1. Run ov just api/init to initialize the catalog database with data
  2. Run ov just c to start the catalog services (bonus: run ov just logs load_to_s3 to verify the new file gets loaded into Minio)
  3. Enable the batched_update DAG
  4. Trigger a run of add_rekognition_labels and wait for it to complete
  5. Verify that both DAGs have run successfully
  6. Check the notify_parse_complete logs to ensure 200 were loaded and 1 was skipped
  7. Run ov just catalog/pgcli then \x on (to show full results) then SELECT identifier, tags FROM image WHERE identifier = 'b840de61-fb9d-4ec5-9572-8d778875869f'; - both Flickr and Rekognition tags should show up for this result
  8. Bonus: Data refresh tests
    1. Trigger a run of the image_data_refresh DAG by enabling it. Once that is complete, the Rekognition results should NOT show up in the API due to our wholesale filtering of Rekognition (e.g. http://localhost:50280/v1/images/b840de61-fb9d-4ec5-9572-8d778875869f/ should not have any Rekognition results)
    2. Remove the filtered provider here:
      FILTERED_TAG_PROVIDERS = {"rekognition"}
      and run the previous step again. You should now see the Rekognition tags!
  9. Bonus: Set the SLACK_MESSAGE_OVERRIDE Variable to true and run the add_rekognition_labels DAG again to see the Slack messages that would be generated from a given run.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (ov just catalog/generate-docs media-props
    for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested review from a team as code owners August 29, 2024 15:10
@AetherUnbound AetherUnbound removed the request for review from a team August 29, 2024 15:10
@AetherUnbound AetherUnbound requested review from obulat, dhruvkb, krysal and stacimc and removed request for a team August 29, 2024 15:10
@openverse-bot openverse-bot added 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels Aug 29, 2024
Copy link

Full-stack documentation: https://docs.openverse.org/_preview/4836

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic, super clear and I really love the way you've handled the variables here, something for me to think about! This tested well for me; I tried all the bonus testing instructions and got to see the tags excluded from the data refresh and then included once the provider was removed from the filtering. I also tested throwing an error midway through and then retriggering the DAG, worked great 🥳 My only blocking request is the comment about accurately reporting the failure count.



@task
def run_sql(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I'm also pulling this out in #4833 in the same way, perfect 😄

catalog/dags/data_augmentation/rekognition/constants.py Outdated Show resolved Hide resolved
# Templated variables for the DAG
TEMPLATE_S3_PREFIX = (
"{{ var.value.get('REKOGNITION_DATASET_PREFIX', '%s') }}" % S3_FILE_PREFIX # noqa: UP031
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really fantastic, I love how you've handled the variables per-environment and defaults here ✨

@AetherUnbound
Copy link
Collaborator Author

@stacimc I've included a total failed count and addressed the other issues you brought up - hopefully everything looks good now!

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! 🚢

@AetherUnbound AetherUnbound merged commit 4147944 into main Sep 4, 2024
45 checks passed
@AetherUnbound AetherUnbound deleted the feature/rekognition-ingestion-dag branch September 4, 2024 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Create a DAG to generate and insert the new Rekognition tags
3 participants