This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add DAG to report reported media pending review #513

Merged
stacimc merged 10 commits into main from feature/image-report-report-dag
Jun 6, 2022

Conversation

stacimc
Contributor

@stacimc stacimc commented May 18, 2022

Fixes

Fixes WordPress/openverse#1659

Description

Adds a report_pending_reported_media DAG that runs weekly and, for each media type:

  • checks to see if there are any reported records in the pending_review status, meaning they require manual review through the Django Admin
  • reports the number of distinct records needing review to Slack, as well as a breakdown of the reasons for reporting (dmca, mature, other) and links to the admin page for that media
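As a rough illustration of the counting step described above, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the API database. The `nsfw_reports` table name comes from this PR; the column names (`identifier`, `reason`, `status`), the `reviewed` status value, and the `get_pending_counts` helper are assumptions made for the example, not the DAG's actual code.

```python
import sqlite3

# The image table name appears in this PR; everything else here is illustrative.
REPORTS_TABLES = {"image": "nsfw_reports"}


def get_pending_counts(conn, table):
    """Return {reason: number of distinct pending records} for one reports table."""
    query = f"""
        SELECT reason, COUNT(DISTINCT identifier)
        FROM {table}
        WHERE status = 'pending_review'
        GROUP BY reason
    """
    return dict(conn.execute(query).fetchall())


# Tiny in-memory stand-in for the API database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nsfw_reports (identifier TEXT, reason TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO nsfw_reports VALUES (?, ?, ?)",
    [
        ("img-1", "mature", "pending_review"),
        ("img-1", "mature", "pending_review"),  # duplicate report, counted once
        ("img-2", "dmca", "pending_review"),
        ("img-3", "other", "reviewed"),  # hypothetical non-pending status, ignored
    ],
)

counts = {
    media_type: get_pending_counts(conn, table)
    for media_type, table in REPORTS_TABLES.items()
}
print(counts)
```

Counting `DISTINCT identifier` is what keeps several reports against the same record from inflating the total.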

Additional Notes

  • I verified that if a single image has multiple reports in pending_review associated with it, updating any one of these records also updates all the other records associated with that image. (This is why I think it's best to report distinct images requiring review.)
  • When a new media type is added, the name of its reports table must be added to the configuration.

Testing Instructions

This requires a connection to the API DB (rather than the Catalog), so you'll need to create the connection env variables. You can use the values set in env.template in this PR. Make sure you also have the API running locally. To test the Slack messages, you can create the Airflow variable slack_message_override and set it to true.

Run the report_pending_reported_media DAG locally. If you're running sample data, there should be no reports and you should see the No records to report message.

To create sample data to test the report count, you can report some images and audio (locally!) through the API. I went to localhost:8000/v1/images and grabbed the id for a couple of the listed images. For each, I then navigated to localhost:8000/v1/images/<id>/report and POSTed a report. You can verify that these went through by checking http://localhost:8000/admin/api/imagereport/. Likewise for audio. Report a mix of different reasons (some 'mature', some 'other', etc.).
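If you would rather script the sample reports than click through the browsable API, something like the following sketch could work. The URL shape (media type before the id), the `reason` and `description` payload fields, and the `build_report_request` helper are assumptions for illustration; check them against your local API before relying on them.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # local API, as in the testing instructions


def build_report_request(media_type, identifier, reason, description=""):
    """Build (but do not send) a POST request reporting one record."""
    # Assumed URL shape and payload fields; verify against the local API.
    url = f"{API_BASE}/{media_type}/{identifier}/report/"
    payload = json.dumps({"reason": reason, "description": description}).encode()
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_report_request("images", "some-image-id", "mature")
print(req.get_method(), req.full_url)
# To actually send it against a running local API:
# urllib.request.urlopen(req)
```

The request is only built here, not sent, so the snippet is safe to run without the API up.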

Run the DAG again and make sure the correct numbers are reported. I also played around with manually reviewing some of the media reports (updating their status to something other than pending_review in the Admin) and verifying that they were no longer included in the count. I also reported the same records multiple times and verified that each is counted only once in the output.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟩 priority: low Low priority and doesn't need to be rushed 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository 🐍 tech: python Requires familiarity with Python 🔧 tech: airflow Requires familiarity with Apache Airflow labels May 18, 2022
@stacimc stacimc self-assigned this May 18, 2022
@stacimc stacimc changed the title Add DAG to report image reports pending review Add DAG to report reported media pending review May 19, 2022
@stacimc stacimc marked this pull request as ready for review May 20, 2022 22:52
@stacimc stacimc requested a review from a team as a code owner May 20, 2022 22:52
Contributor

@obulat obulat left a comment

It works great! And will be really useful for all of us.

Contributor

@sarayourfriend sarayourfriend left a comment

This LGTM and is undoubtedly necessary for us to have.

However, what is actionable about this information? Do we have documentation about how to handle these reports or guidelines on expectations (like SLAs)?

Until we have those things I don't think they should go into the alerts channel, which should ideally only include actionable information. That can either mean not enabling the DAG until we've developed those processes or temporarily putting them into our notifications channel until they're actually actionable.

Comment on lines +91 to +92
report_counts_by_media_type: dict containing report counts per media type, per
                             report reason
Contributor

It seems like something went wonky on the formatting for this comment.

Suggested change
report_counts_by_media_type: dict containing report counts per media type, per
                             report reason
report_counts_by_media_type: dict containing report counts per media type, per
report reason

is this what it's meant to be, maybe?

Contributor Author

I did mean it to look like this -- generally when the description goes onto multiple lines, maintaining the alignment makes it more readable (for me at least). I'd say it's pretty necessary if there are multiple arguments (example), although maybe it looks strange when there's only one 🤷‍♀️

tests/dags/database/test_report_pending_reported_media.py (outdated, resolved)
Co-authored-by: sarayourfriend <[email protected]>
Contributor

@AetherUnbound AetherUnbound left a comment

Fantastic! I can't wait!

logger = logging.getLogger(__name__)

DAG_ID = "report_pending_reported_media"
DB_CONN_ID = os.getenv("OPENLEDGER_API_CONN_ID", "postgres_openledger_api")
Contributor

Note: I made an issue to tackle naming conventions RE: openledger #535

DB_CONN_ID = os.getenv("OPENLEDGER_API_CONN_ID", "postgres_openledger_api")

REPORTS_TABLES = {
    "image": "nsfw_reports",
Contributor

Note: I made an issue to normalize these table names so we can generate this as f"nsfw_reports_{media_type}" in the future WordPress/openverse-api#719

Contributor Author

💯 Thanks for creating an issue!
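A minimal sketch of what the generated mapping could look like once the table names are normalized as proposed in the comment above. The `nsfw_reports_{media_type}` pattern is quoted from that comment; the `MEDIA_TYPES` list and the `reports_table` helper are hypothetical, and these table names do not exist until the linked issue is resolved.

```python
# Hypothetical list of supported media types.
MEDIA_TYPES = ["image", "audio"]


def reports_table(media_type):
    """Build the reports table name, assuming the normalized naming scheme."""
    return f"nsfw_reports_{media_type}"


# Replaces a hand-maintained dict: adding a media type needs no extra config.
REPORTS_TABLES = {media_type: reports_table(media_type) for media_type in MEDIA_TYPES}
print(REPORTS_TABLES)
```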

Comment on lines +95 to +99
slack.send_message(
    "No records require review at this time :tada:",
    username="Reported Media Check-In",
)
return
Contributor

In order to prevent alert/notification fatigue, I think that it might be best not to send a slack message in this case 🙂 We can just emit a log message and exit.

Contributor Author

For me it feels useful to have a weekly peace-of-mind status report regardless -- this just eliminates the situation where you feel like you haven't heard from Reported Media in a while and must go manually check DAG logs. I do understand the concern about notification fatigue, but this would be once weekly in the crop of notifications that come in over the weekend.

I don't feel especially strongly about this though! It seems likely that these notifications may change as we update our processes for dealing with reported media at any rate 😄

Contributor

That's fair! You're right, once weekly isn't too frequent 😄
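To make the two options in this thread concrete, here is a small sketch of a message builder where the "nothing to report" behaviour is a flag. The `build_report_message` function, its `notify_when_empty` parameter, and the message layout are hypothetical; only the "No records require review" text is taken from the snippet under review.

```python
import logging

logger = logging.getLogger(__name__)


def build_report_message(report_counts_by_media_type, notify_when_empty=True):
    """Return the Slack text to send, or None when nothing should be sent."""
    lines = []
    for media_type, counts in report_counts_by_media_type.items():
        # Simplification: the total is the sum of the per-reason counts.
        total = sum(counts.values())
        if total == 0:
            continue
        breakdown = ", ".join(f"{reason}: {n}" for reason, n in sorted(counts.items()))
        lines.append(f"{media_type}: {total} pending ({breakdown})")

    if lines:
        return "\n".join(lines)
    if notify_when_empty:
        # Weekly peace-of-mind message, as in the code under review.
        return "No records require review at this time :tada:"
    # Alternative suggested in review: log only, send nothing.
    logger.info("No records require review; skipping Slack message.")
    return None


print(build_report_message({"image": {"mature": 2, "dmca": 1}, "audio": {}}))
```

Keeping the decision in one flag would make it cheap to revisit once a process for handling reported media exists.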

slack.send_alert(message, username="Reported Media Requires Review")


def create_dag():
Contributor

This totally isn't necessary since it's already written, but small DAGs like this would be a perfect trial for the TaskFlow API! I'm not sure how dynamic DAGs would work in it though... 🤔 maybe just a loop calling the airflow.decorators.dag decorator.

@stacimc
Contributor Author

stacimc commented May 31, 2022

> However, what is actionable about this information? Do we have documentation about how to handle these reports or guidelines on expectations (like SLAs)?
>
> Until we have those things I don't think they should go into the alerts channel, which should ideally only include actionable information. That can either mean not enabling the DAG until we've developed those processes or temporarily putting them into our notifications channel until they're actually actionable.

These are really important questions and to my knowledge we don't have the answers right now. It's possible that depending on what processes we introduce, the alerts may even need to be directed to an entirely different channel eventually. For the moment I think there's value to at least gaining insight into the problem as a first step.

@sarayourfriend
Contributor

> These are really important questions and to my knowledge we don't have the answers right now. It's possible that depending on what processes we introduce, the alerts may even need to be directed to an entirely different channel eventually. For the moment I think there's value to at least gaining insight into the problem as a first step.

Would it be distracting to redirect them into the notifications channel for the time being then? It'd be nice if things in the alerts channel were things that needed immediate attention and had clear actionability, like Sentry issue triaging or failed DAG runs (although I'm not sure if failed DAG runs have a clear action documented 🙂 if they do it'd be nice to link to the document describing what to do in the alerts for them).

@stacimc
Contributor Author

stacimc commented Jun 1, 2022

> Would it be distracting to redirect them into the notifications channel for the time being then? It'd be nice if things in the alerts channel were things that needed immediate attention and had clear actionability

I see your point, but it doesn't feel quite right to me to move this to notifications. Reported media is a problem and does require attention, and while the processes/SLAs for acting on them aren't clear, just moving it out of the channel seems like it would increase the likelihood of it getting lost and unaddressed. The notifications channel is also much more busy than alerts.

I think I may have the unpopular opinion here 😄 I can remove the notification for when no reports are found, and move the other one to the notifications channel. We would just need to be very deliberate about following up. Alternatively we can just wait on this PR until the process is solidified, since it's certain to change.

@sarayourfriend
Contributor

> Reported media is a problem and does require attention, and while the processes/SLAs for acting on them aren't clear, just moving it out of the channel seems like it would increase the likelihood of it getting lost and unaddressed.

This does make it sound like there's some action you're expecting to take when these come through 🙂 Even if the actions today are temporary "we're still figuring out what this means/documenting it/etc" it'd be nice to write that down and link to it from the notification so anyone is immediately able to either help progress that forward or know that they can ignore it and someone else who is keeping tabs on the particular project will come around to it after them.

@zackkrida
Member

I personally feel like the particularities of the channels and the response to these notifications are implementation details we can handle outside of the repository. To that end, I've started a dialog with WP Photo Directory folks about potentially sharing moderators and moderator policies. Until then I can personally handle these reports weekly.

@stacimc
Contributor Author

stacimc commented Jun 6, 2022

I created WordPress/openverse#1541 to track writing a runbook for handling reported media, and adjusting Slack notifications as necessary.

@stacimc stacimc merged commit 9098082 into main Jun 6, 2022
@stacimc stacimc deleted the feature/image-report-report-dag branch June 6, 2022 18:00
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟩 priority: low Low priority and doesn't need to be rushed 🔧 tech: airflow Requires familiarity with Apache Airflow 🐍 tech: python Requires familiarity with Python
Development

Successfully merging this pull request may close these issues.

Image report report DAG
5 participants