Content Moderation/Trust and Safety: Initial user stories #362

Merged
merged 5 commits into from
Feb 6, 2023

sarayourfriend
Collaborator

Description

This document is still in its preliminary stages and is meant as a place to collaborate on the list of user stories and derived technical and process requirements that we will need to embark on for trust and safety. Identifying the minimal list of requirements for content moderation and trust and safety can come after we've gotten a general view of the full scope of the initial set of needs.

Many Openverse sponsored contributors may be able to discuss these in person next week and collaborate on expanding this document. To participate otherwise, please leave any suggested user stories or technical and process requirements or assumptions that are missing for the list of user stories.

Checklist

  • My pull request has a descriptive title (not a vague title like
    Update index.md).
  • My pull request targets the default branch of the repository (main) or
    a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • [N/A] I added or updated tests for the changes I made (if applicable).
  • [N/A] I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible
    errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend added labels on Jan 8, 2023: 🟧 priority: high (Stalls work on the project or its dependents), 🌟 goal: addition (Addition of new feature), ♿️ aspect: a11y (Concerns related to the project's accessibility), 🔧 tech: django (Involves Django), 🔧 tech: airflow (Involves Apache Airflow), 💾 tech: postgres (Involves PostgreSQL)

language to describe the subject. I (Sara) do not know if this is something that
currently exists in Openverse, but it is something we could discuss with
providers if we discovered it to ensure that we are capturing the relevant
metadata. For example, some providers may include a note clarifying that a
Contributor

In general, I agree that we should show the provider's notes.
The implementation details can be complicated by the way the notes are displayed on the providers' sites or in their APIs and the way we ingest that data. I tried searching for some racial slurs on Openverse. One example I found is from Boston Public Library. We have ingested this item from their Flickr stream, which does not have any notices. However, the same item is also hosted on digitalcommonwealth.org, and there it has a notice banner at the top of the page: https://www.digitalcommonwealth.org/search/commonwealth:fq977w05r. I'm not sure if it's returned from the API, though.

Collaborator Author

That's a great example of the kind of thing we should do. For sensitive textual content we could pretty reliably attach a notice of our own for items like that.

That specific one does not look unique to the item, though; it seems like a generic one (like we could do just by scanning textual content for specific words). If there are GLAM providers that do have handwritten catalogue notes in this vein that are accessible to us, we should include them.

- Assumptions:
- Upon reingestion from a provider, the catalogue is able to note when results
have been removed from upstream providers and eagerly remove them from the
Openverse API without needing a full data refresh to occur
Contributor

Does this mean that we want to remove all the items that were not returned from the provider API when reingesting? This might mean that we remove a lot of items if the provider changes the API or if the API errors out for any reason. However, re-checking each individual item that wasn't present during reingestion also requires time and resources.

I wonder if adding some sort of scanning process for items that are not present at the provider when re-ingesting would help. Or we could also consider adding an "anti-boosting" parameter (is there a word for the action that is opposite of search rank boosting?) to all of such items, assuming that they were removed from the provider for some reason.
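
To make the scanning idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the function names are hypothetical, and `exists_upstream` stands in for whatever per-item check a given provider would actually allow (for example, a single-item API lookup). Nothing here reflects existing Openverse code.

```python
from typing import Callable, Dict, Iterable, List, Set


def find_missing(known_ids: Iterable[str], reingested_ids: Iterable[str]) -> Set[str]:
    """Return identifiers we already hold that the provider did not return this time."""
    return set(known_ids) - set(reingested_ids)


def scan_missing(
    missing: Iterable[str],
    exists_upstream: Callable[[str], bool],  # hypothetical per-item check
) -> Dict[str, List[str]]:
    """Re-check each missing item individually and split the items by outcome."""
    result: Dict[str, List[str]] = {"still_available": [], "gone": []}
    for identifier in sorted(missing):
        key = "still_available" if exists_upstream(identifier) else "gone"
        result[key].append(identifier)
    return result
```

Only the items that land in `gone` would be candidates for removal or demotion, which avoids dropping everything when a provider API merely errors out or changes shape. The cost concern still applies, since the check runs once per missing item.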

Collaborator Author

Oh you're right. Even from a non-technical perspective, it's a difficult question. Expanding on your list, I can think of a bunch of reasons why a work might not appear in the provider API anymore, and there's almost no way for us to distinguish between them.

  1. The creator deletes their account or an individual validly licensed work. In this case, the works are still CC licensed (assuming they were correctly licensed to begin with, i.e., not stolen) and part of the commons and would ideally still be accessible. We've discussed this a couple of times in the past and mused about whether Openverse has a role in "preserving" the commons somehow (maybe by uploading things to Archive.org or something like that).
  2. DMCA takedown. We shouldn't distribute these.
  3. Sensitive material takedown. We probably shouldn't distribute these either, assuming we agree with the provider. The provider could, however, be participating in government censorship that we don't care to be a part of. Do we then preserve these as in the first case, assuming they're correctly licensed CC works?
  4. Illegal material takedown. We shouldn't distribute these.
  5. API errors (as you mentioned).
  6. API changes (also as you mentioned).

Each of these is different, and I don't think we could easily tell the difference between any of them in an automated or even manual way without heavy provider involvement.

One thing to note that I forgot about when writing this is that we will eventually stop serving those results in search because the links will be dead. They'll disappear from search after the cached success response expires (30 days from the first appearance in search). The thumbnail will continue to exist though, and I think the single result will as well.

Complicated issue. I don't have any concrete suggestions for this at the moment.

I wonder if adding some sort of scanning process for items that are not present at the provider when re-ingesting would help. Or we could also consider adding an "anti-boosting" parameter (is there a word for the action that is opposite of search rank boosting?) to all of such items, assuming that they were removed from the provider for some reason.

Boosting them downward is an interesting idea. You might be able to apply negative or fractional boosting scores to documents in ES that would cause their scores to plummet. At that point though, I wonder if we could "soft delete" them by setting a flag that would just exclude them from search? Maybe we should just play it safe and exclude them from single results as well for now until we have clearer understandings of what the alternatives and implications of those alternatives would be?
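
As a rough sketch of the two options, the snippet below builds Elasticsearch query bodies as plain Python dictionaries. The `boosting` query (with `positive`, `negative`, and `negative_boost`) and the `bool` query's `must_not` clause are standard Elasticsearch query DSL. The `missing_upstream` field is purely hypothetical; it assumes documents would carry a boolean flag set when an item stops appearing at the provider.

```python
from typing import Any, Dict

# Hypothetical boolean field marking items no longer returned by the provider.
FLAG_FIELD = "missing_upstream"


def demote_flagged(query: Dict[str, Any], negative_boost: float = 0.2) -> Dict[str, Any]:
    """Wrap a query so flagged documents still match but score lower.

    Matches of the `negative` clause have their score multiplied by
    `negative_boost`, which should be between 0 and 1.
    """
    return {
        "boosting": {
            "positive": query,
            "negative": {"term": {FLAG_FIELD: True}},
            "negative_boost": negative_boost,
        }
    }


def exclude_flagged(query: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a query so flagged documents are filtered out (the "soft delete" option)."""
    return {
        "bool": {
            "must": [query],
            "must_not": [{"term": {FLAG_FIELD: True}}],
        }
    }
```

Either result would go under the top-level `query` key of a search request. Demotion keeps the items discoverable, while exclusion is the safer default, and both are reversible because the documents themselves are untouched.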

Very tricky issue!

Member

One discussed approach to this was to create {mediaType}-removed tables in the Catalog DB in which we insert any records removed from the main tables. We could match the schema of the main tables but also add removed_on and removed_reason columns which would indicate the date of removal and the reason for removal. As discussed, we likely can't always find the specific reason why an image was removed/became unavailable from the provider (although perhaps some providers return different error codes in different situations, as an example, and we could leverage that) but this column would also be used for media we remove from Openverse (for various content safety and copyright reasons).

This would allow us to create a DAG or other mechanism in the future to re-crawl the items in the -removed tables, either all of them or only the ones with specific values in the removed_reason column (for example, we could re-crawl only the items that returned a 503 error during the data refresh).

Collaborator Author

Sounds great to me.

Co-authored-by: Olga Bulat <[email protected]>
@zackkrida
Member

Just leaving this here: it's an approach to programmatic cache clearing in CloudFlare:

https://community.cloudflare.com/t/worker-recipe-cache-purge-proxy/29978

@sarayourfriend
Collaborator Author

@WordPress/openverse What do folks want to do with this PR? Shall I close it? Shall I merge it? Will things from the offline discussion y'all had be added to this?

@AetherUnbound
Collaborator

AetherUnbound commented Jan 30, 2023

I think I'd like to have this merged even if it doesn't yet include feedback from our discussion!

@zackkrida
Member

@sarayourfriend I'd like to undraft and merge these. They're great. We can have someone update them later after publishing the notes from the offline sync session we had on content safety.

@sarayourfriend sarayourfriend marked this pull request as ready for review January 30, 2023 20:42
@sarayourfriend sarayourfriend requested a review from a team as a code owner January 30, 2023 20:42
@openverse-bot
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@AetherUnbound
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Collaborator

@AetherUnbound AetherUnbound left a comment

This was so thorough and really helpful for our discussion about content safety! User stories are a great way of approaching this kind of scoping and I'll try to use this approach in the future for other projects 😄

@zackkrida zackkrida merged commit 669ebd7 into main Feb 6, 2023
@zackkrida zackkrida deleted the trust-and-safety/project-planning branch February 6, 2023 19:21
dhruvkb pushed a commit that referenced this pull request Apr 14, 2023
Labels
♿️ aspect: a11y Concerns related to the project's accessibility 🌟 goal: addition Addition of new feature 🟧 priority: high Stalls work on the project or its dependents 🔧 tech: airflow Involves Apache Airflow 🔧 tech: django Involves Django 💾 tech: postgres Involves PostgreSQL
6 participants