Content Moderation/Trust and Safety: Initial user stories #362
Conversation
> language to describe the subject. I (Sara) do not know if this is something that
> currently exists in Openverse, but it is something we could discuss with
> providers if we discovered it to ensure that we are capturing the relevant
> metadata. For example, some providers may include a note clarifying that a
In general, I agree that we should show the provider's notes.
The implementation details can be complicated by the way the notes are displayed on the providers' sites or in their APIs and the way we ingest that data. I tried searching for some racial slurs on Openverse. One example I found is from Boston Public Library. We ingested this item from their Flickr stream, which does not have any notices. However, the same item is also hosted on digitalcommonwealth.org, and there it has a notice banner at the top of the page: https://www.digitalcommonwealth.org/search/commonwealth:fq977w05r. I'm not sure if it's returned from the API, though.
That's a great example of the kind of thing we should do. For sensitive textual content we could pretty reliably attach a notice of our own for items like that.
That specific one does not look unique to the item though, it seems like a generic one (like we could do just by scanning textual content for specific words). If there are GLAM providers that do have handwritten catalogue notes that are accessible to us in this vein we should include them though.
rfcs/trust-and-safety/20230109-trust_and_safety_preliminary_overview_and_exploration.md
> - Assumptions:
>   - Upon reingestion from a provider, the catalogue is able to note when results
>     have been removed from upstream providers and eagerly remove them from the
>     Openverse API without needing a full data refresh to occur
Does this mean that we want to remove all the items that were not returned from the provider API when reingesting? This could mean removing a lot of items if the provider changes its API or if the API errors out for any reason. However, re-checking each individual item that wasn't present during reingestion also requires time and resources.
I wonder if adding some sort of scanning process for items that are not present at the provider when re-ingesting would help. Or we could also consider adding an "anti-boosting" parameter (is there a word for the action that is the opposite of search rank boosting?) to all such items, assuming that they were removed from the provider for some reason.
Oh you're right. Even from a non-technical perspective, it's a difficult question. Expanding on your list, I can think of a bunch of reasons why a work might not appear in the provider API anymore, and there's almost no way for us to distinguish between them.
- The creator deletes their account or an individual validly licensed work. In this case, the works are still CC licensed (assuming they were correctly licensed to begin with, i.e., not stolen) and part of the commons and would ideally still be accessible. We've discussed this a couple of times in the past and mused about whether Openverse has a role in "preserving" the commons somehow (maybe by uploading things to Archive.org or something like that).
- DMCA takedown. We shouldn't distribute these.
- Sensitive material takedown. We probably shouldn't distribute these either, assuming we agree with the provider's reasoning. However, the provider could be participating in government censorship that we don't want to be a part of; do we then preserve these as in the first case, assuming they're correctly licensed CC works?
- Illegal material takedown. We shouldn't distribute these.
- API errors (as you mentioned).
- API changes (also as you mentioned).
Each of these is different, and I don't think we could easily tell the difference between any of them in an automated or even manual way without heavy provider involvement.
One thing to note that I forgot about when writing this is that we will eventually stop serving those results in search because the links will be dead. They'll disappear from search after the cached success response expires (30 days from the first appearance in search). The thumbnail will continue to exist though, and I think the single result will as well.
Complicated issue. I don't have any concrete suggestions for this at the moment.
> I wonder if adding some sort of scanning process for items that are not present at the provider when re-ingesting would help. Or we could also consider adding an "anti-boosting" parameter (is there a word for the action that is the opposite of search rank boosting?) to all such items, assuming that they were removed from the provider for some reason.
Boosting them downward is an interesting idea. You might be able to apply negative or fractional boosting scores to documents in ES that would cause their scores to plummet. At that point, though, I wonder if we could "soft delete" them by setting a flag that would simply exclude them from search? Maybe we should play it safe and exclude them from single results as well for now, until we have a clearer understanding of the alternatives and their implications.
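To make the two options above concrete, here is a minimal sketch of what each could look like as an Elasticsearch query body. The `removed_from_source` flag is hypothetical; Openverse's actual index mapping and query construction may differ.

```python
# Sketch only: assumes a hypothetical boolean field `removed_from_source`
# on each document in the media index.

def down_rank_removed(base_query: dict, negative_boost: float = 0.1) -> dict:
    """'Anti-boosting': flagged docs still match, but their score is
    multiplied by `negative_boost`, pushing them far down the results."""
    return {
        "query": {
            "boosting": {
                "positive": base_query,
                "negative": {"term": {"removed_from_source": True}},
                "negative_boost": negative_boost,
            }
        }
    }


def soft_delete_filter(base_query: dict) -> dict:
    """'Soft delete': flagged docs are excluded from results entirely."""
    return {
        "query": {
            "bool": {
                "must": [base_query],
                "must_not": [{"term": {"removed_from_source": True}}],
            }
        }
    }


base = {"match": {"title": "dog"}}
down_ranked = down_rank_removed(base)
filtered = soft_delete_filter(base)
```

The trade-off mirrors the discussion: the `boosting` variant keeps works reachable (deep in the results) while the `bool`/`must_not` variant hides them completely, which is the safer default until the removal reason is known.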
Very tricky issue!
One discussed approach to this was to create `{mediaType}-removed` tables in the Catalog DB into which we insert any records removed from the main tables. We could match the schema of the main tables but also add `removed_on` and `removed_reason` columns to indicate the date of and reason for removal. As discussed, we likely can't always determine the specific reason why an image was removed or became unavailable from the provider (although perhaps some providers return different error codes in different situations, and we could leverage that), but this column would also be used for media we remove from Openverse ourselves (for various content safety and copyright reasons).
This would allow us to create a DAG or other mechanism in the future to re-crawl the items in the `-removed` tables, either all of them or only those with specific values in the `removed_reason` column (for example, we could re-crawl only items that returned a 503 error during the data refresh).
Sounds great to me.
Co-authored-by: Olga Bulat <[email protected]>
Just leaving this here: it's an approach to programmatic cache clearing in Cloudflare: https://community.cloudflare.com/t/worker-recipe-cache-purge-proxy/29978
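For context on what a programmatic purge involves, here is a sketch that only constructs (does not send) a request against Cloudflare's zone purge endpoint, which could clear stale cached thumbnails of removed works. The zone ID, token, and URL are placeholders, and the exact payload shape should be checked against Cloudflare's API documentation.

```python
import json

ZONE_ID = "your-zone-id"  # placeholder, not a real zone

# Cloudflare's cache purge endpoint accepts a list of exact URLs to evict.
endpoint = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/purge_cache"
payload = json.dumps(
    {"files": ["https://example.org/v1/images/some-identifier/thumb/"]}  # placeholder URL
)
headers = {
    "Authorization": "Bearer <api-token>",  # placeholder credential
    "Content-Type": "application/json",
}
# An HTTP client would then POST `payload` with `headers` to `endpoint`.
```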
@WordPress/openverse What do folks want to do with this PR? Shall I close it? Shall I merge it? Will things from the offline discussion y'all had be added to this?
I think I'd like to have this merged even if it doesn't yet include feedback from our discussion!
@sarayourfriend I'd like to undraft and merge these. They're great. We can have someone update them later after publishing the notes from the offline sync session we had on content safety.
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal. Excluding weekend days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekdays. @sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
This was so thorough and really helpful for our discussion about content safety! User stories are a great way of approaching this kind of scoping and I'll try to use this approach in the future for other projects 😄
Description
This document is still in its preliminary stages and is meant as a place to collaborate on the list of user stories and derived technical and process requirements that we will need to embark on for trust and safety. Identifying the minimal list of requirements for content moderation and trust and safety can come after we've gotten a general view of the full scope of the initial set of needs.
Many Openverse sponsored contributors may be able to discuss these in person next week and collaborate on expanding this document. To participate otherwise, please leave any suggested user stories, technical and process requirements, or assumptions that are missing from the list of user stories.
Developer Certificate of Origin