RFC: Catalog data cleaning #345
Conversation
rfcs/20221209-data-cleaning.md
> After evaluating what needs to be done to get rid of this duplicated cleaning step, I collected the relevant issues of the openverse-catalog into a new milestone, [Data cleaning unification][milestone], and determined a rough plan to solve them:
>
> 1. Include the ingestion server cleaning steps in the ImageStore class
>    - correct tags
>    - correct URI protocol
> 2. Create and run an image_cleaner workflow as described above
> 3. Remove the cleanup step for images from the ingestion process
Thanks for pulling the issues into the milestone and for writing out the steps of the plan. Can you confirm whether the milestone's issues are ordered according to this plan? It's not clear to me from the issue titles which of them correspond to the relevant parts of the plan. Some of them seem easy enough to tie to a specific step of this plan, others less so.
Would it make sense to link the issues to the steps described here to make that clearer? Additionally, do the issues in the milestone need to be re-written or re-phrased to match the intention of this RFC? Forgive me if I'm missing context that pulls them together; I may just be misunderstanding the language used in the issues. This one for example: https://github.com/WordPress/openverse-catalog/issues/510. Does "when loading ingested data" mean adding it to the ingestion server project as part of the data refresh? Or would it be changed to reference adding it to the ImageStore cleaning routine? If they need re-writing, I guess that would happen after the RFC is approved?
Again, sorry if I'm misunderstanding something here and please let me know if I am.
The issues have not been rewritten to reflect the intended changes. In the case of the issue you mentioned, it seems that was the goal at the moment of creation, but adding more steps to the ingestion server is precisely what we want to avoid here. So yes, I can re-phrase them for sure, if the plan makes sense to the team.

Also, I'm noticing we need an issue specifically for recovering the `image_cleaner` DAG, a central piece! Beyond that, I don't think a strict order is required other than the one described here in the RFC: issues related to validations need to be resolved before running the cleaner DAG, and removing the cleanup steps from the ingestion server should be the last one.
> After evaluating what needs to be done to get rid of this duplicated cleaning step, I collected the relevant issues of the openverse-catalog into a new milestone, [Data cleaning unification][milestone], and determined a rough plan to solve them:
>
> 1. Include the ingestion server cleaning steps in the ImageStore class
These steps are already included in the `ImageStore` class: first, the tags are cleaned in the `clean_media_metadata` method, and then, when saving the items to a TSV, the URL fields are cleaned in the `Column`'s `prepare_strings` method.
Does that mean they are also happening in the `AudioStore` class, or do we need to pull that logic into the `MediaStore` base class?
From what I can see in the `_enrich_tags` method (called from `clean_media_metadata`), it's only adding the provider from which the tag comes. We still need to clean duplicates and complete the deny/exclude list.

So the placement of the URL validation is interesting. I thought it would fit better into the `ImageStore` class since we're concentrating the validation there. Do you think it should be moved into the class, or is its current place fine?
> Does that mean they are also happening in the AudioStore class

As I understand it, the steps Olga mentions are already happening for both Audio and Image. `clean_media_metadata` is defined on the `MediaStore` and called by both subclasses, and both classes use the `UrlColumn` for their url fields (and thus get the url validation). Agreed that the duplication and denylist/accuracy threshold checks would need to be added, though.

> So the placement of the URL validation is interesting. I thought it would fit better into the ImageStore class as we're concentrating there the validation.

I don't mind the url validation where it is, but you could certainly make a case that it should be moved -- but I would say to `clean_media_metadata` in the `MediaStore`, to make sure it continues to be used by Audio as well. In the past I've definitely taken a bit longer to find where that validation is happening, since it's happening in a different place!
> Does that mean they are also happening in the AudioStore class or do we need to pull that logic into the MediaStore base class?

Sorry for being unclear: the meta_data clean-up steps were extracted to the `MediaStore` when we added Audio. All of the Audio data fields also go through the clean-up step in the Columns' `prepare_string` methods. So, in short, audio does not need to be cleaned, except for one piece I'll mention below: the duplicate tags.
The `_enrich_tags` method does remove the tags that are on the denylist:
https://github.com/WordPress/openverse-catalog/blob/5933f712d2d017eb1d952c68fffc6f1606d58eb1/openverse_catalog/dags/common/storage/media.py#L267

But it doesn't remove duplicate tags! So all of the tags need to be de-duplicated, and we actually need to add tag de-duplication to the `_enrich_tags` method in `MediaStore`.
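As a sketch, the missing de-duplication could look something like the following. The helper name and the tag shape (dicts with `name` and `provider` keys) are assumptions for illustration, not the actual `MediaStore` code:

```python
def dedupe_tags(tags):
    """Keep the first occurrence of each (name, provider) pair, preserving order."""
    seen = set()
    result = []
    for tag in tags:
        key = (tag.get("name"), tag.get("provider"))
        if key not in seen:
            seen.add(key)
            result.append(tag)
    return result


tags = [
    {"name": "cat", "provider": "flickr"},
    {"name": "cat", "provider": "flickr"},  # exact duplicate, dropped
    {"name": "dog", "provider": "flickr"},
]
print(dedupe_tags(tags))  # [{'name': 'cat', ...}, {'name': 'dog', ...}]
```

Matching on the full pair (rather than just the name) would preserve identically named tags contributed by different providers, if that distinction matters.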
For the placement of URL validation: I like the clean separation of concerns where the column is responsible for validating and creating SQL insert/update strings. Would adding documentation to the `MediaStore` class about where the validation for various columns happens help with discoverability? I think moving the validation to `ImageStore` would mean a lot more code duplication, because we would be calling `validate` methods in both `ImageStore` and `AudioStore` for all of the URL fields that we have.

I can see that it's unclear that a method called `_create_tsv_row` actually also performs column validation. I wonder if renaming it and adding clearer docstrings would help?
I've been thinking over how best to implement this project for a long time, and here are some thoughts I had about it:

**Constraints**

When working with the upstream database, we have the following constraints:

**Prior work**

Fortunately, all the data collected after the introduction of the ImageStore class is cleaned up. The tags are cleaned up in the `clean_media_metadata` method. Each media property has its own column type in the MediaStore. In the `_create_tsv_row` method, each value is cleaned using the appropriate column type's `prepare_strings` method. For instance, the `URLColumn` uses the `urls.validate_url_string` method to add https, if possible, and to make the links absolute.

There was a cleaner DAG created at CC that would select some rows in the upstream table, clean them up, and replace the upstream data with the cleaned data. This should work if we decide to use it, but I suspect it would take a long time due to the constraints.

**Proposal**

We can update the "cleaner DAG" to make the process of data cleanup faster by changing it in several ways:

**Note**

Due to the columnar structure of parquet files, I also had an idea of a possibly more efficient process: updating specific columns instead of going through the data row by row. For example, we could look only into the URL columns and run the URL cleanup on them, or read the two columns of license and license_version and write the third column with the license_url. However, I learned that you cannot update parquet files, only write new ones. And with 600 000 000 rows, it might be a very expensive TSV file to store.
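The URL cleanup described above (adding https where possible and making links absolute) can be illustrated with a small stand-in. This is a hypothetical re-creation of the described behavior, not the actual `urls.validate_url_string` implementation; the function name and `source_domain` parameter are assumptions:

```python
from urllib.parse import urlparse


def normalize_url(url, source_domain="example.com"):
    """Sketch: upgrade scheme-less or protocol-relative URLs to https
    and turn site-relative paths into absolute links."""
    if url.startswith("//"):  # protocol-relative URL
        return "https:" + url
    if urlparse(url).scheme:  # already has a protocol, leave it alone
        return url
    if url.startswith("/"):  # site-relative path
        return f"https://{source_domain}{url}"
    return "https://" + url  # bare domain, assume https works


print(normalize_url("//img.example.com/a.jpg"))
print(normalize_url("/photos/1", source_domain="museum.example"))
```

The real implementation may also verify that the https upgrade actually resolves, which a sketch like this cannot capture.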
I'm glad to see we have an existing implementation to lean on for going back and updating all records! I would be curious to know if there is any information we could find on how long that workflow would take to run (and if it was ever run at all).
Since we already have the issues and the milestone, are we planning on having a technical implementation plan for this project as well? Perhaps these questions might be better suited there if so, but I'll ask them here too:
- Do you think we'll still be able to run ingestion of new records while the cleanup of existing data is happening? I suspect so since those new records should already be compliant, but I wanted to confirm with your thoughts.
- Similarly, would we want to stop the data refresh process while this is running? Again, I think we could leave it on and this could all happen simultaneously, but it could impact the performance of the data refresh. Given that I don't have a great sense of how long the cleaning could take, I'd hate to pause data refreshes for several weeks/months while that happens.
- With data migrations this large, I worry about what testing them might look like. Would we perhaps want to grab some sample data from each of the older providers for running small-scale tests on top of our unit tests? Could we do a data refresh of that small subset before and after the cleaning has been run on it to verify that none of the ingestion server cleaning steps actually needed to run? Unit tests will be helpful here but if we could also have one or several different data consistency/integration tests as well, that would be fantastic.
Thank you for codifying this all @krysal!
@krysal could you add a due date to the PR description? My recommendation would be two more weeks, at most, but up to you. I just want to make sure reviewers have clear expectations on when they need to comment. Thanks!
@obulat Thank you for sharing your thoughts. Replying to some points:
It should be noted that structural changes to DB tables (adding or modifying columns) are not included here. I raised the flag for those on the Make Openverse Blog, but I'm not planning to include migrations in this project.

That is an interesting idea I hadn't considered. Do you mean the inherited parquet files we have? I'm not aware of which part of the data those files cover, but it would be interesting to see how much it improves the performance of the cleaning process.
Agree 💯
This sounds like the most appealing part of the parquet files. I, admittedly, don't have much experience with columnar storage aside from theoretical knowledge, so we could create an issue to research this option and do some experiments.
Presumably, it is only the data ingested prior to the inclusion of the
We were informed that the DAG probably never ran, so I don't have information on how much time it would take (surely a considerable amount). Regarding whether the ingestion and refresh processes will need to stop, it depends on the strategy the cleaning DAG will use. If updates can be performed on the same table, then I agree it may not need to be stopped, but @obulat's notes suggest it will probably need to do a table switch, similar to the one performed by the ingestion process, so in that case we'd need to stop the DAGs in order not to lose info.

This is the point where it becomes more noticeable that the heavy work of the project will concentrate on optimizing the cleaner DAG. I'd also like to be able to do it in batches, which sounds possible.
Thank you for writing this up! This is a very clear plan.
I'm really interested in getting a better understanding of roughly how many records we actually need to clean, given it should only apply to records ingested before a certain date -- and as I understand it, not reingested thereafter. As @obulat points out, a select for rows with invalid columns would likely be prohibitively time consuming, but it would be amazing if we could find some efficient ways to dramatically reduce the amount of records we need to process 🤔
I'll echo concerns from @AetherUnbound about testing. Should there be an additional technical investigation into optimization strategies for the DAG, and maybe testing concerns? This is another scenario where I wish we had a staging environment for the Catalog 😟
Could we consider having the cleanup DAG take a provider name as configuration, and run cleanup for a single provider at a time as a way to minimize risk?
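A provider-scoped cleanup would still need to walk a very large number of rows, so batching matters. Below is a minimal, pure-Python stand-in for the keyset-pagination idea (resuming each batch after the last identifier seen, mirroring a hypothetical `WHERE identifier > %s ORDER BY identifier LIMIT %s` query); all names here are illustrative assumptions, not the actual DAG code:

```python
def keyset_batches(rows, batch_size):
    """Yield batches of rows (sorted by "identifier"), resuming each
    batch after the last identifier of the previous one. Keyset
    pagination keeps each step cheap, unlike a growing OFFSET."""
    last = ""
    while True:
        batch = [r for r in rows if r["identifier"] > last][:batch_size]
        if not batch:
            return
        yield batch
        last = batch[-1]["identifier"]


rows = [{"identifier": f"id-{i:02d}", "provider": "met"} for i in range(5)]
for batch in keyset_batches(rows, 2):
    print([r["identifier"] for r in batch])
```

In the real DAG, each batch would be fetched, cleaned with the MediaStore steps, and written back in its own transaction, so a failure only loses one batch of progress.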
rfcs/20221209-data-cleaning.md

> # RFC: Cleaning up the upstream database
>
> One of the steps of the [data refresh process for images][data-refresh] is cleaning data that is not fit for production. This process runs weekly as Airflow DAGs; the cleaned data is only saved to the API database (DB), which is replaced when the data refresh finishes, so it needs to be cleaned every time. This cleaning step has lately taken up to 16 hours. Its duration has been stable since the strategy changed to perform the validations and transformations of the data upfront, when pulling from providers rather than when copying the upstream catalog DB into the API DB, but images ingested prior to this change remained untouched.
> images ingested prior to this change remained untouched.
Is this something we can account for efficiently in the cleanup DAG, to reduce the number of images that need to be cleaned?
Because many of our DAGs are not dated, and therefore re-ingest all data on each run (now using the `MediaStore`, even if they did not when originally ingested), does this mean that all records for those providers should be okay? Likewise, for our dated DAGs, do we only need to be concerned about records that have not recently been reingested?

The only other exception I can think of is records which were ingested before the introduction of the MediaStores and then deleted from the provider (which is not caught by our DAGs). I also know that we have some records not associated with provider DAGs, which will definitely need to be cleaned.
Can you confirm that the URLs are replaced with the validated URLs, and that the tags are replaced and not simply added to during re-ingestion, @stacimc? Tags use a JSON Array Column with the following strategy when upserting:
I can't really understand what exactly is happening here :)
URLs (and most other data) are overwritten during ingestion, but you're right @obulat that tags are only added to. Confirmed locally:
- Ingest a record A with tag names "foo" and "bar"
- The external provider updates the record to delete/update tags, so it now has only tag name "FOO"
- Re-ingest record A
- The record in the Catalog will now contain tag names "foo", "bar", and "FOO"
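The observed merge semantics can be simulated in plain Python. This is a sketch of the behavior described above (incoming tags appended, existing ones never removed, exact-match comparison), not the actual SQL upsert:

```python
def upsert_tags(existing, incoming):
    """Mimic the observed upsert: append incoming tags that aren't
    already present by exact name, never delete existing ones, so a
    case change like "FOO" does not replace "foo"."""
    present = {t["name"] for t in existing}
    return existing + [t for t in incoming if t["name"] not in present]


record = [{"name": "foo"}, {"name": "bar"}]
record = upsert_tags(record, [{"name": "FOO"}])  # provider deleted "bar", renamed "foo"
print([t["name"] for t in record])  # ['foo', 'bar', 'FOO']
```

This makes the consequence concrete: stale tags accumulate forever under the current strategy, which is why a replace-on-upsert change would itself require a one-time cleanup pass.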
I do think that feels like an additional issue -- maybe one that should be handled here? @krysal what do you think? For non-dated DAGs, if we end up changing that upsert strategy we'd either need to (a) do something like the data cleanup DAG again, or (b) commit to letting the reingestion workflows slowly update them.
That being said, as far as the steps in the data cleanup go I think the point stands -- for non-dated DAGs, they should be updated during regular ingestion once the additional steps are added to the MediaStore. Does that make sense or do you see other edge cases?
Still the only exception I see is dead records that have been deleted from the original provider but are still in our catalog. It would be nice to get a better picture of how many of those we have.
I think we need at least 3 more weeks considering that the next two weeks will be AFK for many contributors.
(Older) Flickr data often has problems with URLs, and selecting a single provider does not help when it has [checking the latest numbers] 468 372 078 (!) records.
I understand the wish to minimize the amount of work in a single project to make it achievable. My reservation with this is that if we decide not to add or remove columns, we will have to repeat a very costly operation in the future.
Our upstream database is backed up to parquet files regularly, and I thought we could use the backup files for this...
100%, it definitely does not help speed up Flickr. I'm mainly suggesting this in response to concerns about testing -- maybe we can run the cleanup on a single (much, much smaller) provider first.
+1. I think we need that amount of time, given AFK and the amount of open questions about the alternatives. @obulat has suggested using the backup parquet files 😮 It's a really interesting idea, and the point about potentially needing to repeat a costly operation stands. Regarding that approach: my assumption is that ingestion/data refresh would need to be halted during that process, which would be a significant con. Do you think that's the case @obulat, or do you see a way around that?
This seems like a great idea, to test on a tiny provider. It wouldn't be perfect, but we could use that to make estimates about how long larger providers might take. Something that could work similarly is to also have a max number of records to process, so we could try a config with a small limit first.
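For illustration, such a trigger configuration could be read like this. The key names (`provider`, `max_records`) are assumptions for the sketch, not existing DAG parameters:

```python
def cleanup_config(conf):
    """Read optional scoping limits from a DAG trigger configuration
    dict, with permissive defaults (None = no restriction)."""
    max_records = conf.get("max_records")
    return {
        "provider": conf.get("provider"),  # None means all providers
        "max_records": int(max_records) if max_records else None,
    }


print(cleanup_config({"provider": "met", "max_records": "1000"}))
print(cleanup_config({}))  # an unrestricted, full-catalog run
```

Defaulting to "no restriction" keeps the full-catalog run available while letting test runs opt in to a single small provider and a record cap.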
These are great ideas and important concerns raised here. Thank you all for the valuable comments 😄 It seems that this requires some investigation into how many rows need cleaning and more details on the approach the cleaner DAG will take.
Co-authored-by: Madison Swain-Bowden <[email protected]>
Closing this as it has been stale while we discuss the 2023 planning and the "Asynchronous consent decision-making" process for Openverse. The conversation can be reopened or reframed in a new proposal later using the new guidelines. That's not to say I'm letting go of this job. The next steps are:
Description
This outlines the rationale and a rough plan for cleaning up the Catalog's database once and for all. Please leave your comments!
Reviewers
Developer Certificate of Origin