RFC: Catalog data cleaning #345
Conversation
rfcs/20221209-data-cleaning.md
> After evaluating what needs to be done to get rid of this duplicated cleaning step, I collected the relevant issues of the openverse-catalog into a new milestone, [Data cleaning unification][milestone], and determined a rough plan to solve them:
>
> 1. Include the ingestion server cleaning steps in the ImageStore class
>    - correct tags
>    - correct URI protocol
> 2. Create and run an image_cleaner workflow as described above
> 3. Remove the cleanup step for images from the ingestion process
Thanks for pulling the issues into the milestone and for writing out the steps of the plan. Can you confirm whether the milestone's issues are ordered according to this plan? It's not clear to me from the issue titles which of them correspond to the relevant parts of the plan. Some of them seem easy enough to tie to a specific step of this plan, others less so.
Would it make sense to link the issues to the steps described here to make that clearer? Additionally, do the issues in the milestone need to be re-written or re-phrased to match the intention of this RFC? Forgive me if I'm missing context that pulls them together; I may just be misunderstanding the language used in the issues. This one for example: https://github.com/WordPress/openverse-catalog/issues/510. Does "when loading ingested data" mean adding it to the ingestion server project as part of the data refresh? Or would it be changed to reference adding it to the ImageStore cleaning routine? If they need re-writing, I guess that would happen after the RFC is approved?
Again, sorry if I'm misunderstanding something here and please let me know if I am.
The issues have not been rewritten to reflect the intended changes. In the case of the issue you mentioned, it seems that was the goal at the moment of creation, but adding more steps to the ingestion server is precisely what we want to avoid here. So yes, I can re-phrase them for sure, if the plan makes sense to the team.

Also, I'm noticing we need an issue specifically for recovering the `image_cleaner` DAG, a central piece! Beyond that, I don't think a strict order is required other than the one described here in the RFC: issues related to validations need to be resolved before running the cleaner DAG, and removing the cleanup steps from the ingestion server should be the last one.
> After evaluating what needs to be done to get rid of this duplicated cleaning step, I collected the relevant issues of the openverse-catalog into a new milestone, [Data cleaning unification][milestone], and determined a rough plan to solve them:
>
> 1. Include the ingestion server cleaning steps in the ImageStore class
These steps are already included in the `ImageStore` class: first, the tags are cleaned in the `clean_media_metadata` method, and then, when saving the items to a TSV, the URL fields are cleaned in the `Column`'s `prepare_strings` method.
Does that mean they are also happening in the `AudioStore` class, or do we need to pull that logic into the `MediaStore` base class?
From what I can see in the `_enrich_tags` method (called from `clean_media_metadata`), it's only adding the provider from which the tag comes. We still need to clean duplicates and complete the deny/exclude list.

So the placement of the URL validation is interesting. I thought it would fit better into the `ImageStore` class since we're concentrating the validation there. Do you think it should be moved into the class, or is its current place fine?
> Does that mean they are also happening in the AudioStore class

As I understand it, the steps Olga mentions are already happening for both Audio and Image. `clean_media_metadata` is defined on the `MediaStore` and called by both subclasses, and both classes use the `UrlColumn` for their url fields (and thus get the url validation). Agreed that the duplication and denylist/accuracy threshold checks would need to be added, though.

> So the placement of the URL validation is interesting. I thought it would fit better into the ImageStore class as we're concentrating there the validation.

I don't mind the url validation where it is, but you could certainly make a case that it should be moved -- but I would say to `clean_media_metadata` in the `MediaStore`, to make sure it continues to be used by Audio as well. In the past I've definitely taken a bit longer to find where that validation is happening, since it's happening in a different place!
> Does that mean they are also happening in the AudioStore class or do we need to pull that logic into the MediaStore base class?

Sorry for being unclear: the meta_data clean-up steps were extracted to the `MediaStore` when we added Audio. All of the Audio data fields also go through the clean-up step in the Columns' `prepare_string` methods. So, in short, audio does not need to be cleaned, except for one piece I'll mention below: the duplicate tags.
The `_enrich_tags` method does remove the tags that are on the denylist:
https://github.com/WordPress/openverse-catalog/blob/5933f712d2d017eb1d952c68fffc6f1606d58eb1/openverse_catalog/dags/common/storage/media.py#L267

But it doesn't remove duplicate tags! So all of the tags need to be de-duplicated, and we actually need to add tag de-duplication to the `_enrich_tags` method in `MediaStore`.
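As a sketch, the missing de-duplication could look something like the following. The helper name and the tag shape (dicts with `name` and `provider` keys) are assumptions for illustration, not the actual `MediaStore` code:

```python
def dedupe_tags(tags):
    """Keep the first occurrence of each (name, provider) pair, preserving order."""
    seen = set()
    result = []
    for tag in tags:
        key = (tag.get("name"), tag.get("provider"))
        if key not in seen:
            seen.add(key)
            result.append(tag)
    return result


tags = [
    {"name": "cat", "provider": "flickr"},
    {"name": "cat", "provider": "flickr"},  # exact duplicate, dropped
    {"name": "dog", "provider": "flickr"},
]
print(dedupe_tags(tags))  # [{'name': 'cat', ...}, {'name': 'dog', ...}]
```

Matching on the full pair (rather than just the name) would preserve identically named tags contributed by different providers, if that distinction matters.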
For the placement of URL validation: I like the clean separation of concerns where the column is responsible for validating and creating SQL insert/update strings. Would adding documentation to the `MediaStore` class about where the validation for various columns happens help with discoverability? I think moving the validation to `ImageStore` would mean a lot more code duplication, because we would be calling `validate` methods in both `ImageStore` and `AudioStore` for all of the URL fields that we have.

I can see that it's unclear that a method called `_create_tsv_row` actually also performs column validation. I wonder if renaming it and adding clearer docstrings would help?
I've been thinking over how best to implement this project for a long time, and here are some thoughts I had about it:

**Constraints**

When working with the upstream database, we have the following constraints:

**Prior work**

Fortunately, all the data collected after the introduction of the ImageStore class is cleaned up. The tags are cleaned up in the `clean_media_metadata` method. Each media property has its own column type in the MediaStore. In the `_create_tsv_row` method, each value is cleaned using the appropriate column type's `prepare_strings` method. For instance, the `URLColumn` uses the `urls.validate_url_string` method to add https, if possible, and to make the links absolute.

There was a cleaner DAG created at CC that would select some rows in the upstream table, clean them up, and replace the upstream data with the cleaned data. This should work if we decide to use it, but I suspect it would take a long time due to the constraints.

**Proposal**

We can update the "cleaner DAG" to make the process of data cleanup faster by changing it in several ways:

**Note**

Due to the columnar structure of parquet files, I also had an idea of a possibly more efficient process: updating specific columns instead of going through the data row by row. For example, we could look only into the URL columns and run the URL cleanup on them, or read the two columns of license and license_version and write the third column with the license_url. However, I learned that you cannot update parquet files, only write new ones. And with 600 000 000 rows, it might be a very expensive TSV file to store.
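The URL cleanup described above (adding https where possible and making links absolute) can be illustrated with a small stand-in. This is a hypothetical re-creation of the described behavior, not the actual `urls.validate_url_string` implementation; the function name and `source_domain` parameter are assumptions:

```python
from urllib.parse import urlparse


def normalize_url(url, source_domain="example.com"):
    """Sketch: upgrade scheme-less or protocol-relative URLs to https
    and turn site-relative paths into absolute links."""
    if url.startswith("//"):  # protocol-relative URL
        return "https:" + url
    if urlparse(url).scheme:  # already has a protocol, leave it alone
        return url
    if url.startswith("/"):  # site-relative path
        return f"https://{source_domain}{url}"
    return "https://" + url  # bare domain, assume https works


print(normalize_url("//img.example.com/a.jpg"))
print(normalize_url("/photos/1", source_domain="museum.example"))
```

The real implementation may also verify that the https upgrade actually resolves, which a sketch like this cannot capture.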
I'm glad to see we have an existing implementation to lean on for going back and updating all records! I would be curious to know if there is any information we could find on how long that workflow would take to run (and if it was ever run at all).
Since we already have the issues and the milestone, are we planning on having a technical implementation plan for this project as well? Perhaps these questions might be better suited there if so, but I'll ask them here too:
- Do you think we'll still be able to run ingestion of new records while the cleanup of existing data is happening? I suspect so since those new records should already be compliant, but I wanted to confirm with your thoughts.
- Similarly, would we want to stop the data refresh process while this is running? Again, I think we could leave it on and this could all happen simultaneously, but it could impact the performance of the data refresh. Given that I don't have a great sense of how long the cleaning could take, I'd hate to pause data refreshes for several weeks/months while that happens.
- With data migrations this large, I worry about what testing them might look like. Would we perhaps want to grab some sample data from each of the older providers for running small-scale tests on top of our unit tests? Could we do a data refresh of that small subset before and after the cleaning has been run on it to verify that none of the ingestion server cleaning steps actually needed to run? Unit tests will be helpful here but if we could also have one or several different data consistency/integration tests as well, that would be fantastic.
Thank you for codifying this all @krysal!
@krysal could you add a due date to the PR description? My recommendation would be two more weeks, at most, but up to you. I just want to make sure reviewers have clear expectations on when they need to comment. Thanks!
@obulat Thank you for sharing your thoughts. Replying to some points:
It should be noted that structural changes to DB tables (adding or modifying columns) are not included here. I raised the flag for those on the Make Openverse Blog, but I'm not planning to include migrations in this project.

That is an interesting idea I hadn't considered. Do you mean the inherited parquet files we have? I'm not aware of which part of the data those files cover, but it would be interesting to see how much it improves the performance of the cleaning process.
Agree 💯
This sounds like the most appealing part of the parquet files. I, admittedly, don't have much experience with columnar storage aside from theoretical knowledge, so we could create an issue to research this option and do some experiments.
Presumably, it is only the data ingested prior to the inclusion of the
We were informed that the DAG probably never ran, so I don't have information on how much time it would take (surely a considerable amount). Regarding whether the ingestion and refresh processes will need to stop, it depends on the strategy the cleaning DAG will use. If updates can be performed on the same table, then I agree it may not need to be stopped, but @obulat's notes suggest it will probably need to do a table switch, similar to the one performed by the ingestion process, so in that case we'd need to stop the DAGs in order not to lose info.

This is the point where it becomes more noticeable that the heavy work of the project will concentrate on optimizing the cleaner DAG. I'd also like to be able to do it in batches, which sounds possible.
Thank you for writing this up! This is a very clear plan.
I'm really interested in getting a better understanding of roughly how many records we actually need to clean, given it should only apply to records ingested before a certain date -- and as I understand it, not reingested thereafter. As @obulat points out, a select for rows with invalid columns would likely be prohibitively time consuming, but it would be amazing if we could find some efficient ways to dramatically reduce the amount of records we need to process 🤔
I'll echo concerns from @AetherUnbound about testing. Should there be an additional technical investigation into optimization strategies for the DAG, and maybe testing concerns? This is another scenario where I wish we had a staging environment for the Catalog 😟
Could we consider having the cleanup DAG take a provider name as configuration, and run cleanup for a single provider at a time as a way to minimize risk?
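A provider-scoped cleanup would still need to walk a very large number of rows, so batching matters. Below is a minimal, pure-Python stand-in for the keyset-pagination idea (resuming each batch after the last identifier seen, mirroring a hypothetical `WHERE identifier > %s ORDER BY identifier LIMIT %s` query); all names here are illustrative assumptions, not the actual DAG code:

```python
def keyset_batches(rows, batch_size):
    """Yield batches of rows (sorted by "identifier"), resuming each
    batch after the last identifier of the previous one. Keyset
    pagination keeps each step cheap, unlike a growing OFFSET."""
    last = ""
    while True:
        batch = [r for r in rows if r["identifier"] > last][:batch_size]
        if not batch:
            return
        yield batch
        last = batch[-1]["identifier"]


rows = [{"identifier": f"id-{i:02d}", "provider": "met"} for i in range(5)]
for batch in keyset_batches(rows, 2):
    print([r["identifier"] for r in batch])
```

In the real DAG, each batch would be fetched, cleaned with the MediaStore steps, and written back in its own transaction, so a failure only loses one batch of progress.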
rfcs/20221209-data-cleaning.md

> # RFC: Cleaning up the upstream database
>
> One of the steps of the [data refresh process for images][data-refresh] is cleaning data that is not fit for production. This process runs weekly as Airflow DAGs; the cleaned data is only saved to the API database (DB), which is replaced when the data refresh finishes, so it needs to be cleaned every time. This cleaning step has lately taken up to 16 hours. Its duration has been stable since the strategy changed to perform the validations and transformations of the data upfront, when pulling from providers rather than when copying the upstream catalog DB into the API DB, but images ingested prior to this change remained untouched.
> images ingested prior to this change remained untouched.
Is this something we can account for efficiently in the cleanup DAG, to reduce the number of images that need to be cleaned?
Because many of our DAGs are not dated, and therefore re-ingest all data on each run (now using the `MediaStore`, even if they did not when originally ingested), does this mean that all records for those providers should be okay? Likewise, for our dated DAGs, do we only need to be concerned about records that have not recently been reingested?

The only other exception I can think of is records which were ingested before the introduction of the MediaStores and then deleted from the provider (which is not caught by our DAGs). I also know that we have some records not associated with provider DAGs, which will definitely need to be cleaned.
Can you confirm that the URLs are replaced with the validated URLs, and that the tags are replaced and not simply added to during re-ingestion, @stacimc? Tags use a JSON Array Column with the following strategy when upserting:
I can't really understand what exactly is happening here :)
URLs (and most other data) are overwritten during ingestion, but you're right @obulat that tags are only added to. Confirmed locally:
- Ingest a record A with tag names "foo" and "bar"
- The external provider updates the record to delete/update tags, so it now has only tag name "FOO"
- Re-ingest record A
- The record in the Catalog will now contain tag names "foo", "bar", and "FOO"
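The observed merge semantics can be simulated in plain Python. This is a sketch of the behavior described above (incoming tags appended, existing ones never removed, exact-match comparison), not the actual SQL upsert:

```python
def upsert_tags(existing, incoming):
    """Mimic the observed upsert: append incoming tags that aren't
    already present by exact name, never delete existing ones, so a
    case change like "FOO" does not replace "foo"."""
    present = {t["name"] for t in existing}
    return existing + [t for t in incoming if t["name"] not in present]


record = [{"name": "foo"}, {"name": "bar"}]
record = upsert_tags(record, [{"name": "FOO"}])  # provider deleted "bar", renamed "foo"
print([t["name"] for t in record])  # ['foo', 'bar', 'FOO']
```

This makes the consequence concrete: stale tags accumulate forever under the current strategy, which is why a replace-on-upsert change would itself require a one-time cleanup pass.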
I do think that feels like an additional issue -- maybe one that should be handled here? @krysal what do you think? For non-dated DAGs, if we end up changing that upsert strategy we'd either need to (a) do something like the data cleanup DAG again, or (b) commit to letting the reingestion workflows slowly update them.
That being said, as far as the steps in the data cleanup go I think the point stands -- for non-dated DAGs, they should be updated during regular ingestion once the additional steps are added to the MediaStore. Does that make sense or do you see other edge cases?
Still the only exception I see is dead records that have been deleted from the original provider but are still in our catalog. It would be nice to get a better picture of how many of those we have.
I think we need at least 3 more weeks considering that the next two weeks will be AFK for many contributors.
(Older) Flickr data often has problems with URLs, and selecting a single provider does not help when it has [checking the latest numbers] 468 372 078 (!) records.
I understand the wish to minimize the amount of work in a single project to make it achievable. My reservation with this is that if we decide not to add or remove columns, we will have to repeat a very costly operation in the future.
Our upstream database is backed up to parquet files regularly, and I thought we could use the backup files for this...
100%, it definitely does not help speed up Flickr. I'm mainly suggesting this in response to concerns about testing -- maybe we can run the cleanup on a single (much, much smaller) provider first.
+1. I think we need that amount of time, given AFK and the amount of open questions about the alternatives. @obulat has suggested using the backup parquet files 😮 It's a really interesting idea, and the point about potentially needing to repeat a costly operation stands. Regarding that approach: my assumption is that ingestion/data refresh would need to be halted during that process, which would be a significant con. Do you think that's the case @obulat, or do you see a way around that?
This seems like a great idea, to test on a tiny provider. It wouldn't be perfect, but we could use that to make estimates about how long larger providers might take. Something that could work similarly is to also have a max number of records to process, so we could try a config with a small limit first.
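For illustration, such a trigger configuration could be read like this. The key names (`provider`, `max_records`) are assumptions for the sketch, not existing DAG parameters:

```python
def cleanup_config(conf):
    """Read optional scoping limits from a DAG trigger configuration
    dict, with permissive defaults (None = no restriction)."""
    max_records = conf.get("max_records")
    return {
        "provider": conf.get("provider"),  # None means all providers
        "max_records": int(max_records) if max_records else None,
    }


print(cleanup_config({"provider": "met", "max_records": "1000"}))
print(cleanup_config({}))  # an unrestricted, full-catalog run
```

Defaulting to "no restriction" keeps the full-catalog run available while letting test runs opt in to a single small provider and a record cap.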
These are great ideas and important concerns raised here. Thank you all for the valuable comments 😄 It seems that this requires some investigation into how many rows need cleaning and more details on the approach the cleaner DAG will take.
Co-authored-by: Madison Swain-Bowden <[email protected]>
Closing this as it has been stale while we discuss the 2023 planning and the "Asynchronous consent decision-making" process for Openverse. The conversation can be reopened or reframed in a new proposal later using the new guidelines. That's not to say I'm letting go of this job. The next steps are:
Description
This outlines the rationale and a rough plan for cleaning up the Catalog's database once and for all. Please leave your comments!
Reviewers
Developer Certificate of Origin