Data normalization #430

obulat · 2023-02-18T05:23:18Z

Start Date	Project Lead	Actual Ship Date
2023-09-01	@krysal	TBD

Description

This project aims to save the cleaned data of the Data Refresh process and remove those steps from said process to save time.

Documents

Implementation plan: Catalog Data Cleaning #3848

Milestone / Issues

Prior Art

Future work - Phase Two

Normalize data models #244

Prerequisites

krysal · 2024-03-06T16:37:04Z

The implementation plan is up for discussion at #3848. Writing it helped me confirm where we were starting from and define a scope for the project, while indicating what could be done in a second phase, as suggested in the initial post. I hope others find it helpful too.

After its approval, the milestone should be complemented with some issues:

Modify Ingestion Server to upload TSV files to AWS S3 and save fixed tags
Check cleanup steps times of the Ingestion Server after running the batched update from files DAG.

krysal · 2024-04-03T15:02:23Z

Since the last update, the IP has been approved, and work has started on fixing duplicated tags. This has been delayed a bit by differences over the proposed solution, but once the modification to the catalog is solved (#3926), we can delete the current duplicates in the upstream DB (#1566) and continue with the rest of the milestone (#23).

openverse-bot · 2024-04-18T00:21:51Z

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

krysal · 2024-04-24T23:09:29Z

Done

Remove duplicated tags #1566
Backfill license_url field for images where it's null in the meta_data #3885 – The code part is done, the DAG was triggered and is currently running.

In progress

Save cleaned data of Ingestion Server to AWS S3 #4163

Added

Remove and de-duplicate tags with leading/trailing whitespace #4199 – @AetherUnbound recently told me about this. The tags will need extra processing before a definitive win over duplicates is called.
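
For readers following along, the whitespace problem described in #4199 can be illustrated with a minimal sketch. It assumes each tag is a dict with `name` and `provider` keys and that two tags are duplicates when both match after trimming; the function name and the tag shape are illustrative assumptions, not the code that was merged.

```python
def trim_and_deduplicate_tags(tags):
    """Strip surrounding whitespace from tag names and drop duplicates.

    Assumption: a tag is a dict with "name" and "provider" keys, and two
    tags are duplicates when both values match after trimming.
    """
    seen = set()
    cleaned = []
    for tag in tags:
        name = tag.get("name", "").strip()
        if not name:
            # A tag that is only whitespace carries no information.
            continue
        key = (name, tag.get("provider"))
        if key in seen:
            continue
        seen.add(key)
        # Keep the first occurrence, with the trimmed name.
        cleaned.append({**tag, "name": name})
    return cleaned


example = [
    {"name": "cat", "provider": "flickr"},
    {"name": " cat ", "provider": "flickr"},
    {"name": "cat", "provider": "rekognition"},
]
print(trim_and_deduplicate_tags(example))
```

The point of the sketch is the ordering: trimming has to happen before the duplicate check, otherwise "cat" and " cat " survive as two distinct tags.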

openverse-bot · 2024-05-09T00:22:10Z

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

krysal · 2024-05-13T15:01:30Z

Done

In progress

Previous merged PRs should solve #3912. I'm waiting for a run of the image data refresh to confirm we save and have the files, which is currently stopped/blocked on #4315 but that should be resolved between today and tomorrow. So I'm hoping the process is resumed soon and we can have the files this week.

To do

In the meantime, I can work on the next step:

Use the batched_update DAG with stored CSVs to update Catalog URLs #3415

krysal · 2024-05-24T22:17:08Z

An image data refresh in production couldn't finish with the changes from #4163, so we added more logging (#4358), rolled back the prod ingestion server, and decided to perform the cleanup process on the dev environment. An attempt with a data refresh limit resurrected an old problem (#736, #4381), which we already have a fix for: #4382. After merging #4382 on Monday, we must deploy the dev ingestion server and trigger the image data refresh to continue debugging.

The add_license_url DAG also ran into timeouts and was refactored. The PR is pending review:

Warn on license_url computation in the API #4198

openverse-bot · 2024-06-08T00:23:56Z

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

krysal · 2024-06-14T01:43:22Z

Done

The project was restructured in how the updates are expected to be executed (mainly using the batched_update DAG or a similar process), and its scope was reduced since we no longer need to remove tags (this came out of the discussion in Implementation Plan: Augment the catalog database with suitable Rekognition tags #4417).
The add_license_url DAG ran successfully twice with the latest update (Modify add_license_url DAG to use batched_update #4370), although strangely, a group of rows is regressing and losing the value. The issue Some recently updated images are missing license_url in the meta_data field #4318 was created to track this problem.

In progress

We'll still use the TSV files with fixed URLs from the Ingestion Server, so the PR to upload them to S3 is up and ready for review: Remove single quotes in values of Ingestion Server's TSV files #4471
@sarayourfriend created the configuration necessary to fix Remove and de-duplicate tags with leading/trailing whitespace #4199 in a one-off DAG that, with the last touch of Fix trim and deduplicate tags deduplication #4473, should be ready to run.
Decode and deduplicate tags in the catalog with a TargetedReingestionDAG #4452 is under discussion and being worked on.

To do

I'm working on Use the batched_update DAG with stored CSVs to update Catalog URLs #3415. I did some manual tests and managed to load a table from the S3 files directly in staging, finding the extension for importing data into an RDS for PostgreSQL extremely useful. It can save us many headaches with networking, local file managing, and potential disk space issues. The second part of the task is actually performing the updates. I'm looking into whether batch updates can be parallelized (as done on the ingestion server).
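
To make the two-part task above concrete (load the cleaned values into a table, then apply them in batches), here is a rough, self-contained sketch. It is not the batched_update DAG: SQLite stands in for PostgreSQL, and the table names `image` and `cleaned_urls`, the column names, and the function name are all assumptions made for illustration.

```python
import sqlite3


def apply_cleaned_urls(conn, batch_size):
    """Copy URLs from cleaned_urls into image, one batch of rows at a time.

    Assumption: both tables share an "identifier" key. Committing after
    every batch keeps each transaction small.
    """
    total = conn.execute("SELECT COUNT(*) FROM cleaned_urls").fetchone()[0]
    updated = 0
    for offset in range(0, total, batch_size):
        cursor = conn.execute(
            """
            UPDATE image
            SET url = (
                SELECT c.url FROM cleaned_urls c
                WHERE c.identifier = image.identifier
            )
            WHERE identifier IN (
                SELECT identifier FROM cleaned_urls
                ORDER BY identifier LIMIT ? OFFSET ?
            )
            """,
            (batch_size, offset),
        )
        conn.commit()
        updated += cursor.rowcount
    return updated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (identifier TEXT PRIMARY KEY, url TEXT)")
conn.execute("CREATE TABLE cleaned_urls (identifier TEXT PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [("a", "example.com/a"), ("b", "example.com/b"), ("c", "https://example.com/c")],
)
# Stand-in for the table loaded from the cleaned TSV files.
conn.executemany(
    "INSERT INTO cleaned_urls VALUES (?, ?)",
    [("a", "https://example.com/a"), ("b", "https://example.com/b")],
)
updated = apply_cleaned_urls(conn, batch_size=1)
print(updated)
```

Because each batch is an independent statement over a disjoint set of identifiers, batches like these are the natural unit to hand to parallel workers, which is the open question raised in the comment.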

sarayourfriend · 2024-06-14T03:05:14Z

I did some manual tests and managed to load a table from the S3 files directly in staging, finding the extension for importing data into an RDS for PostgreSQL extremely useful.

For what it's worth @krysal, you can definitely test that locally, rather than needing to use a live environment. We use the extension already for iNaturalist, so there are examples in the codebase of how to do it (including with support for local files for testing and development). Check this one out, for example:

openverse/catalog/dags/providers/provider_csv_load_scripts/inaturalist/observations.sql

Line 28 in 697f62f

SELECT aws_s3.table_import_from_s3('inaturalist.observations',

krysal · 2024-06-14T13:58:41Z

@sarayourfriend I did not think of iNaturalist as a reference here, and the relationship had not been mentioned until now. That's good to know! I thought of testing in the staging DB first because, from the documentation, I understood the extension is specifically for an Amazon RDS Postgres instance, so it's excellent to know it also works for a local Postgres instance. Thank you!

krysal · 2024-06-28T21:43:57Z

Done

Fix placing test S3 data into MinIO #4495. Unanticipated. Required to simulate working with S3 locally in the catalog.
Upload Ingestion Server's TSV files to AWS S3 #3912
- Required additionally: Remove single quotes in values of Ingestion Server's TSV files #4471
- The changes were deployed live today, so next week, we should start getting freshly cleaned values directly into S3.
Remove duplicated tags #1566

In progress

Decode and deduplicate tags in the catalog with a TargetedReingestionDAG #4452
- Partially solved by Add DAG to decode and deduplicate image tags with escaped literal unicode sequences #4475
Use the batched_update DAG with stored CSVs to update Catalog URLs #3415. I couldn't work much on this while resolving other issues, but now (hopefully) I'll be able to focus on it.
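
As an illustration of what "escaped literal unicode sequences" means in #4475 and #4452, the sketch below decodes tag names where a backslash-u escape was stored as plain text, then removes the duplicates that decoding creates. The function names are hypothetical, tags are simplified to plain strings, and surrogate pairs are deliberately not handled; this is a sketch of the idea, not the DAG's implementation.

```python
import re

# Matches a literal backslash, "u", and four hex digits stored as text.
ESCAPED_SEQUENCE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_escaped_unicode(name):
    """Replace each literal \\uXXXX sequence with the character it encodes."""
    return ESCAPED_SEQUENCE.sub(lambda m: chr(int(m.group(1), 16)), name)


def decode_and_deduplicate(names):
    """Decode every tag name, then keep only the first copy of each result."""
    seen = set()
    result = []
    for name in names:
        decoded = decode_escaped_unicode(name)
        if decoded not in seen:
            seen.add(decoded)
            result.append(decoded)
    return result


print(decode_and_deduplicate(["caf\\u00e9", "caf\u00e9", "tea"]))
```

Decoding and deduplication have to go together: once the escaped form is decoded it can collide with an already-correct copy of the same tag, which is why the issue covers both steps.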

krysal · 2024-07-12T22:53:44Z

Add DAG to decode and deduplicate image tags with escaped literal unicode sequences #4475
- It wasn't possible to run the DAG, so the approach for Decode and deduplicate tags in the catalog with a TargetedReingestionDAG #4452 must be changed.
Add catalog_cleaner DAG #4610
- Created and awaiting review.

This week, maintainers were off from Openverse work, so the tasks will resume next week.

krysal · 2024-07-26T22:58:07Z

The catalog_cleaner DAG ran successfully for the programmed fields, so it enables #1411 and #700 for next week after the data refresh, provided the process doesn't produce more files with changes :)

Besides that, what remains to do is #4452.

openverse-bot · 2024-08-10T00:24:41Z

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

krysal · 2024-08-16T22:22:30Z

Done

Remove the URL cleanup process from the ingestion server #700. The Ingestion Server was deployed on Wednesday, so next week we will check the time gained with another data refresh run.
Unify data refresh/provider cleaning #1663

To Do

Decode and deduplicate tags in the catalog with a TargetedReingestionDAG #4452. This issue has become more complex than initially planned. It was blocked by Change tag upsert strategy to drop old provider tags #4732; now that this has been resolved, it can be resumed, and @sarayourfriend expressed interest in continuing to work on it.
Verify the next image data refresh runs successfully.

openverse-bot · 2024-08-31T00:25:42Z

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

zackkrida · 2024-09-05T20:19:55Z

@WordPress/openverse-maintainers last week @krysal and I discussed the idea of sunsetting this project, with #4452 extracted out as a standalone issue to be worked on later this year.

In hindsight, this project was defined with two goals that were a bit less clear than we initially thought:

The catalog database (upstream) contains the cleaned data outputs of the current Ingestion Server's cleaning steps

The image Data Refresh process is simplified by reducing significantly cleaning times.

The first goal, in particular, is very open to interpretation and changes over time. Our data will never be perfect; does that mean we need to incorporate every new cleanup action into the scope of this work? That seems untenable.

The goal to remove the cleanup step from the data refresh has been met; I propose we close this project and move on.

If anyone objects: please share. Otherwise, I'll ask @krysal to move the project to success and close this issue next week.

sarayourfriend · 2024-09-05T21:19:24Z

I agree. The first goal actually is clear (in my reading), in that it specifies the "outputs of the current Ingestion Server's cleaning steps". I think, rather, we've let the scope get away from that boundary of the ingestion server cleaning steps, into a total "data cleaning" project.

AetherUnbound · 2024-09-06T16:27:58Z

Definitely okay closing this out based on that - the rest of the data cleaning issues that come up we can prioritize alongside other work!

zackkrida · 2024-09-11T15:13:32Z

This project has been closed and moved to success.

obulat added the 🧭 project: thread An issue used to track a project and its progress label Feb 18, 2023

obulat assigned krysal Feb 18, 2023

github-project-automation bot added this to Openverse Project Tracker Feb 18, 2023

github-project-automation bot moved this to Not Started in Openverse Project Tracker Feb 18, 2023

obulat moved this from Not Started to On Hold in Openverse Project Tracker Feb 18, 2023

AetherUnbound moved this from ⏸ On Hold to 🚧 In Progress in Openverse Project Tracker Jan 3, 2024

krysal added this to the Data normalization milestone Feb 5, 2024

AetherUnbound moved this from 🚧 In Progress to 💬 In RFC in Openverse Project Tracker Feb 7, 2024

AetherUnbound moved this from 💬 In RFC to 🚀 In Kickoff in Openverse Project Tracker Feb 7, 2024

krysal modified the milestones: Data normalization II, Data normalization Feb 20, 2024

krysal mentioned this issue Feb 29, 2024

Implementation plan: Catalog Data Cleaning #3848

Merged

krysal moved this from 🚀 In Kickoff to 💬 In RFC in Openverse Project Tracker Mar 6, 2024

krysal mentioned this issue Mar 13, 2024

Upload Ingestion Server's TSV files to AWS S3 #3912

Closed

obulat moved this from 💬 In RFC to 🚧 In Progress in Openverse Project Tracker Mar 19, 2024

AetherUnbound mentioned this issue Apr 24, 2024

Remove and de-duplicate tags with leading/trailing whitespace #4199

Closed

sarayourfriend mentioned this issue Jun 3, 2024

Implementation Plan: Augment the catalog database with suitable Rekognition tags #4417

Merged

2 tasks

This was referenced Jun 5, 2024

Incorporate Rekognition data into the catalog #431

Open

Document current & desired ETL steps and data flow #4455

Closed

Update ingestion server removal IP to include plan for filtering tags #4456

Closed

sarayourfriend mentioned this issue Sep 1, 2024

Use a persistent cache for tldextract across CI runs #3716

Closed

zackkrida moved this from 🚧 In Progress to ✅ Success in Openverse Project Tracker Sep 11, 2024

zackkrida closed this as completed Sep 11, 2024

github-project-automation bot moved this from ✅ Success to 🚢 Shipped in Openverse Project Tracker Sep 11, 2024
