Check crawled images have the correct URI protocol #1411

Closed
krysal opened this issue Oct 7, 2022 · 4 comments
Labels: 📄 aspect: text (Concerns the textual material in the repository); 🧰 goal: internal improvement (Improvement that benefits maintainers, not users); 🟨 priority: medium (Not blocking but should be addressed soon); 🧱 stack: catalog (Related to the catalog and Airflow DAGs)

Comments

@krysal
Member

krysal commented Oct 7, 2022

Description

Old images ingested through the Common Crawl process have URLs that don't use the recommended HTTPS protocol, or have no protocol at all, even though HTTPS is available in some cases. This is currently fixed in a step of the reingestion process, but we want to remove that step going forward, so we need to ensure the data is fixed in the upstream database.

Cleanup in the Ingestion server

https://github.com/WordPress/openverse-api/blob/429fd45916c9e064ccea772afc184466304bce4e/ingestion_server/ingestion_server/cleanup.py#L72-L92

https://github.com/WordPress/openverse-api/blob/429fd45916c9e064ccea772afc184466304bce4e/ingestion_server/ingestion_server/cleanup.py#L149-L158
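For readers who don't want to open the links, here is a minimal sketch of the kind of fix described above: adding a protocol to a URL that has none. It is not the code from `cleanup.py`. The function name `add_missing_protocol` and the `tls_supported` flag are hypothetical, and the sketch assumes the caller already knows whether the host serves HTTPS.

```python
def add_missing_protocol(url, tls_supported):
    """Return `url` with an explicit protocol (hypothetical helper).

    URLs that already start with http:// or https:// are returned unchanged.
    `tls_supported` is an assumed, caller-supplied flag saying whether the
    host is known to serve HTTPS.
    """
    if not url:
        # Leave NULL/empty values alone; there is nothing to fix.
        return url
    url = url.strip()
    if url.lower().startswith(("http://", "https://")):
        return url
    # Protocol-relative URLs ("//host/path") only need the scheme itself.
    if url.startswith("//"):
        url = url[2:]
    scheme = "https" if tls_supported else "http"
    return f"{scheme}://{url}"
```

For example, `add_missing_protocol("example.com/a.jpg", True)` returns `"https://example.com/a.jpg"`. Upgrading existing `http://` URLs to `https://` is deliberately left out of this sketch.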

Benefit

See #1663.

krysal added the 🟧 priority: high, 📄 aspect: text, 🧰 goal: internal improvement, and data normalization labels on Oct 7, 2022
obulat added the 🧱 stack: catalog label on Feb 23, 2023
obulat added the 🟨 priority: medium label and removed the 🟧 priority: high label on Mar 28, 2023
github-project-automation bot moved this to 📋 Backlog in Openverse Backlog on Apr 17, 2023
obulat transferred this issue from WordPress/openverse-catalog on Apr 17, 2023
dhruvkb added this to the Data normalization milestone on Dec 2, 2023
AetherUnbound moved this from 📋 Backlog to 📅 To Do in Openverse Backlog on Apr 1, 2024
@AetherUnbound
Collaborator

@krysal it seems like this will be handled by #4163 and the subsequent batched update using the output TSVs, is that correct? Is there any work that needs to be done on this explicitly?

@AetherUnbound
Collaborator

I ran some queries just so we could get some perspective on the affected records:

openledger=> select count(*) from image where url not ilike 'http%';
 count
-------
 22157
(1 row)


openledger=> select count(*) from image where creator_url not ilike 'http%';
  count
---------
 9035590
(1 row)


openledger=> select count(*) from image where foreign_landing_url not ilike 'http%';
  count
---------
 8809342
(1 row)

@krysal
Member Author

krysal commented May 24, 2024

> @krysal it seems like this will be handled by #4163 and the subsequent batched update using the output TSVs, is that correct? Is there any work that needs to be done on this explicitly?

That's correct. This is a verification task to do after processing the files of cleaned data. We should wait for several ingestion workflows and data refreshes to run, then verify that new images have the fields filled out correctly.

AetherUnbound added the ⛔ status: blocked label on May 24, 2024
openverse-bot moved this from 📅 To Do to ⛔ Blocked in Openverse Backlog on May 24, 2024
krysal removed the ⛔ status: blocked label on Jul 31, 2024
openverse-bot moved this from ⛔ Blocked to 📋 Backlog in Openverse Backlog on Jul 31, 2024
krysal moved this from Todo to In Progress in Openverse Data Normalization on Jul 31, 2024
krysal moved this from 📋 Backlog to 🏗 In Progress in Openverse Backlog on Jul 31, 2024
krysal self-assigned this on Aug 8, 2024
@krysal
Member Author

krysal commented Aug 8, 2024

The last DAG run for the image data refresh didn't produce any new files, and I also checked by querying upstream:

openledger=> SELECT COUNT(*) FROM image WHERE creator_url NOT LIKE 'https://%' AND creator_url NOT LIKE 'http://%';
 count
-------
     0
(1 row)

openledger=> SELECT COUNT(*) FROM image WHERE foreign_landing_url NOT LIKE 'https://%' AND foreign_landing_url NOT LIKE 'http://%';
 count
-------
     0
(1 row)

openledger=> SELECT COUNT(*) FROM image WHERE url NOT LIKE 'https://%' AND url NOT LIKE 'http://%';
 count
-------
     0
(1 row)

This is done.
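If it helps anyone repeat the check outside of psql (for instance against rows read from an exported TSV), the predicate in the three queries above can be sketched in Python as follows. This is only an illustration: the function name `count_missing_protocol` is hypothetical, and rows are assumed to be dicts keyed by column name. `None` values are skipped, which matches SQL, where `NULL NOT LIKE ...` evaluates to NULL and the row is not counted.

```python
URL_COLUMNS = ("url", "creator_url", "foreign_landing_url")
PREFIXES = ("http://", "https://")


def count_missing_protocol(rows):
    """Count, per column, values that start with neither http:// nor https://.

    `rows` is assumed to be an iterable of dicts keyed by column name.
    The comparison is case-sensitive, like LIKE in PostgreSQL.
    """
    counts = {column: 0 for column in URL_COLUMNS}
    for row in rows:
        for column in URL_COLUMNS:
            value = row.get(column)
            # None mirrors SQL NULL, which the NOT LIKE filter never matches.
            if value is not None and not value.startswith(PREFIXES):
                counts[column] += 1
    return counts
```

A result of zero for every column corresponds to the three `0` counts shown in the queries.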

krysal closed this as completed on Aug 8, 2024
github-project-automation bot moved this from In Progress to Done in Openverse Data Normalization on Aug 8, 2024
openverse-bot moved this from 🏗 In Progress to ✅ Done in Openverse Backlog on Aug 8, 2024