-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check crawled images have the correct URI protocol #1411
Comments
I ran some queries just so we could get some perspective on the affected records: openledger=> select count(*) from image where url not ilike 'http%';
count
-------
22157
(1 row)
openledger=> select count(*) from image where creator_url not ilike 'http%';
count
---------
9035590
(1 row)
openledger=> select count(*) from image where foreign_landing_url not ilike 'http%';
count
---------
8809342
(1 row) |
That's correct. It is a verification task after processing the files of cleaned data. We should wait for several ingestion workflows, and data refreshes to run to verify that new images have the fields filled out correctly. |
The last DAG run for the image data refresh didn't produce new files and I checked querying upstream too: openledger=> SELECT COUNT(*) FROM image WHERE creator_url NOT LIKE 'https://%' AND creator_url NOT LIKE 'http://%';
count
-------
0
(1 row)
openledger=> SELECT COUNT(*) FROM image WHERE foreign_landing_url NOT LIKE 'https://%' AND foreign_landing_url NOT LIKE 'http://%';
count
-------
0
(1 row)
openledger=> SELECT COUNT(*) FROM image WHERE url NOT LIKE 'https://%' AND url NOT LIKE 'http://%';
count
-------
0
(1 row) This is done. |
Description
Old images from the ingestion through the common crawl process don't follow the recommended HTTPS protocol (or any at all) which is available in some cases. This is fixed in a step of the reingestion process but we want to delete this step moving forward, so we need to ensure this data is fixed in the upstream database.
Cleanup in the Ingestion server
https://github.com/WordPress/openverse-api/blob/429fd45916c9e064ccea772afc184466304bce4e/ingestion_server/ingestion_server/cleanup.py#L72-L92
https://github.com/WordPress/openverse-api/blob/429fd45916c9e064ccea772afc184466304bce4e/ingestion_server/ingestion_server/cleanup.py#L149-L158
Benefit
See #1663.
The text was updated successfully, but these errors were encountered: