Backfill license_url field for images where it's null in the meta_data #3885

Closed
krysal opened this issue Mar 7, 2024 · 2 comments · Fixed by #4124
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon python Pull requests that update Python code 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow

Comments

krysal commented Mar 7, 2024

Problem

As a requirement for #703, we must fill in the meta_data.license_url field for images that lack it. At the time of writing, that is roughly 97.8 million images in the upstream DB.

SELECT COUNT(identifier) FROM image WHERE meta_data->>'license_url' IS NULL;
+----------+
| count    |
|----------|
| 97775299 |
+----------+
SELECT 1
Time: 2732.486s (45 minutes 32 seconds), executed in: 2732.465s (45 minutes 32 seconds)

Description

This could potentially be solved by a one-off DAG that computes the license URL from the license and license_version fields. @obulat previously wrote a DAG that fills in one of the cases where the license URL is null; see #1005.
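As a rough illustration of the idea (not the DAG that was eventually merged), the sketch below derives a URL from the two fields and prepares parameters for a per-license update. The names `build_license_url`, `update_params`, and `UPDATE_SQL` are hypothetical. It assumes `meta_data` is a `jsonb` column that is either SQL NULL or a JSON object, that `license` holds lowercase Creative Commons slugs such as `by` or `cc0`, and it ignores ported (jurisdiction-specific) licenses.

```python
from typing import Optional

CC_BASE = "https://creativecommons.org"

# Public-domain tools live under /publicdomain/ and only exist as version 1.0.
PUBLIC_DOMAIN_PATHS = {
    "cc0": "publicdomain/zero/1.0/",
    "pdm": "publicdomain/mark/1.0/",
}

# Hypothetical statement; assumes meta_data is jsonb (NULL or an object).
# Updating one (license, license_version) pair at a time keeps each
# statement narrower than a single pass over the whole table.
UPDATE_SQL = """
UPDATE image
SET meta_data = jsonb_set(
    COALESCE(meta_data, '{}'::jsonb), '{license_url}', to_jsonb(%(url)s::text)
)
WHERE license = %(license)s
  AND license_version = %(version)s
  AND meta_data->>'license_url' IS NULL;
"""


def build_license_url(license_slug: str, license_version: Optional[str]) -> Optional[str]:
    """Return the canonical CC URL, or None if the inputs are insufficient."""
    slug = (license_slug or "").strip().lower()
    if slug in PUBLIC_DOMAIN_PATHS:
        return f"{CC_BASE}/{PUBLIC_DOMAIN_PATHS[slug]}"
    version = (license_version or "").strip()
    if not slug or not version:
        # Without a version there is no unambiguous URL to compute.
        return None
    return f"{CC_BASE}/licenses/{slug}/{version}/"


def update_params(license_slug: str, license_version: str) -> Optional[dict]:
    """Build the parameter dict for UPDATE_SQL, or None to skip the pair."""
    url = build_license_url(license_slug, license_version)
    if url is None:
        return None
    return {"url": url, "license": license_slug, "version": license_version}


if __name__ == "__main__":
    print(build_license_url("by-sa", "4.0"))
    print(update_params("cc0", "1.0"))
```

With a PostgreSQL driver that supports named `%(name)s` parameters, each dict returned by `update_params` could be passed to `cursor.execute(UPDATE_SQL, params)`. Rows whose license or version cannot be mapped are skipped rather than guessed.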

Additional context

See previous work at #1565.

@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🔧 tech: airflow Involves Apache Airflow 🧱 stack: catalog Related to the catalog and Airflow DAGs python Pull requests that update Python code 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Mar 7, 2024
@krysal krysal self-assigned this Mar 7, 2024

krysal commented May 13, 2024

Reopening this since it's still being worked out.


krysal commented May 29, 2024

The backfill DAG was triggered yesterday at 14:59:16 UTC and finished successfully after ~8 hours, so this is complete. However, we will keep the DAG for some time to see whether #4318 keeps recurring.
