Image rows with invalid version number for PDM #4696

Closed
krysal opened this issue Aug 1, 2024 · 2 comments
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@krysal
Member

krysal commented Aug 1, 2024

Description

From the add_license_url DAG run, 43 images were found with an invalid license, "PDM 4.0" (version four specifically doesn't exist).

The following *invalid license(s)* were found and will be skipped:
╭────┬───────────┬───────────┬─────────╮
│    │ license   │   version │   count │
├────┼───────────┼───────────┼─────────┤
│  0 │ pdm       │       4.0 │      43 │
╰────┴───────────┴───────────┴─────────╯
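A summary like the one above can be approximated by grouping rows on license and version and keeping the pairs that are not known to be valid. The following is only a rough sketch, not the DAG's actual code: it uses an in-memory SQLite table as a stand-in for the catalog, and `VALID_PAIRS` and `invalid_license_counts` are hypothetical names, with the valid set limited to the one pair this issue confirms.

```python
import sqlite3

# Assumption for illustration: only the pair confirmed in this issue is listed.
VALID_PAIRS = {("pdm", "1.0")}


def invalid_license_counts(conn):
    """Return (license, version, count) rows whose pair is not in VALID_PAIRS."""
    rows = conn.execute(
        "SELECT license, license_version, COUNT(*) FROM image "
        "GROUP BY license, license_version "
        "ORDER BY license, license_version"
    ).fetchall()
    return [row for row in rows if (row[0], row[1]) not in VALID_PAIRS]


# Tiny stand-in table with made-up identifiers; the real table has many more columns.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE image (identifier TEXT, license TEXT, license_version TEXT)"
)
demo.executemany(
    "INSERT INTO image VALUES (?, ?, ?)",
    [("a", "pdm", "4.0"), ("b", "pdm", "4.0"), ("c", "pdm", "1.0")],
)
print(invalid_license_counts(demo))
```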
Details of affected rows

These are the identifiers of the affected rows, wrapped in a query to get more information.

Extracted with this query:

SELECT identifier FROM image WHERE license = 'pdm' AND license_version = '4.0';
SELECT * FROM image WHERE identifier IN (
'394789b1-59ed-477c-8cab-2c12c44f8881',
'c2cd1c63-c1c6-4ae3-86ba-a559878e5c29',
'a565409e-7b34-4729-9a03-ef7a2bbedf69',
'4a3477a5-88fe-4182-8075-1331fb025bdf',
'508e0900-3893-48af-a2a8-ca7f349a1bd0',
'a8581655-211d-4468-96e9-e91dda27ec16',
'02728167-46d9-44e4-9666-55c0f0b97bef',
'5a51a5e7-e511-4886-b410-42f659130aac',
'6965802b-a93a-453d-88e1-e4c414f8cd38',
'2b37a989-f5c1-4f9a-9c5f-07c64b7f455b',
'5c5d6856-6d7f-47df-9490-96c5900e016d',
'cf8d179d-0944-42e3-bf11-488765b22bd2',
'0a6f1a62-abd5-49e0-8f93-5c068c0aa6fa',
'162b0827-2825-43d2-ac64-c7e8575411a0',
'4414f2c9-0dcb-47f8-9998-b2a34e8b7863',
'9e95075a-2aaa-4683-aae2-d8d20f1343fb',
'32959d92-9141-4810-8e79-85fccf848027',
'f59aa983-1c54-4a61-8985-6a4e264b1c31',
'521123e9-1a5c-42d8-a7d8-464e263b84d1',
'8888f5c8-4ad1-488e-8cad-bbd1cc04334a',
'57a87e1d-61a6-4df5-accc-0cae8339ab1b',
'5492c666-cbd3-4bde-996b-47c9d4abeb62',
'ad00fa52-7a62-49b3-840a-cf09f1eebf61',
'b74a4327-f92f-4df0-bb56-660a39ac7fe4',
'f8f33710-2e7a-44c4-9cb6-1777b2086904',
'6221eebc-3e45-4c22-9ef8-c1f9ea766ade',
'4f506b2a-1d86-4760-a6ff-fcecb5f3491c',
'b2e6be7f-bf6f-4ccc-a1dd-725e7c42e9d1',
'9aba2139-d1f3-47d8-9a58-e3fe18006ff5',
'c6080bb3-b314-4e66-b413-9e5132837b28',
'fa1eee5c-8149-4646-89b2-c97dd9d04fe1',
'5c3180cd-4ab8-496e-98bc-55296e946deb',
'9a40cf3e-6f95-466b-8561-a977820197f2',
'd9fc460a-4200-4ccb-a663-852fc58b9538',
'be98bf10-7166-4f16-9e1b-98ed0d1e57ed',
'e344e6ee-41a4-4823-8679-aa8d9ce4ba36',
'dfe52b22-24c5-483b-88a5-6655af0bb1ec',
'b341be5d-021b-4724-b5a2-80de20badb5d',
'27b7e367-52e1-44df-9ffc-53103431d019',
'94a730f3-6b1e-467a-b0e4-c36e7c8616a0',
'b8b87ea5-4e09-487f-acd8-b73c6116f8c0',
'3262db1c-b3fa-4ee9-9c14-84c1ce5c23c6',
'60e346e8-e492-47e8-b5a9-353bc3bd289e'
);

On analysis of these items, there was some initial confusion about whether they might be under CC0 instead, given the "License information" shown on their landing pages (see, for example, https://digitaltmuseum.org/021176009794/arko-1957). It was determined that they are, in fact, under PDM. From @AetherUnbound:

In the HTML source for this page, there's a metadata block (set to app.data), and all the references to license within it refer to PDM (in a few locations)

"licenses":[{"description":"Public Domain-m\u00e4rke (PDM)","label":"","link":null}]
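As a quick way to repeat that check, the fragment quoted above can be parsed and its license descriptions inspected. This is a minimal sketch that assumes the fragment has already been extracted from the page's metadata block and wrapped in braces so it parses as a JSON object; `license_descriptions` is a hypothetical helper name.

```python
import json

# The fragment quoted above, wrapped in braces to form a valid JSON object.
fragment = (
    r'{"licenses":[{"description":"Public Domain-m\u00e4rke (PDM)",'
    r'"label":"","link":null}]}'
)


def license_descriptions(raw):
    """Return the description of every entry under the "licenses" key."""
    data = json.loads(raw)
    return [entry.get("description", "") for entry in data.get("licenses", [])]


descriptions = license_descriptions(fragment)
# True only if there is at least one entry and every entry mentions PDM.
mentions_pdm = bool(descriptions) and all("PDM" in d for d in descriptions)
print(descriptions, mentions_pdm)
```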

Proposed solution

To fix this, we'll change the version to "1.0", like the other existing marks in the database. Note that in the query below, the last two conditions of the WHERE clause (on license and license_version) are not strictly necessary but add a safety check.

UPDATE image SET license_version = '1.0', updated_on = NOW()
WHERE identifier IN (<list copied above>) AND license = 'pdm' AND license_version = '4.0';
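One way to make the safety check stricter is to run the update in a transaction and commit only if the number of affected rows matches the expected count (43 for this issue). The sketch below illustrates the idea against an in-memory SQLite table, not the catalog database: `fix_pdm_version` is a hypothetical helper, `CURRENT_TIMESTAMP` stands in for `NOW()`, and the demo uses two identifiers from the list above plus one made-up control row.

```python
import sqlite3


def fix_pdm_version(conn, identifiers, expected):
    """Set license_version to '1.0' for the given rows; roll back on a count mismatch."""
    placeholders = ",".join("?" for _ in identifiers)
    sql = (
        "UPDATE image SET license_version = '1.0', updated_on = CURRENT_TIMESTAMP "
        f"WHERE identifier IN ({placeholders}) "
        "AND license = 'pdm' AND license_version = '4.0'"
    )
    try:
        cur = conn.execute(sql, list(identifiers))
        if cur.rowcount != expected:
            raise RuntimeError(f"expected {expected} rows, updated {cur.rowcount}")
        conn.commit()
    except Exception:
        conn.rollback()  # leave the table untouched on any failure
        raise
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image (identifier TEXT PRIMARY KEY, license TEXT, "
    "license_version TEXT, updated_on TEXT)"
)
conn.executemany(
    "INSERT INTO image VALUES (?, ?, ?, NULL)",
    [
        ("394789b1-59ed-477c-8cab-2c12c44f8881", "pdm", "4.0"),
        ("c2cd1c63-c1c6-4ae3-86ba-a559878e5c29", "pdm", "4.0"),
        ("00000000-0000-0000-0000-000000000000", "pdm", "1.0"),  # made-up control row
    ],
)
conn.commit()

ids = [
    "394789b1-59ed-477c-8cab-2c12c44f8881",
    "c2cd1c63-c1c6-4ae3-86ba-a559878e5c29",
]
updated = fix_pdm_version(conn, ids, expected=2)
print(updated)
```

Running the helper a second time with the same arguments raises, because no rows with version "4.0" remain, which is the same protection the extra WHERE conditions give against touching rows that were already fixed.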

Environment

This is in the catalog database.

Additional context

Related to #3885 and #4318.

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 🧱 stack: catalog Related to the catalog and Airflow DAGs 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Aug 1, 2024
@krysal krysal self-assigned this Aug 1, 2024
@AetherUnbound
Collaborator

The query makes sense! Do we also need to update the license_url field in meta_data as part of this query, or will we rerun add_license_url after it?

@krysal
Member Author

krysal commented Aug 1, 2024

I'll rerun the DAG. Thanks for double-checking with me!

Edit: applied. This is done.

@krysal krysal closed this as completed Aug 1, 2024