Some recently updated images are missing license_url in the meta_data field #4318
Comments
On Friday I started another run of the
Exploring which rows are involved, this is the distribution by provider and source. Flickr accounts for most of them.
openledger=> SELECT provider, source, COUNT(*) FROM image
WHERE meta_data->>'license_url' IS NULL
AND NOT (license = 'pdm' AND license_version = '4.0')
GROUP BY 1, 2;
provider | source | count
-----------------+-----------------+---------
clevelandmuseum | clevelandmuseum | 3
flickr | bio_diversity | 369
flickr | flickr | 4914694
flickr | nasa | 166
met | met | 4500
(5 rows)
Querying the Cleveland Museum rows I confirmed the
-- Rows from clevelandmuseum
identifier | license | license_version | created_on | updated_on | removed_from_source
--------------------------------------+---------+-----------------+------------------------------+-------------------------------+---------------------
4ea23a17-5528-4e22-ba4f-d51bdfd1515a | cc0 | 1.0 | 2019-01-08 18:24:53.33183+00 | 2024-05-28 22:07:06.613201+00 | t
253bdd81-64a7-458d-bdb5-a1c4e0027c1f | cc0 | 1.0 | 2019-01-08 18:24:53.33183+00 | 2024-05-28 22:06:28.443214+00 | t
64300d6b-85f7-48ee-a84b-9870aeaaf568 | cc0 | 1.0 | 2019-08-12 16:00:40.10698+00 | 2024-05-28 22:06:28.443214+00 | t
Description
On 2024-05-08 UTC the batched_update DAG was triggered [1] to fill the license_url in the meta_data field with its corresponding value for rows WHERE license = 'by' AND license_version = '2.0', and it reported a successful end on 2024-05-09, 17:00:18 UTC, updating 746,571 records. However, a run of the add_license_url DAG triggered on 2024-05-10 reported the same number of rows still missing the license URL, which indicates that some workflows may not be filling this field or are overwriting it. Flickr is confirmed to be in the set of rows missing this value.
Whether other providers are affected is still to be confirmed. The Flickr DAG is known to have been running on those days, as were the DAGs for Europeana, the Finnish Museum, Wikimedia Commons, and the Metropolitan Museum.
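To make the "not filling or overwriting" hypothesis concrete, here is a minimal Python sketch. It is not the project's code: the helper names (license_url, fill_license_url, overwrite_meta, merge_meta) and the sample pub_date key are hypothetical, and the URL pattern is the public Creative Commons one. It only illustrates how a later write that replaces meta_data wholesale would drop a previously filled license_url, while a merging write would keep it.

```python
from typing import Optional

CC_BASE = "https://creativecommons.org"


def license_url(license_slug: str, version: str) -> str:
    """Build the Creative Commons URL for a license/version pair."""
    if license_slug == "cc0":
        return f"{CC_BASE}/publicdomain/zero/{version}/"
    if license_slug == "pdm":
        return f"{CC_BASE}/publicdomain/mark/{version}/"
    return f"{CC_BASE}/licenses/{license_slug}/{version}/"


def fill_license_url(meta_data: Optional[dict], license_slug: str, version: str) -> dict:
    """Return a copy of meta_data with license_url set when it is missing."""
    filled = dict(meta_data or {})
    if not filled.get("license_url"):
        filled["license_url"] = license_url(license_slug, version)
    return filled


def overwrite_meta(existing: Optional[dict], incoming: Optional[dict]) -> dict:
    """Hypothetical ingestion write that replaces meta_data entirely."""
    return dict(incoming or {})


def merge_meta(existing: Optional[dict], incoming: Optional[dict]) -> dict:
    """Hypothetical ingestion write that keeps keys absent from incoming."""
    return {**(existing or {}), **(incoming or {})}


# A row after the backfill, then re-ingested without a license_url.
backfilled = fill_license_url({"pub_date": "2019-01-08"}, "by", "2.0")
reingested = {"pub_date": "2024-05-09"}

lost = overwrite_meta(backfilled, reingested)  # license_url is gone
kept = merge_meta(backfilled, reingested)      # license_url survives
```

If the affected workflows behave like overwrite_meta, rows touched after the backfill would reappear in the count of rows missing license_url, which would match the unchanged number reported above. This is an assumption to verify against the actual ingestion code, not a diagnosis.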
Screenshot of DAG reports on Thursday, May 9th. Time is in VET.
Additional context
Discovered while working on #3885.
Footnotes
[1] Link only available to maintainers.