
Remove duplicated tags #1566

Closed
1 task
stacimc opened this issue May 17, 2022 · 15 comments
Assignees
Labels
🛠 goal: fix Bug fix 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 💾 tech: postgres Involves PostgreSQL

Comments

@stacimc
Contributor

stacimc commented May 17, 2022

Problem

Here's an example on the frontend of an image with duplicated tags: https://search-staging.openverse.engineering/image/9fc28cae-ec9b-437a-a960-98f48db2cac8 (coffee, lol, lol).

Description

We should remove duplicated tags as part of the cleaning steps when loading ingested data.

For backfilling: Non-dated DAGs consume all the data on each run and ought to update old records on the next DAG run. For dated DAGs, we may need to write a backfill query.
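The proposed cleaning step could look something like the following order-preserving dedupe. This is a hypothetical sketch, not Openverse code; the name `dedupe_tags` is illustrative.

```python
import json


def dedupe_tags(tags: list[dict]) -> list[dict]:
    """Return tags with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for tag in tags:
        # Serialize with sorted keys so key order doesn't affect equality.
        key = json.dumps(tag, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(tag)
    return unique


tags = [
    {"name": "coffee", "provider": "flickr"},
    {"name": "lol", "provider": "flickr"},
    {"name": "lol", "provider": "flickr"},
]
# The duplicate "lol" tag is dropped; "coffee" and the first "lol" remain.
print(dedupe_tags(tags))
```
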

Implementation

  • 🙋 I would be interested in implementing this feature.
@stacimc stacimc added 🟩 priority: low Low priority and doesn't need to be rushed 🛠 goal: fix Bug fix 💾 tech: postgres Involves PostgreSQL labels May 17, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@AetherUnbound
Collaborator

Here's another example: https://openverse.org/image/9ab8329c-8037-4fd4-8d55-8a52eb0dca90 (with a lot more duplicates).

@obulat
Contributor

obulat commented Feb 8, 2024

@AetherUnbound, the catalog database for the item you shared contains duplicate tags:

"[{""name"": ""1768"", ""provider"": ""flickr""}, {""name"": ""1768"", ""provider"": ""flickr""}, {""name"": ""1908"", ""provider"": ""flickr""}, {""name"": ""1908"", ""provider"": ""flickr""}, {""name"": ""allthingsgold"", ""provider"": ""flickr""}, {""name"": ""allthingsgold"", ""provider"": ""flickr""}, {""name"": ""art"", ""provider"": ""flickr""}, {""name"": ""art"", ""provider"": ""flickr""}, {""name"": ""arthistory"", ""provider"": ""flickr""}, {""name"": ""arthistory"", ""provider"": ""flickr""}, {""name"": ""artist"", ""provider"": ""flickr""}, {""name"": ""artist"", ""provider"": ""flickr""}, {""name"": ""avirtualmuseum"", ""provider"": ""flickr""}, {""name"": ""avirtualmuseum"", ""provider"": ""flickr""}, {""name"": ""cardiff"", ""provider"": ""flickr""}, {""name"": ""cardiff"", ""provider"": ""flickr""}, {""name"": ""claudemonet"", ""provider"": ""flickr""}, {""name"": ""claudemonet"", ""provider"": ""flickr""}, {""name"": ""colection"", ""provider"": ""flickr""}, {""name"": ""colection"", ""provider"": ""flickr""}, {""name"": ""colour"", ""provider"": ""flickr""}, {""name"": ""colour"", ""provider"": ""flickr""}, {""name"": ""frenchimpressionist"", ""provider"": ""flickr""}, {""name"": ""frenchimpressionist"", ""provider"": ""flickr""}, {""name"": ""giorgio"", ""provider"": ""flickr""}, {""name"": ""giorgio"", ""provider"": ""flickr""}, {""name"": ""gold"", ""provider"": ""flickr""}, {""name"": ""gold"", ""provider"": ""flickr""}, {""name"": ""golden"", ""provider"": ""flickr""}, {""name"": ""golden"", ""provider"": ""flickr""}, {""name"": ""goldenlight"", ""provider"": ""flickr""}, {""name"": ""goldenlight"", ""provider"": ""flickr""}, {""name"": ""impressionism"", ""provider"": ""flickr""}, {""name"": ""impressionism"", ""provider"": ""flickr""}, {""name"": ""museum"", ""provider"": ""flickr""}, {""name"": ""museum"", ""provider"": ""flickr""}, {""name"": ""national"", ""provider"": ""flickr""}, {""name"": ""national"", ""provider"": ""flickr""}, 
{""name"": ""paint"", ""provider"": ""flickr""}, {""name"": ""paint"", ""provider"": ""flickr""}, {""name"": ""painter"", ""provider"": ""flickr""}, {""name"": ""painter"", ""provider"": ""flickr""}, {""name"": ""san"", ""provider"": ""flickr""}, {""name"": ""san"", ""provider"": ""flickr""}, {""name"": ""texture"", ""provider"": ""flickr""}, {""name"": ""texture"", ""provider"": ""flickr""}, {""name"": ""venice"", ""provider"": ""flickr""}, {""name"": ""venise"", ""provider"": ""flickr""}, {""name"": ""venise"", ""provider"": ""flickr""}, {""name"": ""w"", ""provider"": ""flickr""}, {""name"": ""w"", ""provider"": ""flickr""}, {""name"": ""wales"", ""provider"": ""flickr""}, {""name"": ""wales"", ""provider"": ""flickr""}]"

This should not happen with items that have been re-ingested recently, because the tags column removes duplicate tags when upserting new values into the table:

TAGS = JSONColumn(
    name="tags", required=False, upsert_strategy=UpsertStrategy.merge_jsonb_arrays
)

The item in question, however, has not been updated since 2020-10-01, and I'm not sure whether the code to remove duplicate tags was run at ingestion at that time.

What would be the best fix for such items in the catalog? Would it be a good idea to run a batched update that selects items that have not been updated since some date in the past, and runs something like SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags) on them?

@AetherUnbound
Collaborator

What would the best fix for such items be in the catalog? Would running a batched update that would select items that have not been updated since some date in the past, and run something like SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags) on them be a good idea?

That sounds like it could work! The only issue is the number of records that might be affected by that update could be significant 🤔 Do you think we could get a count of the records that have duplicates in them before trying to run any batched update? That might give us a sense of how pervasive the issue is.

@obulat
Contributor

obulat commented Feb 9, 2024

Do you think we could get a count of the records that have duplicates in them before trying to run any batched update?

The only way I can think of to get this information is to add a new step to the data cleanup in the ingestion server. But this would make the data refresh run longer.

I wonder how many items have not been updated since 2020. Maybe that set of non-updated items is smaller?

@AetherUnbound
Collaborator

After some fiddling around, it looks like we are able to query for this!!

deploy@localhost:openledger> 
    SELECT *
    FROM (
        SELECT identifier, provider, updated_on,
            tags || '[]'::jsonb AS tags,
            (SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM image
    ) u
    WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) limit 1;
-[ RECORD 1 ]-------------------------
identifier  | 95aeace6-f70d-420c-a08e-84271ba6cad3
provider    | animaldiversity
updated_on  | 2019-02-15 14:47:33.392219+00
tags        | [{"name": "Reproduction", "provider": "animaldiversity"}, {"name": "Body Parts", "provider": "animaldiversity"}, {"name": "Photo", "provider": "animaldiversity"}, {"name": "Live Animal", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Adult/Sexually Mature", "provider": "animaldiversity"}, {"name": "Reproductive Organ", "provider": "animaldiversity"}, {"name": "insect", "accuracy": "0.99904", "provider": "clarifai"}, {"name": "dragonfly", "accuracy": "0.99758", "provider": "clarifai"}, {"name": "nature", "accuracy": "0.99558", "provider": "clarifai"}, {"name": "fly", "accuracy": "0.99301", "provider": "clarifai"}, {"name": "animal", "accuracy": "0.99085", "provider": "clarifai"}, {"name": "wildlife", "accuracy": "0.98856", "provider": "clarifai"}, {"name": "pest", "accuracy": "0.98646", "provider": "clarifai"}, {"name": "invertebrate", "accuracy": "0.9839", "provider": "clarifai"}, {"name": "leaf", "accuracy": "0.97928", "provider": "clarifai"}, {"name": "damselfly", "accuracy": "0.96964", "provider": "clarifai"}, {"name": "outdoors", "accuracy": "0.96327", "provider": "clarifai"}, {"name": "little", "accuracy": "0.9622", "provider": "clarifai"}, {"name": "wing", "accuracy": "0.95252", "provider": "clarifai"}, {"name": "close", "accuracy": "0.95173", "provider": "clarifai"}, {"name": "spider", "accuracy": "0.94385", "provider": "clarifai"}, {"name": "wild", "accuracy": "0.94351", "provider": "clarifai"}, {"name": "closeup", "accuracy": "0.94158", "provider": "clarifai"}, {"name": "garden", "accuracy": "0.93974", "provider": "clarifai"}, {"name": "entomology", "accuracy": "0.93532", "provider": "clarifai"}]
unique_tags | [{"name": "Adult/Sexually Mature", "provider": "animaldiversity"}, {"name": "Body Parts", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Live Animal", "provider": "animaldiversity"}, {"name": "Photo", "provider": "animaldiversity"}, {"name": "Reproduction", "provider": "animaldiversity"}, {"name": "Reproductive Organ", "provider": "animaldiversity"}, {"name": "animal", "accuracy": "0.99085", "provider": "clarifai"}, {"name": "close", "accuracy": "0.95173", "provider": "clarifai"}, {"name": "closeup", "accuracy": "0.94158", "provider": "clarifai"}, {"name": "damselfly", "accuracy": "0.96964", "provider": "clarifai"}, {"name": "dragonfly", "accuracy": "0.99758", "provider": "clarifai"}, {"name": "entomology", "accuracy": "0.93532", "provider": "clarifai"}, {"name": "fly", "accuracy": "0.99301", "provider": "clarifai"}, {"name": "garden", "accuracy": "0.93974", "provider": "clarifai"}, {"name": "insect", "accuracy": "0.99904", "provider": "clarifai"}, {"name": "invertebrate", "accuracy": "0.9839", "provider": "clarifai"}, {"name": "leaf", "accuracy": "0.97928", "provider": "clarifai"}, {"name": "little", "accuracy": "0.9622", "provider": "clarifai"}, {"name": "nature", "accuracy": "0.99558", "provider": "clarifai"}, {"name": "outdoors", "accuracy": "0.96327", "provider": "clarifai"}, {"name": "pest", "accuracy": "0.98646", "provider": "clarifai"}, {"name": "spider", "accuracy": "0.94385", "provider": "clarifai"}, {"name": "wild", "accuracy": "0.94351", "provider": "clarifai"}, {"name": "wildlife", "accuracy": "0.98856", "provider": "clarifai"}, {"name": "wing", "accuracy": "0.95252", "provider": "clarifai"}]
SELECT 1
Time: 0.085s

(Note that the duplicate {'name': 'Female', 'provider': 'animaldiversity'} was removed in unique_tags 🎉)

I'm going to run a count for this to see how many we have 😄
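The SQL predicate above can be modeled in Python to see why the length comparison works: a row has duplicates exactly when the raw tag array is longer than its deduplicated counterpart. This is an illustrative sketch, not Openverse code.

```python
import json


def has_duplicate_tags(tags: list[dict]) -> bool:
    """Mirror jsonb_array_length(tags) > jsonb_array_length(unique_tags)."""
    # jsonb_agg(DISTINCT x) keeps one copy of each distinct jsonb value;
    # a set of canonical serializations plays the same role here.
    unique = {json.dumps(tag, sort_keys=True) for tag in tags}
    return len(tags) > len(unique)


tags = [
    {"name": "Female", "provider": "animaldiversity"},
    {"name": "Female", "provider": "animaldiversity"},
    {"name": "Photo", "provider": "animaldiversity"},
]
print(has_duplicate_tags(tags))  # True: the "Female" tag appears twice
```
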

@AetherUnbound
Collaborator

@stacimc was able to run this for me:

deploy@localhost:openledger>
    SELECT count(*)
    FROM (
        SELECT identifier, provider, updated_on,
            tags || '[]'::jsonb AS tags,
            (SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM image
    ) u
    WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags);
+--------+
| count  |
|--------|
| 206799 |
+--------+
SELECT 1
Time: 7422.500s (2 hours 3 minutes 42 seconds), executed in: 7422.493s (2 hours 3 minutes 42 seconds)

This feels like something we can correct for just those records! What do you think @obulat?

@krysal krysal added this to the Data normalization milestone Feb 20, 2024
@krysal
Member

krysal commented Mar 6, 2024

Before reading the replies, I added this to the DN milestone and thought it had more to do with modifying the MediaStore class to avoid duplicates. However, it seems this is more targeted at deleting the currently saved duplicates, and that can be done using the batched update DAG 😄 Is that correct, @AetherUnbound? It would be pretty cool to leverage the power of SQL 🔥

Should we create a new issue to filter out possible duplicates from provider scripts? I don't understand why raw_tags is a list of strings or dictionaries there.

raw_tags: List of strings or dictionaries

@AetherUnbound
Collaborator

I don't know that I have the full context on raw_tags and why that took the structure it did, though I suspect it might be a dictionary because we occasionally had other values in there, like source and accuracy. Might be relevant for #431!

That said though, if we could 1) ensure that the MediaStore class is removing duplicate tags from the provider before saving and 2) remove the existing duplicates using the batched update DAG, I think that's a great path forward!

@krysal
Member

krysal commented Mar 14, 2024

@AetherUnbound Glad we're on the same page! I created an issue for the code changes in the catalog: #3926.

Do any of you want to take charge of this, since you all already discussed it and saw a potentially quick solution? 😄 CC @obulat @stacimc

@krysal
Member

krysal commented Apr 4, 2024

Preparing for once the code in the catalog is updated, I see two options to solve this:

  1. Run the UPDATE query manually for both tables.
-- Remove tags duplicates from the audio table
WITH tags_temp AS (
	SELECT identifier, tags || '[]'::jsonb AS tags,
		(SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
	FROM audio
)
UPDATE audio SET tags = tags_temp.unique_tags FROM tags_temp
WHERE tags_temp.identifier = audio.identifier
	AND jsonb_array_length(tags_temp.tags) > jsonb_array_length(tags_temp.unique_tags);

You can try this locally, as there are some duplicates in the sample data.

-- Get the number of duplicates
SELECT COUNT(*)
    FROM (
        SELECT identifier, provider, updated_on,
            tags || '[]'::jsonb AS tags,
            (SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
        FROM audio
    ) u
    WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags);

-- and/or see one specifically, e.g.:
SELECT jsonb_array_elements(tags) FROM audio WHERE identifier='209937e6-5a7f-4c31-bdc2-c6e39c2b7cb3';

It seems feasible, especially for the smaller audio table; even with the number of duplicates shown for images, it could run in a reasonable time.

  2. Modify the batched_update DAG to include an optional WITH statement for more complex DB operations. I tried this quickly as well and made it work, but it does not have the same flexibility as building the query manually, and the way the DAG turns out feels tricky (it needs to remove some input validations). Check my branch, improv/batched_update_with, and run the DAG with the following configuration to see that it has the same effect (make sure to recreate the catalog data if you ran the SQL before, so the duplicates are filled in again).
{
    "query_id": "with_test",
    "table_name": "audio",
    "with_query": "WITH wt AS (SELECT identifier AS wt_identifier, tags || '[]'::jsonb AS current_tags, (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM audio)",
    "select_query": "JOIN wt ON identifier = wt_identifier WHERE jsonb_array_length(wt.current_tags) > jsonb_array_length(wt.unique_tags)",
    "update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
    "batch_size": 100,
    "update_timeout": 3600,
    "resume_update": false,
    "dry_run": false
}
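The batch_size knob above controls how many rows each UPDATE transaction touches. A minimal sketch of that idea, with hypothetical names rather than the DAG's actual code:

```python
def batched(ids: list[str], batch_size: int):
    """Yield fixed-size slices of the affected identifiers."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Ten affected rows, updated four at a time instead of in one giant UPDATE.
affected = [f"id-{n}" for n in range(10)]
batches = list(batched(affected, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Smaller batches keep individual transactions (and lock durations) short at the cost of more round trips, which matters for a table the size of image.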

So folks, how would you prefer to solve this? Do you see any other alternative here?

@stacimc
Contributor Author

stacimc commented Apr 4, 2024

It's definitely nice to run the update through a DAG so we have a little more visibility, although not strictly necessary. I don't see the branch you mentioned so I haven't looked at your updates to the batched_update, but FWIW I believe the existing DAG can be made to work without changes using this select_query:

    WHERE identifier IN (
        SELECT identifier FROM (
            SELECT 
                identifier,
                tags || '[]'::jsonb AS tags,
                (select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
            FROM image
        ) u
        WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags)
    );

It's a bit more nested than I'd like but that worked for me locally to update the 362 images with duplicate tags in about 14 seconds. When I ran just the SELECT in production for image it took a little over 2 hours, so I hope that shouldn't be too terrible either 🤞

Either way we'll want to wait to run this until after confirming with @sarayourfriend that the catalog deployments are complete and DAGs can be unpaused.

@krysal
Member

krysal commented Apr 5, 2024

@stacimc That's awesome! Better if we don't require modifications to the DAG, glad that you found a query that works! I'll mark this as blocked until the other work is finished.

@krysal krysal added the ⛔ status: blocked Blocked & therefore, not ready for work label Apr 5, 2024
@krysal krysal self-assigned this Apr 11, 2024
@krysal krysal removed the ⛔ status: blocked Blocked & therefore, not ready for work label Apr 11, 2024
@krysal
Member

krysal commented Apr 11, 2024

Started the DagRun for audio with the following configuration that worked locally for me:

{
    "query_id": "rm_duplicate_tags_audio",
    "table_name": "audio",
    "select_query": "WHERE identifier IN (     SELECT identifier FROM (         SELECT identifier, tags || '[]'::jsonb AS tags, (select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM audio     ) u WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) )",
    "update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
    "batch_size": 10000,
    "update_timeout": 3600,
    "resume_update": false,
    "dry_run": false
}

@krysal
Member

krysal commented Apr 12, 2024

It was super quick given there were only 106 rows with duplicate tags in the audio table, so I kicked off the run for the image table. The config:

{
    "query_id": "rm_duplicate_tags_image",
    "table_name": "image",
    "select_query": "WHERE identifier IN (     SELECT identifier FROM (         SELECT identifier, tags || '[]'::jsonb AS tags, (select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM image     ) u WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) )",
    "update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
    "batch_size": 10000,
    "update_timeout": 3600,
    "resume_update": false,
    "dry_run": false
}

@krysal
Member

krysal commented Apr 12, 2024

This is done ✅
