
Remove duplicated tags #1566

Closed
1 task
stacimc opened this issue May 17, 2022 · 15 comments
Assignees
Labels
🛠 goal: fix Bug fix 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 💾 tech: postgres Involves PostgreSQL

Comments

@stacimc
Contributor

stacimc commented May 17, 2022

Problem

Here's an example on the frontend of an image with duplicated tags: https://search-staging.openverse.engineering/image/9fc28cae-ec9b-437a-a960-98f48db2cac8 (coffee, lol, lol).

Description

We should remove duplicated tags as part of the cleaning steps when loading ingested data.

For backfilling: Non-dated DAGs consume all the data on each run and ought to update old records on the next DAG run. For dated DAGs, we may need to write a backfill query.
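The proposed cleaning step could look something like the following order-preserving dedupe. This is a hypothetical sketch, not Openverse code; the name `dedupe_tags` is illustrative.

```python
import json


def dedupe_tags(tags: list[dict]) -> list[dict]:
    """Return tags with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for tag in tags:
        # Serialize with sorted keys so key order doesn't affect equality.
        key = json.dumps(tag, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(tag)
    return unique


tags = [
    {"name": "coffee", "provider": "flickr"},
    {"name": "lol", "provider": "flickr"},
    {"name": "lol", "provider": "flickr"},
]
# The duplicate "lol" tag is dropped; "coffee" and the first "lol" remain.
print(dedupe_tags(tags))
```
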

Implementation

  • 🙋 I would be interested in implementing this feature.
@stacimc stacimc added 🟩 priority: low Low priority and doesn't need to be rushed 🛠 goal: fix Bug fix 💾 tech: postgres Involves PostgreSQL labels May 17, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@AetherUnbound
Collaborator

Here's another example: https://openverse.org/image/9ab8329c-8037-4fd4-8d55-8a52eb0dca90 (with a lot more duplicates).

@obulat
Contributor

obulat commented Feb 8, 2024

@AetherUnbound, the catalog database for the item you shared contains duplicate tags:

"[{""name"": ""1768"", ""provider"": ""flickr""}, {""name"": ""1768"", ""provider"": ""flickr""}, {""name"": ""1908"", ""provider"": ""flickr""}, {""name"": ""1908"", ""provider"": ""flickr""}, {""name"": ""allthingsgold"", ""provider"": ""flickr""}, {""name"": ""allthingsgold"", ""provider"": ""flickr""}, {""name"": ""art"", ""provider"": ""flickr""}, {""name"": ""art"", ""provider"": ""flickr""}, {""name"": ""arthistory"", ""provider"": ""flickr""}, {""name"": ""arthistory"", ""provider"": ""flickr""}, {""name"": ""artist"", ""provider"": ""flickr""}, {""name"": ""artist"", ""provider"": ""flickr""}, {""name"": ""avirtualmuseum"", ""provider"": ""flickr""}, {""name"": ""avirtualmuseum"", ""provider"": ""flickr""}, {""name"": ""cardiff"", ""provider"": ""flickr""}, {""name"": ""cardiff"", ""provider"": ""flickr""}, {""name"": ""claudemonet"", ""provider"": ""flickr""}, {""name"": ""claudemonet"", ""provider"": ""flickr""}, {""name"": ""colection"", ""provider"": ""flickr""}, {""name"": ""colection"", ""provider"": ""flickr""}, {""name"": ""colour"", ""provider"": ""flickr""}, {""name"": ""colour"", ""provider"": ""flickr""}, {""name"": ""frenchimpressionist"", ""provider"": ""flickr""}, {""name"": ""frenchimpressionist"", ""provider"": ""flickr""}, {""name"": ""giorgio"", ""provider"": ""flickr""}, {""name"": ""giorgio"", ""provider"": ""flickr""}, {""name"": ""gold"", ""provider"": ""flickr""}, {""name"": ""gold"", ""provider"": ""flickr""}, {""name"": ""golden"", ""provider"": ""flickr""}, {""name"": ""golden"", ""provider"": ""flickr""}, {""name"": ""goldenlight"", ""provider"": ""flickr""}, {""name"": ""goldenlight"", ""provider"": ""flickr""}, {""name"": ""impressionism"", ""provider"": ""flickr""}, {""name"": ""impressionism"", ""provider"": ""flickr""}, {""name"": ""museum"", ""provider"": ""flickr""}, {""name"": ""museum"", ""provider"": ""flickr""}, {""name"": ""national"", ""provider"": ""flickr""}, {""name"": ""national"", ""provider"": ""flickr""}, 
{""name"": ""paint"", ""provider"": ""flickr""}, {""name"": ""paint"", ""provider"": ""flickr""}, {""name"": ""painter"", ""provider"": ""flickr""}, {""name"": ""painter"", ""provider"": ""flickr""}, {""name"": ""san"", ""provider"": ""flickr""}, {""name"": ""san"", ""provider"": ""flickr""}, {""name"": ""texture"", ""provider"": ""flickr""}, {""name"": ""texture"", ""provider"": ""flickr""}, {""name"": ""venice"", ""provider"": ""flickr""}, {""name"": ""venise"", ""provider"": ""flickr""}, {""name"": ""venise"", ""provider"": ""flickr""}, {""name"": ""w"", ""provider"": ""flickr""}, {""name"": ""w"", ""provider"": ""flickr""}, {""name"": ""wales"", ""provider"": ""flickr""}, {""name"": ""wales"", ""provider"": ""flickr""}]"

This should not happen with items that have been re-ingested recently, because the tags column removes duplicate tags when upserting new values into the table:

TAGS = JSONColumn(
    name="tags", required=False, upsert_strategy=UpsertStrategy.merge_jsonb_arrays
)

The item in question, however, has not been updated since 2020-10-01, and I'm not sure whether the code to remove duplicate tags was run at ingestion at that time.

What would be the best fix for such items in the catalog? Would it be a good idea to run a batched update that selects items that have not been updated since some date in the past, and runs something like SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags) on them?

@AetherUnbound
Collaborator

What would the best fix for such items be in the catalog? Would running a batched update that would select items that have not been updated since some date in the past, and run something like SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags) on them be a good idea?

That sounds like it could work! The only issue is the number of records that might be affected by that update could be significant 🤔 Do you think we could get a count of the records that have duplicates in them before trying to run any batched update? That might give us a sense of how pervasive the issue is.

@obulat
Contributor

obulat commented Feb 9, 2024

Do you think we could get a count of the records that have duplicates in them before trying to run any batched update?

The only way I can think of to get this information is to add a new step to the data cleanup in the ingestion server. But this would make the data refresh run longer.

I wonder how many items have not been updated since 2020. Maybe that set of non-updated items is smaller?

@AetherUnbound
Collaborator

After some fiddling around, it looks like we are able to query for this!!

deploy@localhost:openledger> 
    SELECT *
    FROM (
        SELECT identifier, provider, updated_on,
            tags || '[]'::jsonb AS tags,
            (SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM image
    ) u
    WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) limit 1;
-[ RECORD 1 ]-------------------------
identifier  | 95aeace6-f70d-420c-a08e-84271ba6cad3
provider    | animaldiversity
updated_on  | 2019-02-15 14:47:33.392219+00
tags        | [{"name": "Reproduction", "provider": "animaldiversity"}, {"name": "Body Parts", "provider": "animaldiversity"}, {"name": "Photo", "provider": "animaldiversity"}, {"name": "Live Animal", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Adult/Sexually Mature", "provider": "animaldiversity"}, {"name": "Reproductive Organ", "provider": "animaldiversity"}, {"name": "insect", "accuracy": "0.99904", "provider": "clarifai"}, {"name": "dragonfly", "accuracy": "0.99758", "provider": "clarifai"}, {"name": "nature", "accuracy": "0.99558", "provider": "clarifai"}, {"name": "fly", "accuracy": "0.99301", "provider": "clarifai"}, {"name": "animal", "accuracy": "0.99085", "provider": "clarifai"}, {"name": "wildlife", "accuracy": "0.98856", "provider": "clarifai"}, {"name": "pest", "accuracy": "0.98646", "provider": "clarifai"}, {"name": "invertebrate", "accuracy": "0.9839", "provider": "clarifai"}, {"name": "leaf", "accuracy": "0.97928", "provider": "clarifai"}, {"name": "damselfly", "accuracy": "0.96964", "provider": "clarifai"}, {"name": "outdoors", "accuracy": "0.96327", "provider": "clarifai"}, {"name": "little", "accuracy": "0.9622", "provider": "clarifai"}, {"name": "wing", "accuracy": "0.95252", "provider": "clarifai"}, {"name": "close", "accuracy": "0.95173", "provider": "clarifai"}, {"name": "spider", "accuracy": "0.94385", "provider": "clarifai"}, {"name": "wild", "accuracy": "0.94351", "provider": "clarifai"}, {"name": "closeup", "accuracy": "0.94158", "provider": "clarifai"}, {"name": "garden", "accuracy": "0.93974", "provider": "clarifai"}, {"name": "entomology", "accuracy": "0.93532", "provider": "clarifai"}]
unique_tags | [{"name": "Adult/Sexually Mature", "provider": "animaldiversity"}, {"name": "Body Parts", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Live Animal", "provider": "animaldiversity"}, {"name": "Photo", "provider": "animaldiversity"}, {"name": "Reproduction", "provider": "animaldiversity"}, {"name": "Reproductive Organ", "provider": "animaldiversity"}, {"name": "animal", "accuracy": "0.99085", "provider": "clarifai"}, {"name": "close", "accuracy": "0.95173", "provider": "clarifai"}, {"name": "closeup", "accuracy": "0.94158", "provider": "clarifai"}, {"name": "damselfly", "accuracy": "0.96964", "provider": "clarifai"}, {"name": "dragonfly", "accuracy": "0.99758", "provider": "clarifai"}, {"name": "entomology", "accuracy": "0.93532", "provider": "clarifai"}, {"name": "fly", "accuracy": "0.99301", "provider": "clarifai"}, {"name": "garden", "accuracy": "0.93974", "provider": "clarifai"}, {"name": "insect", "accuracy": "0.99904", "provider": "clarifai"}, {"name": "invertebrate", "accuracy": "0.9839", "provider": "clarifai"}, {"name": "leaf", "accuracy": "0.97928", "provider": "clarifai"}, {"name": "little", "accuracy": "0.9622", "provider": "clarifai"}, {"name": "nature", "accuracy": "0.99558", "provider": "clarifai"}, {"name": "outdoors", "accuracy": "0.96327", "provider": "clarifai"}, {"name": "pest", "accuracy": "0.98646", "provider": "clarifai"}, {"name": "spider", "accuracy": "0.94385", "provider": "clarifai"}, {"name": "wild", "accuracy": "0.94351", "provider": "clarifai"}, {"name": "wildlife", "accuracy": "0.98856", "provider": "clarifai"}, {"name": "wing", "accuracy": "0.95252", "provider": "clarifai"}]
SELECT 1
Time: 0.085s

(Note that the duplicate {'name': 'Female', 'provider': 'animaldiversity'} was removed in unique_tags 🎉)

I'm going to run a count for this to see how many we have 😄
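The SQL predicate above can be modeled in Python to see why the length comparison works: a row has duplicates exactly when the raw tag array is longer than its deduplicated counterpart. This is an illustrative sketch, not Openverse code.

```python
import json


def has_duplicate_tags(tags: list[dict]) -> bool:
    """Mirror jsonb_array_length(tags) > jsonb_array_length(unique_tags)."""
    # jsonb_agg(DISTINCT x) keeps one copy of each distinct jsonb value;
    # a set of canonical serializations plays the same role here.
    unique = {json.dumps(tag, sort_keys=True) for tag in tags}
    return len(tags) > len(unique)


tags = [
    {"name": "Female", "provider": "animaldiversity"},
    {"name": "Female", "provider": "animaldiversity"},
    {"name": "Photo", "provider": "animaldiversity"},
]
print(has_duplicate_tags(tags))  # True: the "Female" tag appears twice
```
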

@AetherUnbound
Collaborator

@stacimc was able to run this for me:

deploy@localhost:openledger>
    SELECT count(*)
    FROM (
        SELECT identifier, provider, updated_on,
            tags || '[]'::jsonb AS tags,
            (SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM image
    ) u
    WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags);
+--------+
| count  |
|--------|
| 206799 |
+--------+
SELECT 1
Time: 7422.500s (2 hours 3 minutes 42 seconds), executed in: 7422.493s (2 hours 3 minutes 42 seconds)

This feels like something we can correct for just those records! What do you think @obulat?

@krysal krysal added this to the Data normalization milestone Feb 20, 2024
@krysal
Member

krysal commented Mar 6, 2024

Before reading the replies, I added this to the DN milestone and thought it had more to do with modifying the MediaStore class to avoid duplicates. However, it seems this is more targeted at deleting the currently saved duplicates, and that can be done using the batched update DAG 😄 Is that correct, @AetherUnbound? It would be pretty cool to leverage the power of SQL 🔥

Should we create a new issue to filter out possible duplicates from provider scripts? I don't understand why raw_tags is a list of strings or dictionaries there.

raw_tags: List of strings or dictionaries

@AetherUnbound
Collaborator

I don't know that I have the full context on raw_tags and why that took the structure it did, though I suspect it might be a dictionary because we occasionally had other values in there, like source and accuracy. Might be relevant for #431!

That said though, if we could 1) ensure that the MediaStore class is removing duplicate tags from the provider before saving and 2) remove the existing duplicates using the batched update DAG, I think that's a great path forward!

@krysal
Member

krysal commented Mar 14, 2024

@AetherUnbound Glad we're on the same page! I created an issue for the code changes in the catalog: #3926.

Do any of you want to take charge of this, since you all already discussed it and saw a potentially quick solution? 😄 CC @obulat @stacimc

@krysal
Member

krysal commented Apr 4, 2024

Preparing for once the code in the catalog is updated, I see two options to solve this:

  1. Run the UPDATE query manually for both tables.
-- Remove tags duplicates from the audio table
WITH tags_temp AS (
	SELECT identifier, tags || '[]'::jsonb AS tags,
		(SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
	FROM audio
)
UPDATE audio SET tags = tags_temp.unique_tags FROM tags_temp
WHERE tags_temp.identifier = audio.identifier
	AND jsonb_array_length(tags_temp.tags) > jsonb_array_length(tags_temp.unique_tags);

You can try this locally, as there are some duplicates in the sample data.

-- Get the number of duplicates
SELECT COUNT(*)
    FROM (
        SELECT identifier, provider, updated_on,
            tags || '[]'::jsonb AS tags,
            (SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
        FROM audio
    ) u
    WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags);

-- and/or see one specifically, e.g.:
SELECT jsonb_array_elements(tags) FROM audio WHERE identifier='209937e6-5a7f-4c31-bdc2-c6e39c2b7cb3';

It seems feasible, especially for the smaller audio table; even with the number of duplicates shown for images, it could run in a reasonable time.

  2. Modify the batched_update DAG to include an optional WITH statement for more complex DB operations. I tried this quickly as well and made it work, but it does not have the same flexibility as building the query manually, and the way the DAG turns out feels tricky (it needs to remove some input validations). Check my branch, improv/batched_update_with, and run the DAG with the following configuration to see that it has the same effect (make sure to recreate the catalog data if you ran the SQL before, so the duplicates are filled in again).
{
    "query_id": "with_test",
    "table_name": "audio",
    "with_query": "WITH wt AS (SELECT identifier AS wt_identifier, tags || '[]'::jsonb AS current_tags, (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM audio)",
    "select_query": "JOIN wt ON identifier = wt_identifier WHERE jsonb_array_length(wt.current_tags) > jsonb_array_length(wt.unique_tags)",
    "update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
    "batch_size": 100,
    "update_timeout": 3600,
    "resume_update": false,
    "dry_run": false
}
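The batch_size knob above controls how many rows each UPDATE transaction touches. A minimal sketch of that idea, with hypothetical names rather than the DAG's actual code:

```python
def batched(ids: list[str], batch_size: int):
    """Yield fixed-size slices of the affected identifiers."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Ten affected rows, updated four at a time instead of in one giant UPDATE.
affected = [f"id-{n}" for n in range(10)]
batches = list(batched(affected, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Smaller batches keep individual transactions (and lock durations) short at the cost of more round trips, which matters for a table the size of image.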

So folks, how would you prefer to solve this? Do you see any other alternative here?

@stacimc
Contributor Author

stacimc commented Apr 4, 2024

It's definitely nice to run the update through a DAG so we have a little more visibility, although not strictly necessary. I don't see the branch you mentioned so I haven't looked at your updates to the batched_update, but FWIW I believe the existing DAG can be made to work without changes using this select_query:

    WHERE identifier IN (
        SELECT identifier FROM (
            SELECT 
                identifier,
                tags || '[]'::jsonb AS tags,
                (select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
            FROM image
        ) u
        WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags)
    );

It's a bit more nested than I'd like but that worked for me locally to update the 362 images with duplicate tags in about 14 seconds. When I ran just the SELECT in production for image it took a little over 2 hours, so I hope that shouldn't be too terrible either 🤞

Either way we'll want to wait to run this until after confirming with @sarayourfriend that the catalog deployments are complete and DAGs can be unpaused.

@krysal
Member

krysal commented Apr 5, 2024

@stacimc That's awesome! Better if we don't require modifications to the DAG, glad that you found a query that works! I'll mark this as blocked until the other work is finished.

@krysal krysal added the ⛔ status: blocked Blocked & therefore, not ready for work label Apr 5, 2024
@krysal krysal self-assigned this Apr 11, 2024
@krysal krysal removed the ⛔ status: blocked Blocked & therefore, not ready for work label Apr 11, 2024
@krysal
Member

krysal commented Apr 11, 2024

Started the DagRun for audio with the following configuration that worked locally for me:

{
    "query_id": "rm_duplicate_tags_audio",
    "table_name": "audio",
    "select_query": "WHERE identifier IN (     SELECT identifier FROM (         SELECT identifier, tags || '[]'::jsonb AS tags, (select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM audio     ) u WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) )",
    "update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
    "batch_size": 10000,
    "update_timeout": 3600,
    "resume_update": false,
    "dry_run": false
}

@krysal
Member

krysal commented Apr 12, 2024

It was super quick given there were only 106 rows with duplicate tags in the audio table, so I kicked off the run for the image table. The config:

{
    "query_id": "rm_duplicate_tags_image",
    "table_name": "image",
    "select_query": "WHERE identifier IN (     SELECT identifier FROM (         SELECT identifier, tags || '[]'::jsonb AS tags, (select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM image     ) u WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) )",
    "update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
    "batch_size": 10000,
    "update_timeout": 3600,
    "resume_update": false,
    "dry_run": false
}

@krysal
Member

krysal commented Apr 12, 2024

This is done ✅
