Add threshold image similarity scores #526
Conversation
```scala
val defaultMinScore: Double = similarityMetric match {
  case SimilarityMetric.Blended  => 300
  case SimilarityMetric.Features => 300
  case SimilarityMetric.Colors   => 20
}
```
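For context, here is a minimal, self-contained sketch of how these per-metric defaults could be selected and applied as a result filter. The sealed-trait definition, `ScoredImage`, and the `aboveThreshold` step are illustrative assumptions, not the actual catalogue-api code:

```scala
// Illustrative sketch only; SimilarityMetric, ScoredImage and aboveThreshold are
// assumptions for this example, not the real catalogue-api types.
sealed trait SimilarityMetric
object SimilarityMetric {
  case object Blended  extends SimilarityMetric
  case object Features extends SimilarityMetric
  case object Colors   extends SimilarityMetric
}

final case class ScoredImage(id: String, score: Double)

// Per-metric defaults, matching the values in the diff above.
def defaultMinScore(similarityMetric: SimilarityMetric): Double =
  similarityMetric match {
    case SimilarityMetric.Blended  => 300
    case SimilarityMetric.Features => 300
    case SimilarityMetric.Colors   => 20
  }

// Drop any hit that scores below the threshold for the chosen metric.
def aboveThreshold(hits: Seq[ScoredImage], metric: SimilarityMetric): Seq[ScoredImage] =
  hits.filter(_.score >= defaultMinScore(metric))
```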
Can you explain how you came up with these default scores? Why is Colors so much lower?
Sure. There's a notebook in the data science repo which produced the results in wellcomecollection/platform#5581. Based on that analysis, we decided that ~300 seemed like an appropriate threshold for the blended similarity metric.

I re-ran that analysis with the `state.inferredData.lshEncodedFeatures` and `state.inferredData.palette` fields individually, and produced corresponding graphs for each.

If my mental maths is right, the scores for colours are generally lower because the `state.inferredData.palette` field contains fewer, more commonly occurring terms.
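A rough way to see that intuition, assuming the similarity score behaves roughly like a sum of per-term contributions in which rarer terms earn more (an assumption for illustration, not the real Elasticsearch scoring): a field with fewer, more common terms both contributes fewer terms to the sum and earns less per term, so its scores cap out much lower.

```scala
// Toy illustration (assumption): score ≈ sum of per-term contributions,
// where rarer terms contribute more. Numbers are made up for illustration.
val featureTerms = Seq.fill(256)(1.2) // many, rarer LSH feature terms
val paletteTerms = Seq.fill(25)(0.8)  // fewer, more common palette terms

val featureScore = featureTerms.sum   // ≈ 307, in the ballpark of the 300 threshold
val paletteScore = paletteTerms.sum   // = 20, in the ballpark of the Colors threshold
```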
Ah, super. Maybe just include a comment pointing to that ticket, so we can find this again in future?
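Something like this would do it (a sketch of the suggestion; the placement and wording of the comment are up to the author):

```scala
val defaultMinScore: Double = similarityMetric match {
  // Thresholds chosen from the analysis in wellcomecollection/platform#5581
  case SimilarityMetric.Blended  => 300
  case SimilarityMetric.Features => 300
  case SimilarityMetric.Colors   => 20
}
```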
This PR sets a threshold similarity score for images so that we can avoid displaying questionable matches. Closes #516
This change messes with some of our existing tests - the dummy index populated while setting up the tests can't produce the same scores as the fully populated prod index, and the thresholds therefore filter out any results which might have been matched. In some tests I've been able to manually set the `minScore` to 0, but in others which call the API itself, that's not possible. I feel like these things might be more appropriately tested in `rank`, where tests can be run against a full index and scoring can be properly examined.
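As a concrete sketch of that workaround (names and the default of 300 are illustrative, not the actual API or test code): the dummy index produces tiny scores, so a test disables the threshold by passing `minScore = 0` explicitly.

```scala
// Illustrative sketch, not the actual API or test code.
final case class ScoredImage(id: String, score: Double)

def similarImages(hits: Seq[ScoredImage], minScore: Double = 300): Seq[ScoredImage] =
  hits.filter(_.score >= minScore)

// Against a sparsely populated test index the scores are tiny, so the default
// threshold filters everything out; setting minScore = 0 keeps the matches
// visible so the rest of the behaviour can still be asserted on.
val dummyHits = Seq(ScoredImage("a", 2.5), ScoredImage("b", 1.1))
assert(similarImages(dummyHits).isEmpty)
assert(similarImages(dummyHits, minScore = 0).size == 2)
```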