Add threshold image similarity scores #526

Merged
harrisonpim merged 6 commits into main from similarity-threshold on Aug 22, 2022

Conversation

harrisonpim
Contributor

This PR sets a threshold similarity score for images so that we can avoid displaying questionable matches. Closes #516

This change messes with some of our existing tests - the dummy index populated while setting up the tests can't produce the same scores as the fully populated prod index, and the thresholds therefore filter out any results which might have been matched. In some tests I've been able to manually set the minScore to 0, but in others which call the API itself, that's not possible. I feel like these things might be more appropriately tested in rank, where tests can be run against a full index and scoring can be properly examined.
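
For readers following along, here is a minimal sketch of how a per-metric default threshold with an optional override might behave. It is not the code from this PR: `Hit`, `SimilarityThreshold`, and `filterHits` are hypothetical names, and only `minScore`, `SimilarityMetric`, and the threshold values are taken from the diff under review. Whether the real change filters in the search request or in application code is not shown in this conversation.

```scala
// Illustrative only: Hit, SimilarityThreshold and filterHits are hypothetical names.
sealed trait SimilarityMetric
object SimilarityMetric {
  case object Blended extends SimilarityMetric
  case object Features extends SimilarityMetric
  case object Colors extends SimilarityMetric
}

final case class Hit(id: String, score: Double)

object SimilarityThreshold {
  // Default thresholds, copied from the values in the reviewed diff.
  def defaultMinScore(metric: SimilarityMetric): Double = metric match {
    case SimilarityMetric.Blended  => 300
    case SimilarityMetric.Features => 300
    case SimilarityMetric.Colors   => 20
  }

  // Keep only hits at or above the threshold; an explicit minScore overrides the default.
  def filterHits(
    hits: Seq[Hit],
    metric: SimilarityMetric,
    minScore: Option[Double] = None
  ): Seq[Hit] = {
    val threshold = minScore.getOrElse(defaultMinScore(metric))
    hits.filter(_.score >= threshold)
  }
}
```

Passing `Some(0.0)` as `minScore` mirrors the workaround described above for tests that run against a sparsely populated index, where no hit would otherwise reach the default threshold.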

@harrisonpim harrisonpim requested review from alexwlchan and jamieparkinson and removed request for alexwlchan August 12, 2022 18:20
Comment on lines +57 to +61
val defaultMinScore: Double = similarityMetric match {
  case SimilarityMetric.Blended  => 300
  case SimilarityMetric.Features => 300
  case SimilarityMetric.Colors   => 20
}
Contributor

Can you explain how you came up with these default scores? Why is Colors so much lower?

Contributor Author

Sure. There's a notebook in the data science repo which produced the results in wellcomecollection/platform#5581. Based on that analysis, we decided that ~300 seemed like an appropriate threshold for the blended similarity metric.

I re-ran that analysis with the state.inferredData.lshEncodedFeatures and state.inferredData.palette fields individually, and produced these corresponding graphs:

lshEncodedFeatures:
[graph image: 2c12fc1b-c57d-490f-83fd-7590149831de]

palette:
[graph image: 9c4a9ae3-49d1-4834-a769-f56f32e1bc1c]

If my mental maths is right, the scores for colours are generally lower because the state.inferredData.palette field contains fewer, more commonly occurring terms.
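
A rough sketch of that reasoning, under assumptions this thread does not confirm: that the index uses BM25-style scoring, and that a match score is roughly the sum of inverse document frequency (IDF) contributions over the matched terms. `IdfSketch` and all the numbers used with it are made up for illustration; they are not measurements from the real index.

```scala
// Illustrative only: assumes BM25-style scoring, which is not confirmed for this index.
object IdfSketch {
  // BM25-style IDF: terms that occur in fewer documents contribute more.
  def idf(docCount: Long, docFreq: Long): Double =
    math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5))

  // Crude estimate: sum the IDF of every matched term,
  // ignoring term-frequency saturation, length normalisation and boosts.
  def roughScore(docCount: Long, matchedDocFreqs: Seq[Long]): Double =
    matchedDocFreqs.map(idf(docCount, _)).sum
}
```

With these hypothetical inputs, a field where a match shares many rare terms (say `Seq.fill(100)(10L)` in a 1,000-document index) accumulates a much higher `roughScore` than a field where a match shares a handful of common terms (say `Seq.fill(5)(500L)`), which is the direction of the difference described above.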

Contributor

Ah, super. Maybe just include a comment pointing to that ticket, so we can find this again in future?

@harrisonpim harrisonpim merged commit 8f06ae1 into main Aug 22, 2022
@harrisonpim harrisonpim deleted the similarity-threshold branch August 22, 2022 09:42
Development

Successfully merging this pull request may close these issues.

add a threshold score for visually similar images
3 participants