Investigate problematic visually similar images #5581
Comments
Some things to discern and visualise:
Scoring the problematic matches:
We show 6 similar matches in the image modal. Here are the scores for the top 6 matches for another 1000 randomly chosen images:
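The summary above could be reproduced with something like the following sketch. The `scores` array here is a random stand-in for the real top-6 score data, and the percentile choices are illustrative, not the ones used in the original analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the real data: one row of top-6 similarity
# scores per query image, for 1000 randomly chosen images.
scores = rng.random((1000, 6))

# Percentiles of the scores across all query/match pairs,
# roughly the kind of summary described above.
for p in (5, 25, 50, 75, 95):
    print(f"p{p}: {np.percentile(scores, p):.3f}")
```

Looking at where the displayed matches sit in this distribution is what suggests the threshold discussion below.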
Without seeing examples of images at different scores, just from the two examples and the percentiles it seems like our threshold is far too low. I would rather we didn't display similar images when we can't hit a high threshold than always try to include them and end up with a poor set of suggestions.
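That suggestion amounts to filtering on a minimum score rather than always padding out to six results. A minimal sketch, assuming matches arrive as score-annotated dicts; the helper name and the `0.8` threshold are hypothetical, not production values:

```python
def filter_matches(matches, min_score=0.8, max_results=6):
    """Only surface similar-image suggestions that clear a minimum
    score. Hypothetical helper: the threshold is illustrative."""
    good = [m for m in matches if m["score"] >= min_score]
    return sorted(good, key=lambda m: m["score"], reverse=True)[:max_results]

# With a strict threshold, weak matches are dropped entirely
# rather than being padded out to six suggestions.
matches = [{"id": "a", "score": 0.95}, {"id": "b", "score": 0.42}]
filter_matches(matches)  # → [{"id": "a", "score": 0.95}]
```

The trade-off is that some images would show fewer than six (or zero) suggestions, which the comment above argues is preferable to showing poor ones.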
Summarising what I spoke about with @jtweed earlier:

The palette contributions are a much smaller part of the total score than we expected. For the other example (dwhuv3ph / cg7hzgv8), the result is even more extreme. Because the results aren't particularly visually similar, my hypothesis is that we're seeing a lot of bad matches in the LSH features.

This might be happening because we use k-means clustering, which doesn't allow weakly connected points to fall outside the clusters and remain unlabelled. The sklearn docs show this effect in action. If we imagine a super simple dataset with two major clusters in a 2d feature space, here's a very basic illustration of what we get with k-means versus what we should really be looking for.

Switching from k-means to OPTICS or DBSCAN would allow us to keep all of our existing query patterns in place while significantly limiting the number of poor matches within each feature subspace, and thereby limit the number of visually dissimilar results on the site.
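The k-means vs density-based behaviour described above can be demonstrated on a toy dataset. This is only an illustration of the general effect, not of the real feature space; the data, `eps`, and `min_samples` values are invented for the example:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two tight clusters plus one isolated point, mimicking a
# weakly connected image in a 2d feature subspace.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.1, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.1, size=(50, 2)),
    [[10.0, -10.0]],  # the outlier
])

# k-means forces every point, including the outlier, into a cluster...
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(km_labels[-1])  # the outlier still gets a cluster id (0 or 1)

# ...whereas DBSCAN marks weakly connected points as noise (-1),
# so they never pollute a cluster's similarity matches.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(db_labels[-1])  # -1
```

Under k-means the outlier would share a cluster label with one of the dense groups and so could surface as a "similar" image; under DBSCAN it stays unlabelled and would be excluded from matching.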
We've had a couple of reports of problematic visually similar images.
The images in question don't look particularly similar to a human, but do have some limited colour similarities that should not be enough to have them appear as visually similar images.
We should perform an initial investigation into why these images are appearing and what we can do to improve the model. We may also need to increase the threshold required to display an image as visually similar.
Slack thread: https://wellcome.slack.com/archives/C8X9YKM5X/p1658310838492189