Given an image, find other similar images from a learned set of images.
See CBIR.ipynb for the implementation.
- Find similar images (or duplicates)
- Find the source of an image (e.g. the directory or link it originated from), whether or not the file name matches
- If an image exists in multiple versions at different resolutions, find all of them
- Copyright violation detection
- No human metadata labeling required
- Analyze images based on the shapes / tones / textures within each image
- Slow, especially on large image sets
- Requires some hyperparameter tuning
- Likely does not work well on images that have been flipped or rotated
- Google Image Search
- TinEye
- image-match (open source, but no longer maintained)
- ImageHash (open source, offers multiple image hashing algorithms)
- To see if I could
- Privacy: lets me search for similar images in my private albums without uploading them to a web service such as Google or TinEye
- The advantage of both image-match and ImageHash is that they are much faster than this implementation, due to simpler hashing / signature calculations
- This implementation may be more accurate than the open source implementations above, since the low-dimensional embeddings are tuned specifically for each dataset, i.e. data points (images) more similar to each other are moved closer together and points more different are pushed further apart in the low-dimensional manifold
- Additionally, the implementations above were designed specifically for images, while this approach can be re-tooled for any data type (e.g. audio or text) as long as the data can be converted to numbers
A `CBIR` class is built according to the following framework:
- Walk the path, search for and store all image names
- Build an image array with `PIL` by processing each image as follows:
  - Convert to greyscale
  - Resize to 64x64
  - Reshape to a 1-dimensional array (i.e. shape 1x4096)
- Fit the UMAP (Uniform Manifold Approximation & Projection) algorithm with the following hyperparameters:
  - with the specified `n_neighs` neighbors (this is to be tuned)
  - with the specified `n_comps` components (i.e. dimensions) (select an appropriate number based on speed and available compute power)
  - `min_d` = 0.0 for `min_dist` (pre-set so that similar points clump together)
- Tune the `n_neighs` hyperparameter by calculating the trustworthiness score:
  - The trustworthiness score ranges from 0 to 1, with the higher end indicating a better lower-dimensional representation of the dataset
  - Plot the trustworthiness score versus k neighbors
  - Select an appropriate k neighbors
  - [Figure: Trustworthiness Score]
- Plug `n_neighs` = `k` neighbors from the previous step into the final model
- Save the final model
- Query an image against the built dataset as desired