Eye BnB (www.eyebnb.com)

Business Prospect

• This project helps travelers find Airbnb lodging they like, using a recommendation algorithm based on image similarity.

• It provides value to:

o People looking for stylish, funky, or fancy Airbnb apartments for their trip

o Apartment owners who want to find other apartments in a given area that look like theirs, so that they can either make their own apartment stand out or make it resemble apartments that are already popular

Data sources

• InsideAirbnb, an independent project (not run by Airbnb) that publishes listing data scraped from the Airbnb site, including the apartment lists used here (http://insideairbnb.com/get-the-data.html)

• Images and other details of Airbnb apartments in Boston were scraped with urllib2

Data Preparation

Online collection

o Basic info on Airbnb apartments (location, description, price range, webpage URL) was retrieved from InsideAirbnb and saved to MongoDB running on an EC2 instance

Airbnb_web_scrape_script.py:
Reads URLs from the Airbnb apartment list file (downloaded from http://insideairbnb.com/get-the-data.html), scrapes the desired apartment info, and saves it to JSON files

o Apartment images were scraped from Airbnb and saved to S3

Airbnb_Image_scrape_script.py:
Reads the JSON files generated by the step above, fetches the apartment images, and saves them locally

o Highlight tags and user reviews were scraped at the same time and saved for potential future use

Offline processing

o For each apartment image, image features were extracted and saved to a pickle file


Gist_feature_extraction_script.py:
Extracts GIST features from images and saves them, along with metadata, to a local pickle file

HSV_feature_extraction_script.py:
Extracts HSV features from images and saves them, along with metadata, to a local pickle file

Feature_combination_script.py:
Merges the two features above into a single feature and saves it, along with metadata, to a local pickle file

o Highlight tags and reviews were written to a CSV file for later use
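The README does not say how Feature_combination_script.py merges the two descriptors. A minimal sketch, assuming each descriptor is L2-normalized and the two are then concatenated, with a pickle round trip for the result; the function name `combine_features` and the record layout are hypothetical:

```python
import pickle
import numpy as np

def combine_features(gist, hsv):
    """L2-normalize each descriptor, then concatenate (one possible merge strategy)."""
    parts = []
    for vec in (gist, hsv):
        vec = np.asarray(vec, dtype=np.float64)
        norm = np.linalg.norm(vec)
        # Leave an all-zero descriptor untouched instead of dividing by zero.
        parts.append(vec / norm if norm > 0 else vec)
    return np.concatenate(parts)

# Hypothetical record: one combined feature vector plus metadata for a listing image.
record = {
    "listing_id": "example-123",  # made-up identifier, for illustration only
    "feature": combine_features(np.ones(8), np.arange(4.0)),
}
payload = pickle.dumps(record)    # a script would write these bytes to a local file
restored = pickle.loads(payload)
```

Normalizing before concatenating keeps one descriptor from dominating the other just because its values are on a larger scale.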

Modeling

Image feature extraction (see the references at the bottom for details)

o HSV-Histogram

For each image, an HSV-Histogram feature was extracted by decomposing the image into its H, S, and V channels, building a histogram for each channel, and concatenating the three histograms (90 buckets per channel, giving a single feature vector of length 270). HSV is generally considered closer than RGB to how people perceive color. The HSV-Histogram mainly captures the color information of an image.
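The HSV-Histogram described above can be sketched as follows. This is an illustration rather than the project's script: it uses the standard-library `colorsys` module instead of cv2 so that it only depends on NumPy, and the function name `hsv_histogram` and the per-channel normalization are assumptions.

```python
import colorsys
import numpy as np

def hsv_histogram(rgb, bins=90):
    """Return the concatenated H/S/V histograms (3 * bins values) of an RGB image.

    rgb: uint8 array of shape (height, width, 3).
    """
    scaled = rgb.astype(np.float64) / 255.0
    # colorsys works on scalars, so vectorize it; fine for a sketch, slow for large images.
    to_hsv = np.vectorize(colorsys.rgb_to_hsv)
    h, s, v = to_hsv(scaled[..., 0], scaled[..., 1], scaled[..., 2])
    parts = []
    for channel in (h, s, v):
        counts, _ = np.histogram(channel, bins=bins, range=(0.0, 1.0))
        # Divide by the pixel count so images of different sizes are comparable.
        parts.append(counts / channel.size)
    return np.concatenate(parts)
```

With the default of 90 buckets per channel, the result has length 270, matching the description above.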

o GIST

First, a set of filters was convolved over the entire image, producing one filtered image per filter. Each filtered image was then divided into a 4 * 4 grid of subimages, and the mean value of each subimage was computed. Finally, all the mean values were concatenated into a single vector, the GIST feature. GIST mainly captures the high-level characteristics of an image, such as its structure and whether regions are plain or textured, giving a holistic view of the image.
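The filter-then-pool procedure above can be sketched with NumPy alone. This is a simplified, GIST-like descriptor and not the reference implementation from the papers cited at the bottom: the Gabor-like filter shapes, the number of scales and orientations, and the names `gabor_bank` and `gist_like` are all illustrative assumptions, and the input is assumed to be a square grayscale image.

```python
import numpy as np

def gabor_bank(size, n_scales=2, n_orientations=4):
    """Build Gabor-like transfer functions in the frequency domain (illustrative parameters)."""
    freqs = np.fft.fftfreq(size)
    fx, fy = np.meshgrid(freqs, freqs)
    radius = np.hypot(fx, fy)
    angle = np.arctan2(fy, fx)
    filters = []
    for scale in range(n_scales):
        center = 0.25 / (2 ** scale)  # preferred spatial frequency at this scale
        for k in range(n_orientations):
            theta = np.pi * k / n_orientations
            # Angular difference to the preferred orientation, wrapped to (-pi, pi].
            dtheta = np.angle(np.exp(1j * (angle - theta)))
            radial = np.exp(-((radius - center) ** 2) / (2 * (0.5 * center) ** 2))
            angular = np.exp(-(dtheta ** 2) / (2 * (np.pi / n_orientations) ** 2))
            filters.append(radial * angular)
    return filters

def gist_like(gray, grid=4, n_scales=2, n_orientations=4):
    """Return grid * grid mean filter responses per filter for a square grayscale image."""
    size = gray.shape[0]
    spectrum = np.fft.fft2(gray.astype(np.float64))
    cell = size // grid
    feats = []
    for transfer in gabor_bank(size, n_scales, n_orientations):
        # Filtering is a multiplication in the frequency domain.
        response = np.abs(np.fft.ifft2(spectrum * transfer))
        # Average the response inside each cell of the grid x grid layout.
        blocks = response[:cell * grid, :cell * grid].reshape(grid, cell, grid, cell)
        feats.append(blocks.mean(axis=(1, 3)).ravel())
    return np.concatenate(feats)
```

With 2 scales, 4 orientations, and a 4 * 4 grid, the descriptor has 2 * 4 * 16 = 128 values; a real GIST setup typically uses more filters and therefore a longer vector.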

o CNN (to do)

Transfer learning could serve as a feature extractor. Specifically, the output of the layer just before the softmax in an Inception-v3 model pre-trained on ImageNet may be a feasible image feature.

o Others (to do)

Similarity calculation

o Cosine Distance

Cosine distance was used because it depends only on the angle between two feature vectors, not on their magnitudes
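A minimal sketch of cosine-distance retrieval, assuming the features are stored as rows of a NumPy array; the function names `cosine_distances` and `most_similar` are hypothetical:

```python
import numpy as np

def cosine_distances(query, features):
    """Cosine distance between one query vector and each row of `features`."""
    query = np.asarray(query, dtype=np.float64)
    features = np.asarray(features, dtype=np.float64)
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(query)
    norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero for all-zero vectors
    return 1.0 - (features @ query) / norms

def most_similar(query, features, k=3):
    """Indices of the k rows closest to the query, nearest first."""
    return np.argsort(cosine_distances(query, features))[:k]
```

Scaling a vector does not change its distance to the query: `[1, 0]` and `[2, 0]` are both at distance 0 from `[3, 0]`, which is the magnitude invariance mentioned above.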

Evaluation - conceptually and quantitatively

Challenge: how can we tell whether a feature works for measuring image similarity, given that this is an unsupervised task and the images scraped from Airbnb have no ground-truth labels?

Solution: test the feature on a labeled dataset that resembles the Airbnb scenario

Dataset: image features and similarity measurements were tested on a subset of the Indoor Scene Recognition dataset from MIT. For more details: http://web.mit.edu/torralba/www/indoor.html

Examples in the dataset:

Florist: Dining room:

Assumption: features of images in the same category (bedroom, bookstore, florist, dining room, ...) should be closer to each other than to features of images from other categories. Quantitatively, the higher the ratio dis(between clusters)/dis(within clusters), the better the feature.

The Calinski-Harabasz index was used to validate the features quantitatively. For comparison, a baseline was constructed by shuffling the image labels so that they became random. The results were as follows:

Calinski-Harabasz index with the real labels: 22.816

Calinski-Harabasz index with the shuffled labels: 0.986

The index for the real labels is far higher than for the shuffled baseline, indicating that the feature separates the scene categories and that it is feasible to use it for measuring image similarity

Calinski-Harabasz index: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabaz_score.html
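The real-versus-shuffled comparison can be sketched as follows. The index is computed directly with NumPy so that the example does not depend on a particular scikit-learn release (newer releases expose the function as `calinski_harabasz_score`). The synthetic clusters stand in for the scene-category features and are not the project's data, so the numbers it produces will differ from those reported above.

```python
import numpy as np

def calinski_harabasz(features, labels):
    """Ratio of between-cluster to within-cluster dispersion; higher is better."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]
    classes = np.unique(labels)
    k = classes.size
    overall_mean = features.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in classes:
        members = features[labels == c]
        center = members.mean(axis=0)
        between += members.shape[0] * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Synthetic stand-in: three well-separated "categories" of 8-dimensional features.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 8)) for m in (0.0, 5.0, 10.0)])
labels = np.repeat([0, 1, 2], 50)

real_score = calinski_harabasz(features, labels)
# Baseline: the same features with the labels shuffled at random.
shuffled_score = calinski_harabasz(features, rng.permutation(labels))
```

A feature that carries category information should give a `real_score` well above `shuffled_score`, which is the pattern reported above.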

Deployment

• Step 1: given a picture, the system returns places that look similar to it (in the same style)

• (backlog) Step 2: users can add a few descriptive words about the place they dream of staying in (industrial style, splendid, ...) to refine the results

Results:

Given the image on the upper left, the following apartments are recommended:

With the same picture, the following apartments are regarded as dissimilar:

System diagram

Techniques

• MongoDB

• AWS EC2 and S3

• flask

• (To do) Keras/TensorFlow

References for the techniques involved in the project:

Gist feature:

• Matthijs Douze, Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg, Cordelia Schmid. Evaluation of GIST descriptors for web-scale image search. CIVR 2009 - International Conference on Image and Video Retrieval, Jul 2009, Santorini, Greece. ACM, pp. 19:1-8, 2009.

• Gist/Context of a Scene: http://ilab.usc.edu/siagian/Research/Gist/Gist.html

• A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.

Transfer learning:

https://arxiv.org/pdf/1310.1531v1.pdf

Similarity measurement:

• Similarity measurement between images: https://ieeexplore.ieee.org/document/1508081/

Clustering evaluation:

http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schulte/theses/phd/algorithm.pdf

• Calinski-Harabasz Index and Bootstrap Evaluation with Clustering Methods: http://ethen8181.github.io/machine-learning/clustering_old/clustering/clustering.html

Packages/libraries:

Pymongo, cv2, fftw3, urllib2, boto3, SciPy, Sklearn, PIL, flask