This repository contains a project representing the full life cycle of implementing and deploying machine learning models, from scratch to production:
- It starts with creating a **web scraper** for gathering information about real estate listings in Serbia. Scraped data is stored directly in a **PostgreSQL** container via pipelines defined in the **scrapy** project structure. **Rotating proxies** and **rotating user agents** are used in combination with **autothrottling** (the request rate is automatically adjusted based on the website's server response time) so we don't get banned from the websites (a pipeline and settings sketch follows this list).
- After that, the **raw data is cleaned**. Various ambiguous entries are sorted out using common sense and some business knowledge of the real estate market in Serbia and its cities. We are also **analysing, visualizing and extracting** insights that will be valuable for implementing the machine learning models (see the cleaning sketch below).
- When it comes to the modeling section, we use a **custom implementation** of **multiple linear regression with gradient descent** and a **multiclass (one-vs-rest) kernel SVM**. A regression sketch is included below the list.
- Once the models are trained, a simple **web app** is created for demonstrating the models' capabilities to the end user. The application is containerized with **Docker**. The web app is built using the **streamlit** Python library.
- A **CI/CD** pipeline is created with **Travis CI** for building the Docker container for our web application and deploying it to **Docker Hub**.
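To make the scraping step concrete, here is a minimal, hypothetical sketch of a scrapy item pipeline writing into PostgreSQL. The connection parameters, table and column names are assumptions for illustration, not the repository's actual code:

```python
# pipelines.py - hypothetical sketch of an item pipeline storing
# scraped listings in PostgreSQL (table/column names are assumptions)
import psycopg2

class PostgresPipeline:
    def open_spider(self, spider):
        # In a real project these parameters would come from scrapy settings
        self.conn = psycopg2.connect(
            host="localhost", dbname="real_estate",
            user="scraper", password="secret",
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO listings (title, city, area_m2, price_eur) "
            "VALUES (%s, %s, %s, %s)",
            (item.get("title"), item.get("city"),
             item.get("area_m2"), item.get("price_eur")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```

The autothrottling mentioned above is scrapy's built-in AutoThrottle extension, enabled in `settings.py`; the values below are illustrative:

```python
# settings.py - AutoThrottle adapts the request rate to the server's
# response time; rotating proxies/user agents come from middlewares
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

ITEM_PIPELINES = {"project.pipelines.PostgresPipeline": 300}

# Rotating proxies / user agents are typically plugged in as downloader
# middlewares from third-party packages, e.g. scrapy-rotating-proxies:
# DOWNLOADER_MIDDLEWARES = {
#     "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
# }
```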
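The cleaning step could look roughly like the following pandas sketch; the column names and thresholds are invented for illustration and are not the repository's actual rules:

```python
import pandas as pd

# Hypothetical cleaning sketch (columns and thresholds are assumptions)
df = pd.read_csv("raw_listings.csv")
df = df.drop_duplicates()
df = df.dropna(subset=["price_eur", "area_m2", "city"])
# Business-knowledge filters: drop listings with implausible values
df = df[(df["price_eur"] > 0) & (df["area_m2"].between(10, 1000))]
df["price_per_m2"] = df["price_eur"] / df["area_m2"]
df.to_csv("clean_listings.csv", index=False)
```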
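For the regression model, a minimal NumPy sketch of multiple linear regression trained with batch gradient descent illustrates the general idea (hyperparameters are arbitrary; this is not the repository's implementation):

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.01, epochs=1000):
    """Fit y ~ X @ w + b by batch gradient descent on the MSE."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        error = X @ w + b - y              # residuals
        w -= lr * (2 / n) * (X.T @ error)  # dMSE/dw
        b -= lr * (2 / n) * error.sum()    # dMSE/db
    return w, b

# Example usage (works best with standardized features):
# w, b = fit_linear_regression(X_train, y_train, lr=0.05, epochs=5000)
```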
The described project is implemented in Python 3.8. It was done as project work for the course Finding the hidden knowledge (Machine learning) in the Master's degree programme in Software Engineering.
In this repository, under the `src` directory, you may find a separate project for each of the steps described above. In each project's `README.md` file you may find documentation and instructions on how to set up and use the code. Those projects are:
- `database` - contains instructions on how to set up and run the persistent PostgreSQL database container, along with scripts and backups.
- `web scraping` - contains the scrapy project and instructions on how to run the spiders that collect data about real estate listings from the websites and store it in the database.
- `data analysis` - contains the Jupyter notebooks for cleaning the raw dataset and for analysing and visualizing the data.
- `modeling` - contains the implementation of the custom linear regression and SVM classification models, a description of how the models work, Jupyter notebooks for training those models, and additional notebooks for training similar models from the sklearn library as baselines for evaluation and comparison (see the one-vs-rest sketch after this list).
- `streamlit app` - contains the simple containerized web application for demonstrating the trained models (a minimal app sketch follows as well).
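As a hedged illustration of the one-vs-rest decomposition named above, the wrapper below trains one binary kernel SVM per class and predicts the class with the highest decision score. `binary_factory` is a hypothetical parameter, not part of the repository's code:

```python
import numpy as np

class OneVsRestSVM:
    """One-vs-rest: train one binary classifier per class; predict the
    class whose classifier returns the highest decision score."""

    def __init__(self, binary_factory):
        self.binary_factory = binary_factory  # returns a fresh binary SVM
        self.classes_ = None
        self.models_ = []

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            clf = self.binary_factory()
            clf.fit(X, np.where(y == c, 1, -1))  # this class vs. the rest
            self.models_.append(clf)
        return self

    def predict(self, X):
        scores = np.column_stack(
            [m.decision_function(X) for m in self.models_]
        )
        return self.classes_[np.argmax(scores, axis=1)]

# Example with sklearn standing in for the custom binary kernel SVM,
# mirroring the baseline comparison mentioned above:
# from sklearn.svm import SVC
# ovr = OneVsRestSVM(lambda: SVC(kernel="rbf")).fit(X_train, y_train)
```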
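And a minimal streamlit sketch of the kind of demo app described above; the input fields and the `model.pkl` path are assumptions for illustration:

```python
# app.py - hypothetical minimal demo; run with: streamlit run app.py
import pickle

import numpy as np
import streamlit as st

st.title("Flat price prediction")

area = st.number_input("Area (m2)", min_value=10.0, max_value=500.0, value=50.0)
rooms = st.number_input("Number of rooms", min_value=1, max_value=10, value=2)
floor = st.number_input("Floor", min_value=0, max_value=30, value=1)

if st.button("Predict price"):
    # Assumes a model trained in the modeling step was serialized to disk
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    price = model.predict(np.array([[area, rooms, floor]]))[0]
    st.success(f"Estimated price: {price:,.0f} EUR")
```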
In the root directory you may also find:
- `requirements.txt` file - Python 3.8 dependencies for running all of the projects. Please refer to the `README.md` documentation under each project on how to run them.
- `.travis.yml` file - for **Travis CI** to automatically build and deploy the streamlit web app container directly to Docker Hub, so users can pull the image and use it (a hedged sketch follows).
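For reference, here is a sketch of what such a `.travis.yml` could look like; the image name, paths and environment variables are placeholders, not the repository's actual configuration:

```yaml
# Hypothetical sketch, not the repository's actual .travis.yml
language: python
python: "3.8"
services:
  - docker

script:
  - docker build -t "$DOCKER_USER/streamlit-app:latest" src/streamlit_app

deploy:
  provider: script
  script: >-
    echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USER" --password-stdin
    && docker push "$DOCKER_USER/streamlit-app:latest"
  on:
    branch: master
```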
Here is a demo of the final web application using our trained models to predict a flat's price based on the user's inputs: