- Created a tool that predicts the average experience required for a data science position (MAE: ~2 years) to help prospective applicants find and apply to relevant, attainable positions.
- Scraped all 78 pages of data science job postings from Naukri using Python, Selenium, and BeautifulSoup.
- Engineered features from the tags companies attach to their job postings to quantify the value they place on the most popular skills in the field.
- Optimized Linear, Lasso, and Random Forest regressors using GridSearchCV to find the best-performing model.
- Built a client-facing API using Flask.
Python version: 3.11
Packages: pandas, NumPy, scikit-learn, Matplotlib, Seaborn, Selenium, BeautifulSoup, Flask, json, pickle
Web Framework Requirements: pip install -r requirements.txt
Scraper Article: https://medium.com/analytics-vidhya/scraping-job-aggregator-site-naukri-com-using-python-and-beautiful-soup-a08a2046639b
Flask Productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2
I learnt the basics of building a scraper from the article linked above and modified it to scrape all 78 pages of data science job postings. The following features were scraped (a sketch of the scraping loop follows the list):
- Job URL
- Job title
- Number of reviews
- Company rating
- Company name
- Experience required
- Salary
- Location
- Days since posted
- Tags
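A condensed sketch of the scraping loop; the CSS selectors and page-URL pattern are assumptions based on Naukri's layout at the time, not the exact ones from my script:

```python
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
rows = []
for page in range(1, 79):  # all 78 pages of data science postings
    driver.get(f"https://www.naukri.com/data-science-jobs-{page}")
    time.sleep(3)  # give the JavaScript-rendered listings time to load
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for card in soup.find_all("article", class_="jobTuple"):  # one card per posting
        title = card.find("a", class_="title")
        company = card.find("a", class_="subTitle")
        rows.append({
            "Job title": title.text.strip() if title else None,
            "Job URL": title.get("href") if title else None,
            "Company name": company.text.strip() if company else None,
        })
driver.quit()
pd.DataFrame(rows).to_csv("naukri_data_science_jobs.csv", index=False)
```

The remaining fields (rating, reviews, experience, salary, location, days since posted, tags) come from the same card via additional selectors.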
After scraping, the data needed cleaning so it could provide insights and be usable for the model. I made the following changes and created the following features (an illustrative snippet follows the list):
- Simplified the job title column.
- Added a column for the seniority of the position.
- Added a column indicating whether a salary is mentioned.
- Parsed the numeric count out of the 'Number of reviews' column.
- Parsed out the minimum and maximum experience and created an average experience required column.
- Created columns for the job's state.
- Made new columns for the skills mentioned in the tags.
- Made a column for the top 20 companies with the most job postings.
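A few of these steps, sketched with pandas; the column names match the scraped fields above, but the exact regexes in the project may differ:

```python
import pandas as pd

df = pd.read_csv("naukri_data_science_jobs.csv")

# 'Experience required' arrives as strings like "2-7 Yrs";
# parse the bounds and average them to get the target variable.
exp = df["Experience required"].str.extract(r"(\d+)\s*-\s*(\d+)").astype(float)
df["avg_experience"] = (exp[0] + exp[1]) / 2

# Flag whether a salary figure is disclosed at all.
df["salary_mentioned"] = (~df["Salary"].str.contains("Not disclosed", na=False)).astype(int)

# One indicator column per popular skill found in the posting's tags.
for skill in ["python", "sql", "machine learning", "deep learning", "aws"]:
    df[skill.replace(" ", "_")] = (
        df["Tags"].str.lower().str.contains(skill, na=False).astype(int)
    )
```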
I analysed the distributions of the data and the value counts for the categorical columns, and made a word cloud of the most frequent keywords appearing in the job tags. Below are a few highlights from the EDA:
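As one example, the tag word cloud can be generated along these lines (a minimal sketch, continuing from the cleaning snippet above; `wordcloud` is a separate package, installed with `pip install wordcloud`):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Concatenate all tags into one string and let WordCloud size words by frequency.
text = " ".join(df["Tags"].dropna().str.lower())
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```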
I transformed the categorical features into dummy variables and used 3-fold cross-validation for performance evaluation.
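The setup, roughly (a sketch continuing from the cleaning snippet; the column subset and names like `seniority` and `job_state` are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Keep only model-ready columns (illustrative subset) and dummy-encode the categoricals.
model_df = df[["seniority", "job_state", "salary_mentioned",
               "python", "sql", "avg_experience"]].dropna()
X = pd.get_dummies(model_df.drop(columns=["avg_experience"]))
y = model_df["avg_experience"]

# 3-fold CV on negative MAE (scikit-learn maximizes scores, hence the sign flip).
lm = LinearRegression()
scores = cross_val_score(lm, X, y, scoring="neg_mean_absolute_error", cv=3)
print(f"Linear regression MAE: {-np.mean(scores):.2f}")
```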
I built three different models and evaluated them using mean absolute error (MAE), as it is relatively easy to interpret and outliers are not a major issue here.
The models I tried:
- Multivariate Linear Regression - Model baseline.
- Lasso Regression - Because the data is sparse from the many categorical dummy variables, a regularized regression such as the Lasso seemed a good fit.
- Random Forest - Also suited to the sparsity, and random forests tend to perform decently on most tabular data.
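The tuning step looked roughly like this (a sketch reusing `X` and `y` from above; the hyperparameter grids are illustrative, not the exact ones from the project):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Grid-search each model on negative MAE with the same 3-fold CV.
lasso_search = GridSearchCV(
    Lasso(),
    {"alpha": [0.01, 0.05, 0.1, 0.5, 1.0]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
lasso_search.fit(X, y)

rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [100, 200], "max_features": ["sqrt", 1.0]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
rf_search.fit(X, y)

print(f"Lasso MAE: {-lasso_search.best_score_:.2f}")
print(f"Random Forest MAE: {-rf_search.best_score_:.2f}")
```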
Lasso Regression performed the best on this data:
- Lasso Regression MAE: 2.14
- Random Forest Regressor MAE: 2.21
- Multivariate Linear Regression MAE: 2.37
I built a Flask API endpoint hosted on a local server by following the tutorial referenced above. The endpoint takes in a request with a list of values from a job listing and returns an estimate of the average experience required.
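A minimal sketch of the endpoint (assuming the tuned model was pickled to `model.pkl`; the route name and payload shape are illustrative):

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the serialized best model once at startup.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects {"input": [<feature values in training column order>]}.
    features = np.array(request.get_json()["input"]).reshape(1, -1)
    prediction = model.predict(features)[0]
    return jsonify({"avg_experience": round(float(prediction), 2)})

if __name__ == "__main__":
    app.run(debug=True)
```

A request can then be sent with, for example, `requests.post("http://127.0.0.1:5000/predict", json={"input": [...]})`.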