Data Preprocessing Notes:
The pandas library is essential for data preprocessing.
pd.read_csv('path/to/dataset') creates a DataFrame.
In any dataset with which you train a model:
- Features/Independent variables: the columns with which we predict the dependent variable. Usually the first columns.
- Labels/Dependent variable: usually the last column of the dataset; this is what has to be predicted.
Missing data can harm training, so we often fill it in using various tools (e.g., a missing salary can be replaced by the average salary).
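A minimal sketch of mean imputation, assuming scikit-learn's SimpleImputer and a made-up DataFrame (the column names and values are only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with a missing salary value.
df = pd.DataFrame({
    "Country": ["Spain", "Germany", "France"],
    "Salary": [52000.0, np.nan, 61000.0],
})

# Replace missing numeric values with the column average.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[["Salary"]] = imputer.fit_transform(df[["Salary"]])
print(df)
```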
Encoding categorical data:
- Sometimes there are string categories in the data which are difficult for the machine to understand, so we encode them. One simple way is to give a number to each category, e.g., Spain: 1, Germany: 2, etc.
A better way is one-hot encoding: turn the country column into three different binary columns, something like [1,0,0]: Spain, [0,1,0]: Germany. This is easier for the machine to interpret.
Label encoding: for the dependent variable we use LabelEncoder.
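A hedged sketch of both encodings with scikit-learn (the country names and the yes/no label are invented for illustration): OneHotEncoder via ColumnTransformer for the feature, LabelEncoder for the dependent variable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Hypothetical data: one categorical feature, one numeric feature, a yes/no label.
X = pd.DataFrame({"Country": ["Spain", "Germany", "France"],
                  "Age": [38, 40, 44]})
y = ["Yes", "No", "Yes"]

# One-hot encode the Country column; pass the other columns through unchanged.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
)
X_encoded = ct.fit_transform(X)   # each country becomes a vector like [1, 0, 0]

# Label-encode the dependent variable (Yes/No -> 1/0).
y_encoded = LabelEncoder().fit_transform(y)
print(X_encoded, y_encoded)
```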
When should feature scaling be done?
- It should be done after splitting the dataset into training and test sets.
- Feature scaling puts the variables on the same scale.
- The test set is supposed to be brand new, so it should not be touched before training.
- Feature scaling works on the data before training.
Feature Scaling:
- Allows us to put all the features on the same scale:
  a. Standardisation (-3 to +3): works well all the time.
  b. Normalisation (0 to 1): recommended when the distribution is normal.
- Go with Standardisation. Only apply transform (not fit_transform) on the test dataset.
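A minimal sketch of the order of operations (split first, then scale), assuming scikit-learn and a small made-up numeric dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (age, salary) and a binary target.
X = np.array([[25, 40000], [30, 52000], [45, 61000], [50, 75000]], dtype=float)
y = np.array([0, 0, 1, 1])

# 1) Split first, so the test set stays brand new.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2) Fit the scaler on the training set only...
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# 3) ...and only apply transform (not fit_transform) to the test set.
X_test = scaler.transform(X_test)
```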
Unlike regression, where we predict a continuous number, we use classification to classify objects into different categories. It finds its use cases in a lot of medical and marketing fields.
Some of the main classification models are:
- Logistic Regression
- K nearest Neighbors
- Support Vector Machine
- Kernel SVM
- Naive Bayes
- Decision Tree Classification
- Random Forest Classification
Logistic Regression: applying linear regression to a classification problem.
We estimate the likelihood of a person taking an offer.
Linear regression at least gives us the range of people taking the offer: the part of the line between 0 and 1 makes sense, but not the parts above or below it.
We cut off the line at the top and bottom by applying a sigmoid function to /y = mx + c/, which gives p = 1 / (1 + e^-(b0 + b1*x)).
An example with various age groups:
We use the probability as a score. But what if we don’t want a probability and instead ask for a prediction?
Anything with a probability of less than 0.5 is projected to 0, and anything above 0.5 is projected to 1.
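A hedged sketch of logistic regression on a single made-up feature (age vs. taking the offer); predict_proba gives the probability score and predict applies the 0.5 cut-off:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: age vs. whether the person took the offer (0/1).
ages = np.array([[18], [22], [25], [35], [45], [52], [60]])
took_offer = np.array([0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(ages, took_offer)

# Probability score (sigmoid applied to b0 + b1*age)...
print(clf.predict_proba([[30]]))
# ...and the hard prediction, projected to 0 or 1 around the 0.5 threshold.
print(clf.predict([[30]]))
```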
K-Nearest Neighbors (K-NN) rule guide:
- Choose the number K of neighbours; the default value is 5.
- Take the Manhattan distance or the Euclidean distance.
- Count the data points in each category among those K neighbours.
- Assign the new data point to the category where you counted the most neighbours.
Euclidean distance: take the Euclidean distance to the 5 nearest points and assign the majority category.
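A minimal K-NN sketch with scikit-learn's defaults (K = 5; Minkowski metric with p = 2, i.e. Euclidean distance); the 2-D points are invented:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points and their categories.
X = np.array([[1, 2], [2, 1], [2, 3], [8, 8], [9, 7], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

# n_neighbors=5 is the default K; metric="minkowski" with p=2 is Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X, y)

# The new point goes to the category with the most of its 5 nearest neighbours.
print(knn.predict([[3, 3]]))
```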
SVM tries to pick the extreme cases of each category, which is risky. If we are differentiating between apples and oranges, most algorithms will look only at the most common features; SVM looks at the boundary cases and tries to create a line to separate them.
What if the data cannot be separated linearly? In that case we use Kernel SVM.
Map the data into a linearly separable dataset using a higher dimension.
After applying some mapping function, a hyperplane separates the data.
If the distance to the landmark is large, the exponent is a large negative number and we get a value very close to zero. If the point is closer to the landmark, the exponent is closer to zero and e^0 is 1. (This is the Gaussian RBF kernel: K(x, l) = e^(-||x - l||^2 / (2*sigma^2)).)
We use the kernel to separate our data.
Anything outside the circle will be assigned 0; anything inside will be assigned 1. Sigma defines how wide the circumference of the circle can be. By finding the right sigma we find the distinction.
Types of kernel function (e.g., Gaussian RBF, polynomial, sigmoid):
Projecting the hyperplane here is the same as running a linear model in 3D: we use the RBF kernel to map the data into a 3D plot and fit a hyperplane with minimum error.
How can the above be applied to train a model?
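A hedged sketch of a kernel SVM with the Gaussian RBF kernel (in scikit-learn, gamma plays the role of 1/(2*sigma^2)); the circular toy data is only an illustration of a non-linearly-separable case:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: category 1 inside a circle, category 0 outside it.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# A straight line cannot separate this; the RBF kernel maps the data to a
# higher dimension where a hyperplane can.
clf = SVC(kernel="rbf", gamma=1.0)
clf.fit(X, y)

# A point near the centre of the circle vs. a point far outside it.
print(clf.predict([[0.0, 0.0], [1.8, 1.8]]))
```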
Decision tree classification splits the data in such a way as to maximise the dominance of one category in each split. It is very similar to the regression version of decision trees; the difference is only in whether the algorithm classifies or regresses.
Ensemble learning is when you take a lot of models, train them, and take the average (or vote) of their results. For example, we can pick multiple subsets of points on the basis of which we build decision trees, whose combined average gives a comparatively good result (this is the idea behind random forest).
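A minimal sketch of that ensemble idea with a random forest classifier (many decision trees built on random samples, combined by majority vote); the data is invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 2-D data with two categories.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# n_estimators is the number of decision trees whose votes are combined.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)
print(forest.predict([[1.0, 1.0], [-1.0, -1.0]]))
```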
K-Means clustering: assign each point to the nearest centroid, then move the centroids to the mean of the points assigned to them.
When no new reassignments take place, we can assume the algorithm has converged.
The initial selection of centroids can hinder the selection of clusters. There is a modification of the K-Means clustering algorithm that fixes this: the K-Means++ algorithm.
When there is only 1 cluster, the value of WCSS (within-cluster sum of squares) is very large.
When the number of clusters is increased to 2, the WCSS decreases.
When the number of clusters is 3, it decreases further.
We can add as many clusters as we like, but how do we find the optimal fit?
When the drop in WCSS becomes small, we see an elbow point; that is the number of clusters required for modelling the data.
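A hedged sketch of K-Means with k-means++ initialisation and the elbow method; scikit-learn exposes WCSS as inertia_, and the three blobs of points are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming three loose groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Compute WCSS (inertia_) for 1..10 clusters; the elbow suggests the optimum.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()
```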
Regression models are used for predicting continuous values, such as future values of a particular quantity.
Y = b0 + b1*X1
- Y: dependent variable (the thing whose dependence you are trying to understand).
- X1: independent variable, which may or may not affect the dependent variable.
- b1: coefficient (the connector between Y and X1).
- b0: constant term.
If we are trying to figure out the salary for a given amount of work experience: Salary = b0 + b1 * Experience
b0 (constant): intersection with the Y axis (the starting salary). b1: slope of the line.
Simple Linear Regression:
Simple linear regression makes many such lines from the actual to the predicted values and chooses the line that minimises the sum of squared differences:
SUM(y - ŷ)^2 -> min
The line represents the predicted regression line trying to fit through the test dataset.
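A minimal sketch of simple linear regression (salary vs. experience, as in the note above); the numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary.
experience = np.array([[1], [2], [3], [5], [8]])
salary = np.array([40000, 45000, 52000, 60000, 80000])

reg = LinearRegression()
reg.fit(experience, salary)      # minimises SUM(y - y_hat)^2

print(reg.intercept_)            # b0: starting salary (intersection on the Y axis)
print(reg.coef_)                 # b1: slope of the line
print(reg.predict([[4]]))        # predicted salary for 4 years of experience
```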
Multiple linear regression has many more coefficients compared to simple linear regression: Y = b0 + b1*x1 + b2*x2 + b3*x3 + …
There are some independent variables or features which we have to throw out to increase the accuracy of the model; only the important features should be selected.
The significance level (SL) is the threshold at which we discard a feature and assume it is not adding any value to the model.
Methods to select the correct features:
- All in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score Comparison
Stepwise regression refers to Backward Elimination, Forward Selection, and Bidirectional Elimination.
All in: throw all the variables into the model. It is also used as preparation for Backward Elimination.
Backward Elimination:
1. Select a significance level to stay in the model (e.g., SL = 5%).
2. Fit the full model with all possible predictors.
3. Consider the predictor with the highest P-value. If P > SL, go to step 4; otherwise FIN.
4. Remove that predictor.
5. Fit the model without this variable, then go back to step 3.
Forward Selection:
1. Select a significance level to enter the model (e.g., SL = 5%).
2. Fit all simple regression models; select the one with the lowest P-value.
3. Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have.
4. Consider the predictor with the lowest P-value. If P < SL, go back to step 3; otherwise FIN and keep the previous model.
Bidirectional Elimination:
1. Select significance levels to enter and to stay in the model (e.g., SLENTER = 5%, SLSTAY = 5%).
2. Perform the next step of Forward Selection (new variables must have P < SLENTER).
3. Perform the next steps of Backward Elimination (old variables must have P < SLSTAY), then go to step 2.
4. Repeat until no new variables can enter and no old variables can exit.
Score Comparison (all possible models):
1. Select a criterion of goodness of fit.
2. Construct all possible regression models: 2^N - 1 combinations in total.
3. Select the one with the best criterion.
Backward Elimination is the fastest of these.
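A hedged sketch of Backward Elimination using statsmodels' OLS (statsmodels is an assumption here, not something the notes prescribe); X is assumed to be a pandas DataFrame of features and y the target:

```python
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    """Iteratively drop the predictor with the highest p-value until all p <= SL."""
    X = sm.add_constant(X)                  # add the b0 (intercept) column
    while True:
        model = sm.OLS(y, X).fit()          # fit with all remaining predictors
        worst_p = model.pvalues.max()       # highest p-value
        if worst_p > significance_level:
            worst_feature = model.pvalues.idxmax()
            X = X.drop(columns=[worst_feature])   # remove it and refit
        else:
            return model, list(X.columns)   # FIN: every predictor is significant

# Hypothetical usage:
# model, kept = backward_elimination(X_train_df, y_train)
# print(model.summary())
```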
Polynomial regression: Y = b0 + b1*x1 + b2*x1^2 + … If the data distribution is non-linear, we need a non-linear curve to match the data better. We add columns for the powers of the feature.
Polynomial models are used to describe, for example, how diseases might spread. Why didn’t we split the data into training and test sets? We have a very small number of observations, so we take all the rows.
Comparing the linear and polynomial models using matplotlib.
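A hedged sketch of that comparison with scikit-learn and matplotlib; the data, and the choice of degree 3, are only illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data (no train/test split: very few observations).
X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = np.array([1, 2, 4, 8, 15, 26, 40, 60, 85, 120], dtype=float)

# Plain linear regression.
lin = LinearRegression().fit(X, y)

# Polynomial regression: add columns for powers of the feature, then fit linearly.
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
poly_reg = LinearRegression().fit(X_poly, y)

plt.scatter(X, y, color="red")
plt.plot(X, lin.predict(X), label="linear")
plt.plot(X, poly_reg.predict(X_poly), label="polynomial")
plt.legend()
plt.show()
```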
SVR adds a layer of buffer (the epsilon tube) around the linear regression line. Some points lie outside the epsilon tube; these give the slack variables. Minimising the distance of the slack variables from the tube defines the buffered line passing through the data.
For models like SVR, where the relationship between features and target is implicit (there are no explicit coefficients), we have to apply feature scaling.
We also apply feature scaling to the label / dependent variable.
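A minimal SVR sketch where both X and y are scaled (y reshaped to 2-D for the scaler); the data and the epsilon value are invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical data: position level vs. salary.
X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

# Scale the features AND the dependent variable (the relationship is implicit).
sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

# epsilon controls the width of the tube around the fitted line.
svr = SVR(kernel="rbf", epsilon=0.1)
svr.fit(X_scaled, y_scaled)

# Predict on the original scale by inverting the y scaling.
pred_scaled = svr.predict(sc_X.transform([[6.5]]))
print(sc_y.inverse_transform(pred_scaled.reshape(-1, 1)))
```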
Decision trees come in two types: classification trees and regression trees.
Each partition is called a leaf. The algorithm finds the splits, and the final leaves are called terminal leaves.
The above figure shows how the decision tree is constructed using the splits.
You take the average of the points within each terminal leaf, and that average is assigned as the prediction.
Add these average values to the decision tree; incoming data will then use this decision tree to make predictions.
We don’t have to apply feature scaling for decision trees. They work well with highly complex datasets.
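A minimal decision tree regression sketch (no feature scaling needed); the data is invented:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: the prediction for a new point is the average of the
# training points that fall into the same terminal leaf.
X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)
print(tree.predict([[6.5]]))
```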
Random forest is a version of ensemble learning, in which you take the same algorithm multiple times and combine the results to make it better.
- Pick random data points from the training set.
- Build a decision tree based on the data points selected above.
- Keep building regression trees.
- Use all of them to predict, and average their predictions.
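A minimal random forest regression sketch following the steps above; the number of trees and the data are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data, same shape as in the decision tree example.
X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

# n_estimators regression trees are built on random samples of the data;
# the final prediction is the average of all their predictions.
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)
print(forest.predict([[6.5]]))
```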