
Machine-Learning-101

Preprocessing

Data Preprocessing Notes:

  1. The pandas library is the main tool for data preprocessing.

  2. pd.read_csv('path/to/dataset') creates a DataFrame.

  3. In any dataset used to train a model:

    1. Features/Independent variables: the columns with which we predict the dependent variable. They are usually the first columns.
    2. Label/Dependent variable: usually the last column of the dataset; this is what has to be predicted.
  4. Missing data can be harmful when training a model, so we often fill it in with various tools (e.g. a missing salary can be replaced by the average salary).

  5. Encoding categorical data:

    1. Sometimes there are string categories that are difficult for the machine to understand, so we encode them. One simple way is to give each category a number, e.g. Spain: 1, Germany: 2.
  6. A better way is one-hot encoding: turn the country column into three binary columns, e.g. [1,0,0]: Spain, [0,1,0]: Germany. This is easier for the machine to interpret.

  7. Label encoding: for the dependent variable we use a LabelEncoder.

  8. When should feature scaling be done?

    1. It should be done after splitting the dataset into training and test sets.
    2. Feature scaling puts the variables on the same scale.
    3. The test set is supposed to be brand new, so it should not be touched before training.
    4. Feature scaling works on the data before training.
  9. Feature Scaling:

    1. Standardisation (roughly -3 to +3): works well in practically all cases.
    2. Normalisation (0 to 1): recommended when the distribution is normal.

    Go with standardisation by default.
  10. Only apply transform (not fit_transform) on the test dataset; see the sketch after this list.
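
A minimal sketch tying these preprocessing steps together with pandas and scikit-learn (the file name `data.csv`, the column layout, and the country column are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Load the dataset: features are all columns except the last, the label is the last column.
df = pd.read_csv("data.csv")          # hypothetical path
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Fill missing numeric values (e.g. a missing salary) with the column mean.
imputer = SimpleImputer(strategy="mean")
X[:, 1:] = imputer.fit_transform(X[:, 1:])

# One-hot encode the categorical country column (assumed to be column 0).
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough")
X = ct.fit_transform(X)

# Label-encode the dependent variable.
y = LabelEncoder().fit_transform(y)

# Split first, then scale: fit the scaler on the training set only,
# and only transform the test set so it stays "brand new".
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```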

Classification

Unlike regression, where we predict a continuous number, classification assigns objects to discrete categories. It finds use cases in many medical and marketing applications.

Some of the main classification algorithms are:

  1. Logistic Regression
  2. K nearest Neighbors
  3. Support Vector Machine
  4. Kernel SVM
  5. Naive Bayes
  6. Decision Tree Classification
  7. Random Forest Classification

Logistic Regression:


Applying linear regression to a classification problem:

We estimate the likelihood of a person taking an offer.

Linear regression at least gives us the trend of people taking the offer: the part of the line between 0 and 1 makes sense, but the parts above 1 or below 0 do not.


We cut the line off at the top and bottom by applying a sigmoid function to y = mx + c.

Logistic Regression:

ln(p / (1 - p)) = b0 + b1*x


An example with various age groups:


We use the probability as a score. But what if we don't want a probability and ask for a prediction instead?

Anything with a probability below 0.5 is projected to 0, and anything above 0.5 is projected to 1.
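
A minimal sketch of logistic regression with scikit-learn; the ages and offer labels below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: age vs. whether the person took the offer (0/1).
X = np.array([[18], [22], [25], [30], [35], [42], [50], [58], [63], [70]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba gives the likelihood score; predict applies the 0.5 cut-off.
print(clf.predict_proba([[28]]))   # probabilities for class 0 and class 1
print(clf.predict([[28]]))         # projected to 0 or 1
```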

K Nearest Neighbor:


Rule guide:

  1. Choose the number K of neighbours; the default value is 5.
  2. Compute the Manhattan or Euclidean distance from the new data point to its K nearest neighbours.
  3. Count how many of those neighbours fall in each category.
  4. Assign the new data point to the category with the most neighbours.


Euclidean distance between two points (x1, y1) and (x2, y2): d = sqrt((x2 - x1)^2 + (y2 - y1)^2)


Take the Euclidean distance from the new point to its 5 nearest neighbours and assign it the majority category.
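
A minimal KNN sketch with scikit-learn, using a synthetic dataset so it is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic two-class dataset.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling matters for KNN because it is distance-based.
sc = StandardScaler()
X_train, X_test = sc.fit_transform(X_train), sc.transform(X_test)

# K = 5 neighbours (the default); p=2 gives the Euclidean distance, p=1 the Manhattan distance.
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```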

Support Vector Machine


SVM deliberately picks the extreme cases of each category, which is the risky part. If we are differentiating between apples and oranges, most algorithms look at the most common examples of each; SVM instead looks at the boundary cases and uses them to construct the line that separates the classes.
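
A minimal sketch of a linear-kernel SVM with scikit-learn on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# kernel="linear" draws a straight separating line based on the boundary
# (support vector) points of each class.
svm = SVC(kernel="linear", random_state=0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```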


Kernel SVM

What if the data cannot be separated linearly? In that case we use Kernel SVM.


Map the data into a linearly separable form by projecting it into a higher dimension.


After applying such a mapping function, a hyperplane separates the data.


Mapping to a higher dimension can require much more compute power, which is where the kernel trick comes in.

The Kernel Trick


The RBF kernel is K(x, l) = e^(-||x - l||^2 / (2*sigma^2)). If a point is far from the landmark l, the exponent is a large negative number and the kernel value is very close to zero. If the point is close to the landmark, the distance is near zero and e^0 = 1.


We use the kernel to separate our data.

Anything outside the circle is assigned (close to) 0, and anything inside it is assigned (close to) 1. Sigma defines how wide the circle can be; by finding the right sigma we find the distinction between the classes.

Types of kernel function include, for example, the Gaussian RBF kernel, the sigmoid kernel, and the polynomial kernel.
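
A minimal kernel-SVM sketch: `make_circles` produces data that no straight line can separate, and the RBF kernel handles it (in scikit-learn, `gamma` plays the role of 1/(2*sigma^2)):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# make_circles gives data that cannot be separated linearly.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The RBF kernel applies the kernel trick instead of explicitly mapping
# the data to a higher dimension.
svm_rbf = SVC(kernel="rbf", gamma=1.0, random_state=0)
svm_rbf.fit(X_train, y_train)
print(svm_rbf.score(X_test, y_test))
```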

Non Linear SVR


Projecting a hyperplane here is the same as running a linear model in 3D: we use an RBF kernel to map the data into a 3D plot and fit the hyperplane that gives the minimum error.

Naive Bayes:


Bayes' theorem states P(A|B) = P(B|A) * P(A) / P(B). To train a model with it, we compute the probability of each class given the observed features and predict the most likely class.
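
A minimal Gaussian Naive Bayes sketch with scikit-learn on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# GaussianNB applies Bayes' theorem, P(class | x) proportional to P(x | class) * P(class),
# assuming the features are independent given the class.
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))
```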


Decision Tree:


The tree splits the data in such a way as to maximise the purity of each category in every split. It is very similar to the regression version of decision trees; the difference lies only in whether the algorithm classifies or regresses.

Random Forest Classifier:

Ensemble learning is when you take many models, train them, and combine their outputs (for example by voting or averaging). For instance, we can build multiple decision trees on different subsets of the data; their combined vote gives a comparatively good result.
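
A minimal random forest classifier sketch with scikit-learn on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 100 decision trees, each trained on a random sample of the data;
# the forest's prediction is the majority vote of the trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```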

CAP (Cumulative Accuracy Profile) Analysis:


K Means Clustering


Move the centroids to the mean of the points assigned to them, then re-assign each point to its nearest centroid.


When no new re-assignments take place, we can assume the algorithm has converged.


The initial selection of centroids can hinder the resulting clusters. There is a modification of the K-Means algorithm, k-means++, which chooses better initial centroids.


When there is only 1 cluster, the value of WCSS (the within-cluster sum of squares) is very large.


When the number of clusters is increased to 2, the WCSS decreases.


When the number of clusters is 3, it decreases further.


We can have as many clusters as we like, but how do we find the optimal number?

When the drop in WCSS becomes small, we see an elbow point; that elbow is the number of clusters to use for modelling the data.
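
A minimal elbow-method sketch with scikit-learn; `inertia_` is the WCSS of the fitted clustering:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Compute WCSS for 1..10 clusters; the "elbow" in this curve is the number to pick.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()
```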

Regression

Regression models are used for predicting continuous values, for example forecasting future values of a quantity.

Simple Linear Regression:

Y = b0 + b1*X1

Y: dependent variable (the thing whose dependence we are trying to understand)
X1: independent variable, which may or may not affect the dependent variable
b1: coefficient (the connector between Y and X1)
b0: constant term

If we are trying to figure out the salary for a given amount of work experience: Salary = b0 + b1 * Experience

b0 (constant): the intersection with the Y axis (the starting salary)
b1: the slope of the line


Simple linear regression considers many such lines from the actual points to the predicted points and chooses the one that minimises

SUM(y - ŷ)^2 -> min

The line represents the prediction, trying to fit through the test dataset.
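
A minimal simple linear regression sketch; the experience/salary numbers are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: years of experience vs. salary.
experience = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
salary = np.array([40000, 45000, 50000, 60000, 65000, 72000, 80000, 86000])

reg = LinearRegression().fit(experience, salary)
print(reg.intercept_, reg.coef_)   # b0 (starting salary) and b1 (slope)
print(reg.predict([[10]]))         # predicted salary for 10 years of experience
```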


Multiple Regression

There are many more coefficients compared to simple linear regression: Y = b0 + b1*x1 + b2*x2 + b3*x3 + …

There are some independent variables (features) which we have to throw out to increase the accuracy of the model; only important features should be selected.

P-Value:

The significance level is the p-value threshold at which we discard a feature and assume it is not adding any value to the model.

Methods to select the correct features:

  1. All in
  2. Backward Elimination
  3. Forward Selection
  4. Bidirectional Elimination
  5. Score Comparison

Stepwise regression covers Backward Elimination, Forward Selection, and Bidirectional Elimination.

All in:

Throw all the variables into the model. It is used to prepare for Backward Elimination.

Backward Elimination:

  1. Select a significance level to stay in the model (e.g. SL = 0.05).
  2. Fit the full model with all possible predictors.
  3. Consider the predictor with the highest P-value.
    1. If P > SL, go to step 4; otherwise FIN.
  4. Remove that predictor.
  5. Fit the model without this variable.
  6. Go to step 3 (a sketch of this loop follows the list).
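
A sketch of the backward-elimination loop using statsmodels, assuming `X` is a numeric pandas DataFrame of predictors and `y` the target (the helper name is made up):

```python
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X: pd.DataFrame, y, sl=0.05):
    """Repeatedly drop the predictor with the highest p-value until all are within the significance level."""
    X = sm.add_constant(X)                        # adds the b0 intercept column named "const"
    while True:
        model = sm.OLS(y, X).fit()
        p_values = model.pvalues.drop("const")    # ignore the intercept's p-value
        if p_values.empty or p_values.max() <= sl:
            return model                          # FIN: every remaining predictor is significant
        X = X.drop(columns=[p_values.idxmax()])   # remove the worst predictor, then refit
```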

Forward Selection

  1. Select a significance level to enter the model (e.g. SL = 0.05).
  2. Fit all simple regression models and select the one with the lowest P-value.
  3. Keep that variable and fit all possible models with one extra predictor added to the ones you already have.
  4. Consider the predictor with the lowest P-value. If P < SL, go to step 3; otherwise FIN and keep the previous model.

Bidirectional Elimination

  1. Select significance levels STAY = 5% and ENTER = 5%.
  2. Perform the next step of Forward Selection (new variables must have P < ENTER).
  3. Perform all steps of Backward Elimination (old variables must have P < STAY), then go to step 2.
  4. Stop when no new variable can enter and no old variable can exit.

All Models (Score Comparison):

  1. Select a criterion of goodness of fit.
  2. Construct all possible regression models: 2^n - 1 combinations for n predictors.
  3. Select the one with the best criterion.

Backward Elimination is the fastest of these.

Polynomial Linear Regression

Y = b0 + b1*x1 + b2*x1^2 + … If the data distribution is non-linear, we need a non-linear curve to match the data better, so we add columns for powers of the feature.


Polynomial models are used to describe, for example, how diseases might spread. Why didn't we split the data into training and test sets here? We have a very small number of observations, so we take all the rows.

Comparing the linear and polynomial models using matplotlib:
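
A sketch of that comparison with scikit-learn and matplotlib; the data points are made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy non-linear data.
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

# Plain linear regression.
lin_reg = LinearRegression().fit(X, y)

# Polynomial regression: add the x^2, x^3, x^4 columns, then fit a linear model on them.
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
poly_reg = LinearRegression().fit(X_poly, y)

# Compare the two fits.
plt.scatter(X, y, color="red")
plt.plot(X, lin_reg.predict(X), label="linear")
plt.plot(X, poly_reg.predict(X_poly), label="polynomial")
plt.legend()
plt.show()
```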


Support Vector Regression


SVR adds a layer of buffer (the epsilon tube) around the linear regression line. Points that lie outside the epsilon tube are slack variables; minimising their distance to the tube defines the buffered line passing through the data.

Models with an implicit relationship between features and target, such as SVR, require feature scaling.

We also apply feature scaling to the label (dependent variable).
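
A minimal SVR sketch showing the scaling of both the features and the label (toy data, made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

# SVR has no explicit coefficients, so we scale both the features and the label.
sc_X, sc_y = StandardScaler(), StandardScaler()
X_sc = sc_X.fit_transform(X)
y_sc = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

svr = SVR(kernel="rbf")          # the RBF kernel handles the non-linear shape
svr.fit(X_sc, y_sc)

# Predict on a new value: scale the input, then invert the scaling of the output.
pred_sc = svr.predict(sc_X.transform([[6.5]]))
print(sc_y.inverse_transform(pred_sc.reshape(-1, 1)))
```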

Decision Trees

There are two kinds: classification trees and regression trees.


Each partition of the feature space is called a leaf. The algorithm finds the splits, and the final leaves are called terminal leaves.


The splits found this way define how the decision tree is constructed.

The prediction assigned to a terminal leaf is the average of the training points that fall within it.


These average values are attached to the terminal leaves, and new data points routed through the decision tree receive the corresponding average as their prediction.


We don't have to apply feature scaling for decision trees, and they work well with highly complex datasets.
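
A minimal decision tree regression sketch on toy data (made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

# No feature scaling needed: the tree only compares values at split points
# and predicts the average of the training points in each terminal leaf.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)
print(tree.predict([[6.5]]))
```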

Random Forest Regression

This is a version of ensemble learning, in which you run the same algorithm multiple times and combine the results to make it better:

  1. Pick random data points from the training set.
  2. Build a decision tree based on the data points selected above.
  3. Keep building regression trees this way.
  4. Use all of them to predict and average their outputs (see the sketch below).
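
A minimal random forest regression sketch on the same toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

# 10 regression trees, each built on random samples of the data;
# the prediction is the average of the individual tree predictions.
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)
print(forest.predict([[6.5]]))
```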
