From cab332b27ea4aef7f329eac01c599063353a5ed9 Mon Sep 17 00:00:00 2001
From: Violeta Sosa
Date: Sat, 2 Jan 2021 06:17:26 +0100
Subject: [PATCH] Update 03 - Classification.ipynb

---
 03 - Classification.ipynb | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/03 - Classification.ipynb b/03 - Classification.ipynb
index 993b150..5795a9b 100644
--- a/03 - Classification.ipynb
+++ b/03 - Classification.ipynb
@@ -20,7 +20,7 @@
    "source": [
     "## Binary Classification\n",
     "\n",
-    "Let's start by looking at an example of *binary classification*, where the model must predict a label that belongs to one of two classes. In this exercsie, we'll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.\n",
+    "Let's start by looking at an example of *binary classification*, where the model must predict a label that belongs to one of two classes. In this exercise, we'll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.\n",
     "\n",
     "### Explore the data\n",
     "\n",
@@ -48,7 +48,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This data consists of diagnostic information about some patients who have been tested for diabetes. Scroll to the right if necessary, and note that the final column in the dataset (**Diabetic**) contains the value ***0*** for patients who tested negative for diabetes, and ***1*** for patients who tested positive. This is the label that we will train our mode to predict; most of the other columns (**Pregnancies**,**PlasmaGlucose**,**DiastolicBloodPressure**, and so on) are the features we will use to predict the **Diabetic** label.\n",
+    "This data consists of diagnostic information about some patients who have been tested for diabetes. Scroll to the right if necessary, and note that the final column in the dataset (**Diabetic**) contains the value ***0*** for patients who tested negative for diabetes, and ***1*** for patients who tested positive. This is the label that we will train our model to predict; most of the other columns (**Pregnancies**, **PlasmaGlucose**, **DiastolicBloodPressure**, and so on) are the features we will use to predict the **Diabetic** label.\n",
     "\n",
     "Let's separate the features from the labels - we'll call the features ***X*** and the label ***y***:"
    ]
@@ -99,7 +99,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "For some of the features, there's a noticable difference in the distribution for each label value. In particular, **Pregnancies** and **Age** show markedly different distributions for diabetic patients than for non-diabetic patients. These features may help predict whether or not a patient is diabetic.\n",
+    "For some of the features, there's a noticeable difference in the distribution for each label value. In particular, **Pregnancies** and **Age** show markedly different distributions for diabetic patients than for non-diabetic patients. These features may help predict whether or not a patient is diabetic.\n",
     "\n",
     "### Split the data\n",
     "\n",
@@ -227,7 +227,7 @@
     "\n",
     "> note that the header row may not line up with the values!\n",
     "\n",
-    "* *Precision*: Of the predictons the model made for this class, what proportion were correct?\n",
+    "* *Precision*: Of the predictions the model made for this class, what proportion were correct?\n",
     "* *Recall*: Out of all of the instances of this class in the test dataset, how many did the model identify?\n",
     "* *F1-Score*: An average metric that takes both precision and recall into account.\n",
     "* *Support*: How many instances of this class are there in the test dataset?\n",
@@ -320,7 +320,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilties are compared. If we were to change the threshold, it would affect the predictions; and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the *true positive rate* (which is another name for recall) and the *false positive rate* for a range of possible thresholds. These rates are then plotted against all possible thresholds to form a chart known as a *received operator characteristic (ROC) chart*, like this:"
+    "The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared. If we were to change the threshold, it would affect the predictions, and therefore the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the *true positive rate* (which is another name for recall) and the *false positive rate* for a range of possible thresholds. These rates are then plotted against each other to form a chart known as a *receiver operating characteristic (ROC) chart*, like this:"
    ]
   },
   {
@@ -383,10 +383,10 @@
     "\n",
     "In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n",
     "\n",
-    "- Scaling numeric features so they're on the same scale. This prevents feaures with large values from producing coefficients that disproportionately affect the predictions.\n",
+    "- Scaling numeric features so they're on the same scale. This prevents features with large values from producing coefficients that disproportionately affect the predictions.\n",
     "- Encoding categorical variables. For example, by using a *one hot encoding* technique you can create individual binary (true/false) features for each possible category value.\n",
     "\n",
-    "To apply these preprocessing transformations, we'll make use of a Scikit-Learn feature named *pipelines*. These enable us to define a set of preprocessing steps that end with an algorithm. You can then fit the entire pipeline to the data, so that the model encapsulates all of the preprocessing steps as well as the regression algorithm. This is useful, because when we want to use the model to predict values from new data, we need to apply the same transformations (based on the same statistical distributions and catagory encodings used with the training data).\n",
+    "To apply these preprocessing transformations, we'll make use of a Scikit-Learn feature named *pipelines*. These enable us to define a set of preprocessing steps that end with an algorithm. You can then fit the entire pipeline to the data, so that the model encapsulates all of the preprocessing steps as well as the classification algorithm. This is useful, because when we want to use the model to predict values from new data, we need to apply the same transformations (based on the same statistical distributions and category encodings used with the training data).\n",
     "\n",
     ">**Note**: The term *pipeline* is used extensively in machine learning, often to mean very different things! In this context, we're using it to refer to pipeline objects in Scikit-Learn, but you may see it used elsewhere to mean something else.\n"
    ]
@@ -698,7 +698,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now that we know what the feaures and labels in the data represent, let's explore the dataset. First, let's see if there are any missing (*null*) values."
+    "Now that we know what the features and labels in the data represent, let's explore the dataset. First, let's see if there are any missing (*null*) values."
    ]
   },
   {
@@ -1204,4 +1204,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
\ No newline at end of file
+}
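The precision/recall/F1/support definitions corrected in the hunks above (presumably printed by scikit-learn's `classification_report` in the notebook, given the note about the header row) can be sketched by hand. This is a minimal illustration with made-up labels, not code from the notebook; `class_metrics` and the `positive` parameter are hypothetical names:

```python
# Hand-rolled per-class metrics for a binary classifier, matching the
# definitions in the patched markdown cell. Illustrative only; the notebook
# presumably uses scikit-learn to compute these.
def class_metrics(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Precision: of the predictions the model made for this class, what proportion were correct?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of all actual instances of this class, how many did the model identify?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Support: how many actual instances of this class are in the test data.
    support = sum(1 for t in y_true if t == positive)
    return {"precision": precision, "recall": recall, "f1": f1, "support": support}

# Three actual positives; the model predicts positive three times, twice correctly.
m = class_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Here precision and recall both come out to 2/3 (2 of 3 positive predictions correct; 2 of 3 actual positives found), which also makes F1 2/3.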