diff --git a/01 - Data Exploration.ipynb b/01 - Data Exploration.ipynb
index 5c8cc0e..502792f 100644
--- a/01 - Data Exploration.ipynb
+++ b/01 - Data Exploration.ipynb
@@ -15,7 +15,7 @@
     "> **Note**: If you've never used the Jupyter Notebooks environment before, there are a few things you should be aware of:\n",
     "> \n",
     "> - Notebooks are made up of *cells*. Some cells (like this one) contain *markdown* text, while others (like the one beneath this one) contain code.\n",
-    "> - The notebook is connected to a Python *kernel* (you can see which one at the top right of the page - if you're running this noptebook in an Azure Machine Learning compute instance it should be connected to the **Python 3.6 - AzureML** kernel). If you stop the kernel or disconnect from the server (for example, by closing and reopening the notebook, or ending and resuming your session), the output from cells that have been run will still be displayed; but any variables or functions defined in those cells will have been lost - you must rerun the cells before running any subsequent cells that depend on them.\n",
+    "> - The notebook is connected to a Python *kernel* (you can see which one at the top right of the page - if you're running this notebook in an Azure Machine Learning compute instance it should be connected to the **Python 3.6 - AzureML** kernel). If you stop the kernel or disconnect from the server (for example, by closing and reopening the notebook, or ending and resuming your session), the output from cells that have been run will still be displayed; but any variables or functions defined in those cells will have been lost - you must rerun the cells before running any subsequent cells that depend on them.\n",
     "> - You can run each code cell by using the **► Run** button. The **◯** symbol next to the kernel name at the top right will briefly turn to **⚫** while the cell runs before turning back to **◯**.\n",
     "> - The output from each code cell will be displayed immediately below the cell.\n",
     "> - Even though the code cells can be run individually, some variables used in the code are global to the notebook. That means that you should run all of the code cells **in order**. There may be dependencies between code cells, so if you skip a cell, subsequent cells might not run correctly.\n",
@@ -89,7 +89,13 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Note that multiplying a list by 2 creates a new list of twice the length with the original sequence of list elements repeated. Multiplying a NumPy array on the other hand performs an element-wise calculation in which the array behaves like a *vector*, so we end up with an array of the same size in which each element has been multipled by 2.\n",
+    "Note that multiplying a list by 2 creates a new list of twice the length with the original sequence of list elements repeated. Multiplying a NumPy array on the other hand performs an element-wise calculation in which the array behaves like a *vector*, so we end up with an array of the same size in which each element has been multiplied by 2.\n",
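+    "In other words (a quick sketch - assuming `numpy` has been imported as `np`):\n",
+    "\n",
+    "```python\n",
+    "[1, 2, 3] * 2             # [1, 2, 3, 1, 2, 3] - the list is repeated\n",
+    "np.array([1, 2, 3]) * 2   # array([2, 4, 6])   - each element is doubled\n",
+    "```\n",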
     "\n",
     "The key takeaway from this is that NumPy arrays are specifically designed to support mathematical operations on numeric data - which makes them more useful for data analysis than a generic list.\n",
     "\n",
@@ -433,7 +439,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The DataFrame's **read_csv** method is used to load data from text files. As you can see in the example code, you can specify options such as the column delimiter and which row (if any) contains column headers (in this case, the delimter is a comma and the first row contains the column names - these are the default settings, so the parameters could have been omitted).\n",
+    "The DataFrame's **read_csv** method is used to load data from text files. As you can see in the example code, you can specify options such as the column delimiter and which row (if any) contains column headers (in this case, the delimiter is a comma and the first row contains the column names - these are the default settings, so the parameters could have been omitted).\n",
     "\n",
     "\n",
     "### Handling missing values\n",
@@ -857,7 +863,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The histogram for grades is a symmetric shape, where the most frequently occuring grades tend to be in the middle of the range (around 50), with fewer grades at the extreme ends of the scale.\n",
+    "The histogram for grades is a symmetric shape, where the most frequently occurring grades tend to be in the middle of the range (around 50), with fewer grades at the extreme ends of the scale.\n",
     "\n",
     "#### Measures of central tendency\n",
     "\n",
@@ -1398,11 +1404,12 @@
     "\n",
     "> **Warning - Math Ahead!**\n",
     ">\n",
-    "> Cast your mind back to when you were learning how to solve linear equations in school, and recall that the *slope-intercept* form of a linear equation lookes like this:\n",
+    "> Cast your mind back to when you were learning how to solve linear equations in school, and recall that the *slope-intercept* form of a linear equation looks like this:\n",
     ">\n",
     "> \begin{equation}y = mx + b\end{equation}\n",
     ">\n",
-    "> In this equation, *y* and *x* are the coordinate variables, *m* is the slope of the line, and *b* is the y-intercept (where the line goes through the Y axis).\n",
+    "> In this equation, *y* and *x* are the coordinate variables, *m* is the slope of the line, and *b* is the y-intercept (where the line goes through the Y-axis).\n",
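+    "> For example, a line with slope *m* = 2 and y-intercept *b* = 1 gives *y* = 2(3) + 1 = 7 when *x* = 3.\n",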
     ">\n",
     "> In the case of our scatter plot for our student data, we already have our values for *x* (*StudyHours*) and *y* (*Grade*), so we just need to calculate the intercept and slope of the straight line that lies closest to those points. Then we can form a linear equation that calculates a new *y* value on that line for each of our *x* (*StudyHours*) values - to avoid confusion, we'll call this new *y* value *f(x)* (because it's the output from a linear equation ***f***unction based on *x*). The difference between the original *y* (*Grade*) value and the *f(x)* value is the *error* between our regression line and the actual *Grade* achieved by the student. Our goal is to calculate the slope and intercept for a line with the lowest overall error.\n",
     ">\n",
diff --git a/02 - Regression.ipynb b/02 - Regression.ipynb
index 0cae051..95addc5 100644
--- a/02 - Regression.ipynb
+++ b/02 - Regression.ipynb
@@ -14,7 +14,7 @@
     "\n",
     "$$y = f([x_1, x_2, x_3, ...])$$\n",
     "\n",
-    "The goal of training the model is to find a function that performs some kind of calcuation to the *x* values that produces the result *y*. We do this by applying a machine learning *algorithm* that tries to fit the *x* values to a calculation that produces *y* reasonably accurately for all of the cases in the training dataset.\n",
+    "The goal of training the model is to find a function that performs some kind of calculation on the *x* values that produces the result *y*. We do this by applying a machine learning *algorithm* that tries to fit the *x* values to a calculation that produces *y* reasonably accurately for all of the cases in the training dataset.\n",
     "\n",
     "There are lots of machine learning algorithms for supervised learning, and they can be broadly divided into two types:\n",
     "\n",
@@ -221,7 +221,7 @@
     "\n",
     "- **holiday**: There are many fewer days that are holidays than days that aren't.\n",
     "- **workingday**: There are more working days than non-working days.\n",
-    "- **weathersit**: Most days are category *1* (clear), with category *2* (mist and cloud) the next most common. There are comparitively few category *3* (light rain or snow) days, and no category *4* (heavy rain, hail, or fog) days at all.\n",
+    "- **weathersit**: Most days are category *1* (clear), with category *2* (mist and cloud) the next most common. There are comparatively few category *3* (light rain or snow) days, and no category *4* (heavy rain, hail, or fog) days at all.\n",
     "\n",
     "Now that we know something about the distribution of the data in our columns, we can start to look for relationships between the features and the **rentals** label we want to be able to predict.\n",
     "\n",
@@ -278,7 +278,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The plots show some variance in the relationship between some category values and rentals. For example, there's a clear difference in the distribution of rentals on weekends (**weekday** 0 or 6) and those during the working week (**weekday** 1 to 5). Similarly, there are notable differences for **holiday** and **workingday** categories. There's a noticable trend that shows different rental distributions in summer and fall months compared to spring and winter months. The **weathersit** category also seems to make a difference in rental distribution. The **day** feature we created for the day of the month shows little variation, indicating that it's probably not predictive of the number of rentals."
+    "The plots show some variance in the relationship between some category values and rentals. For example, there's a clear difference in the distribution of rentals on weekends (**weekday** 0 or 6) and those during the working week (**weekday** 1 to 5). Similarly, there are notable differences for **holiday** and **workingday** categories. There's a noticeable trend that shows different rental distributions in summer and fall months compared to spring and winter months. The **weathersit** category also seems to make a difference in rental distribution. The **day** feature we created for the day of the month shows little variation, indicating that it's probably not predictive of the number of rentals."
   ]
  },
  {
@@ -307,7 +307,8 @@
   "source": [
     "After separating the dataset, we now have numpy arrays named **X** containing the features, and **y** containing the labels.\n",
     "\n",
-    "We *could* train a model using all of the data; but it's common practice in supervised learning to split the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to validate the trained model. This enables us to evaluate how well the model performs when used with the validation dataset by comparing the predicted labels to the known labels. It's important to split the data *randomly* (rather than say, taking the first 70% of the data for training and keeping the rest for validation). This helps ensure that the two subsets of data are statistically comparable (so we validate the model with data that has a similar statistical distibution to the data on which it was trained).\n",
+    "We *could* train a model using all of the data; but it's common practice in supervised learning to split the data into two subsets: a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to validate the trained model. This enables us to evaluate how well the model performs when used with the validation dataset by comparing the predicted labels to the known labels. It's important to split the data *randomly* (rather than, say, taking the first 70% of the data for training and keeping the rest for validation). This helps ensure that the two subsets of data are statistically comparable (so we validate the model with data that has a similar statistical distribution to the data on which it was trained).\n",
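+    "For example, a 70/30 split of a 1,000-row dataset would train the model on a randomly chosen 700 rows, and hold back the other 300 rows for validation.\n",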
     "\n",
     "To randomly split the data, we'll use the **train_test_split** function in the **scikit-learn** library. This library is one of the most widely used machine learning packages for Python."
   ]
@@ -567,7 +568,7 @@
     "\n",
     "### Try an Ensemble Algorithm\n",
     "\n",
-    "Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by appying an aggregate function to a collection of base models (sometimes referred to a *bagging*) or by building a sequence of models that build on one another to improve predictive performance (referred to as *boosting*).\n",
+    "Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to as *bagging*) or by building a sequence of models that build on one another to improve predictive performance (referred to as *boosting*).\n",
     "\n",
     "For example, let's try a Random Forest model, which applies an averaging function to multiple Decision Tree models for a better overall model."
   ]
@@ -722,7 +723,7 @@
     "\n",
     "We trained a model with data that was loaded straight from a source file, with only moderately successful results.\n",
     "\n",
-    "In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing trasformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n",
+    "In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n",
     "\n",
     "### Scaling numeric features\n",
     "\n",
@@ -738,7 +739,8 @@
     "| -- | --- | --- |\n",
     "| 0.3 | 0.48| 0.65|\n",
     "\n",
-    "There are multiple ways you can scale numeric data, such as calculating the minimum and maximum values for each column and assigning a proportional value between 0 and 1, or by using the mean and standard deviation of a normally distributed variable to mainatain the same *spread* of values on a different scale.\n",
+    "There are multiple ways you can scale numeric data, such as calculating the minimum and maximum values for each column and assigning a proportional value between 0 and 1, or by using the mean and standard deviation of a normally distributed variable to maintain the same *spread* of values on a different scale.\n",
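+    "For instance, with min/max scaling, a value of 65 in a column that ranges from 0 to 100 is scaled to (65 - 0) / (100 - 0) = 0.65.\n",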
     "\n",
     "### Encoding categorical variables\n",
     "\n",
diff --git a/03 - Classification.ipynb b/03 - Classification.ipynb
index 993b150..5795a9b 100644
--- a/03 - Classification.ipynb
+++ b/03 - Classification.ipynb
@@ -20,7 +20,7 @@
   "source": [
     "## Binary Classification\n",
     "\n",
-    "Let's start by looking at an example of *binary classification*, where the model must predict a label that belongs to one of two classes. In this exercsie, we'll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.\n",
+    "Let's start by looking at an example of *binary classification*, where the model must predict a label that belongs to one of two classes. In this exercise, we'll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.\n",
     "\n",
     "### Explore the data\n",
     "\n",
@@ -48,7 +48,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "This data consists of diagnostic information about some patients who have been tested for diabetes. Scroll to the right if necessary, and note that the final column in the dataset (**Diabetic**) contains the value ***0*** for patients who tested negative for diabetes, and ***1*** for patients who tested positive. This is the label that we will train our mode to predict; most of the other columns (**Pregnancies**,**PlasmaGlucose**,**DiastolicBloodPressure**, and so on) are the features we will use to predict the **Diabetic** label.\n",
+    "This data consists of diagnostic information about some patients who have been tested for diabetes. Scroll to the right if necessary, and note that the final column in the dataset (**Diabetic**) contains the value ***0*** for patients who tested negative for diabetes, and ***1*** for patients who tested positive. This is the label that we will train our model to predict; most of the other columns (**Pregnancies**, **PlasmaGlucose**, **DiastolicBloodPressure**, and so on) are the features we will use to predict the **Diabetic** label.\n",
     "\n",
     "Let's separate the features from the labels - we'll call the features ***X*** and the label ***y***:"
   ]
@@ -99,7 +99,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "For some of the features, there's a noticable difference in the distribution for each label value. In particular, **Pregnancies** and **Age** show markedly different distributions for diabetic patients than for non-diabetic patients. These features may help predict whether or not a patient is diabetic.\n",
+    "For some of the features, there's a noticeable difference in the distribution for each label value. In particular, **Pregnancies** and **Age** show markedly different distributions for diabetic patients than for non-diabetic patients. These features may help predict whether or not a patient is diabetic.\n",
     "\n",
     "### Split the data\n",
     "\n",
@@ -227,7 +227,7 @@
     "\n",
     "> note that the header row may not line up with the values!\n",
     "\n",
-    "* *Precision*: Of the predictons the model made for this class, what proportion were correct?\n",
+    "* *Precision*: Of the predictions the model made for this class, what proportion were correct?\n",
     "* *Recall*: Out of all of the instances of this class in the test dataset, how many did the model identify?\n",
     "* *F1-Score*: An average metric that takes both precision and recall into account.\n",
     "* *Support*: How many instances of this class are there in the test dataset?\n",
@@ -320,7 +320,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilties are compared. If we were to change the threshold, it would affect the predictions; and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the *true positive rate* (which is another name for recall) and the *false positive rate* for a range of possible thresholds. These rates are then plotted against all possible thresholds to form a chart known as a *received operator characteristic (ROC) chart*, like this:"
+    "The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared. If we were to change the threshold, it would affect the predictions, and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the *true positive rate* (which is another name for recall) and the *false positive rate* for a range of possible thresholds. These rates are then plotted against all possible thresholds to form a chart known as a *receiver operating characteristic (ROC) chart*, like this:"
   ]
  },
 {
@@ -383,10 +383,27 @@
     "\n",
     "In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n",
     "\n",
-    "- Scaling numeric features so they're on the same scale. This prevents feaures with large values from producing coefficients that disproportionately affect the predictions.\n",
+    "- Scaling numeric features so they're on the same scale. This prevents features with large values from producing coefficients that disproportionately affect the predictions.\n",
     "- Encoding categorical variables. For example, by using a *one hot encoding* technique you can create individual binary (true/false) features for each possible category value.\n",
     "\n",
-    "To apply these preprocessing transformations, we'll make use of a Scikit-Learn feature named *pipelines*. These enable us to define a set of preprocessing steps that end with an algorithm. You can then fit the entire pipeline to the data, so that the model encapsulates all of the preprocessing steps as well as the regression algorithm. This is useful, because when we want to use the model to predict values from new data, we need to apply the same transformations (based on the same statistical distributions and catagory encodings used with the training data).\n",
+    "To apply these preprocessing transformations, we'll make use of a Scikit-Learn feature named *pipelines*. These enable us to define a set of preprocessing steps that end with an algorithm. You can then fit the entire pipeline to the data, so that the model encapsulates all of the preprocessing steps as well as the classification algorithm. This is useful, because when we want to use the model to predict values from new data, we need to apply the same transformations (based on the same statistical distributions and category encodings used with the training data).\n",
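+    "A minimal sketch of the idea (assuming scikit-learn, with `numeric_features` and `categorical_features` defined as lists of column names, and `X_train`/`y_train` as the training data):\n",
+    "\n",
+    "```python\n",
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
+    "\n",
+    "# Scale numeric columns and one-hot encode categorical ones\n",
+    "preprocess = ColumnTransformer(transformers=[\n",
+    "    ('num', StandardScaler(), numeric_features),\n",
+    "    ('cat', OneHotEncoder(), categorical_features)])\n",
+    "\n",
+    "# Fitting the pipeline runs the preprocessing steps, then trains the model\n",
+    "pipeline = Pipeline(steps=[('prep', preprocess), ('model', LogisticRegression())])\n",
+    "pipeline.fit(X_train, y_train)\n",
+    "```\n",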
     "\n",
     ">**Note**: The term *pipeline* is used extensively in machine learning, often to mean very different things! In this context, we're using it to refer to pipeline objects in Scikit-Learn, but you may see it used elsewhere to mean something else.\n"
   ]
@@ -698,7 +715,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Now that we know what the feaures and labels in the data represent, let's explore the dataset. First, let's see if there are any missing (*null*) values."
+    "Now that we know what the features and labels in the data represent, let's explore the dataset. First, let's see if there are any missing (*null*) values."
   ]
  },
 {
@@ -1204,4 +1221,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
\ No newline at end of file
+}