
Commit

Merge pull request MicrosoftDocs#20 from violetasdev/master
Typos Chapter 1 & 2 ml-basics
GraemeMalcolm authored Jan 5, 2021
2 parents c42ec15 + 9d64fab commit 11e3c9c
Showing 3 changed files with 22 additions and 22 deletions.
12 changes: 6 additions & 6 deletions 01 - Data Exploration.ipynb
@@ -15,7 +15,7 @@
"> **Note**: If you've never used the Jupyter Notebooks environment before, there are a few things you should be aware of:\n",
"> \n",
"> - Notebooks are made up of *cells*. Some cells (like this one) contain *markdown* text, while others (like the one beneath this one) contain code.\n",
"> - The notebook is connected to a Python *kernel* (you can see which one at the top right of the page - if you're running this noptebook in an Azure Machine Learning compute instance it should be connected to the **Python 3.6 - AzureML** kernel). If you stop the kernel or disconnect from the server (for example, by closing and reopening the notebook, or ending and resuming your session), the output from cells that have been run will still be displayed; but any variables or functions defined in those cells will have been lost - you must rerun the cells before running any subsequent cells that depend on them.\n",
"> - The notebook is connected to a Python *kernel* (you can see which one at the top right of the page - if you're running this notebook in an Azure Machine Learning compute instance it should be connected to the **Python 3.6 - AzureML** kernel). If you stop the kernel or disconnect from the server (for example, by closing and reopening the notebook, or ending and resuming your session), the output from cells that have been run will still be displayed; but any variables or functions defined in those cells will have been lost - you must rerun the cells before running any subsequent cells that depend on them.\n",
"> - You can run each code cell by using the **► Run** button. The **◯** symbol next to the kernel name at the top right will briefly turn to **⚫** while the cell runs before turning back to **◯**.\n",
"> - The output from each code cell will be displayed immediately below the cell.\n",
"> - Even though the code cells can be run individually, some variables used in the code are global to the notebook. That means that you should run all of the code cells <u>**in order**</u>. There may be dependencies between code cells, so if you skip a cell, subsequent cells might not run correctly.\n",
@@ -89,7 +89,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that multiplying a list by 2 creates a new list of twice the length with the original sequence of list elements repeated. Multiplying a NumPy array on the other hand performs an element-wise calculation in which the array behaves like a *vector*, so we end up with an array of the same size in which each element has been multipled by 2.\n",
"Note that multiplying a list by 2 creates a new list of twice the length with the original sequence of list elements repeated. Multiplying a NumPy array on the other hand performs an element-wise calculation in which the array behaves like a *vector*, so we end up with an array of the same size in which each element has been multiplied by 2.\n",
"\n",
"The key takeaway from this is that NumPy arrays are specifically designed to support mathematical operations on numeric data - which makes them more useful for data analysis than a generic list.\n",
"\n",
@@ -433,7 +433,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The DataFrame's **read_csv** method is used to load data from text files. As you can see in the example code, you can specify options such as the column delimiter and which row (if any) contains column headers (in this case, the delimter is a comma and the first row contains the column names - these are the default settings, so the parameters could have been omitted).\n",
"The DataFrame's **read_csv** method is used to load data from text files. As you can see in the example code, you can specify options such as the column delimiter and which row (if any) contains column headers (in this case, the delimiter is a comma and the first row contains the column names - these are the default settings, so the parameters could have been omitted).\n",
"\n",
"\n",
"### Handling missing values\n",
@@ -857,7 +857,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The histogram for grades is a symmetric shape, where the most frequently occuring grades tend to be in the middle of the range (around 50), with fewer grades at the extreme ends of the scale.\n",
"The histogram for grades is a symmetric shape, where the most frequently occurring grades tend to be in the middle of the range (around 50), with fewer grades at the extreme ends of the scale.\n",
"\n",
"#### Measures of central tendency\n",
"\n",
@@ -1398,11 +1398,11 @@
"\n",
"> **Warning - Math Ahead!**\n",
">\n",
"> Cast your mind back to when you were learning how to solve linear equations in school, and recall that the *slope-intercept* form of a linear equation lookes like this:\n",
"> Cast your mind back to when you were learning how to solve linear equations in school, and recall that the *slope-intercept* form of a linear equation looks like this:\n",
">\n",
"> \\begin{equation}y = mx + b\\end{equation}\n",
">\n",
"> In this equation, *y* and *x* are the coordinate variables, *m* is the slope of the line, and *b* is the y-intercept (where the line goes through the Y axis).\n",
"> In this equation, *y* and *x* are the coordinate variables, *m* is the slope of the line, and *b* is the y-intercept (where the line goes through the Y-axis).\n",
">\n",
"> In the case of our scatter plot for our student data, we already have our values for *x* (*StudyHours*) and *y* (*Grade*), so we just need to calculate the intercept and slope of the straight line that lies closest to those points. Then we can form a linear equation that calculates a new *y* value on that line for each of our *x* (*StudyHours*) values - to avoid confusion, we'll call this new *y* value *f(x)* (because it's the output from a linear equation ***f***unction based on *x*). The difference between the original *y* (*Grade*) value and the *f(x)* value is the *error* between our regression line and the actual *Grade* achieved by the student. Our goal is to calculate the slope and intercept for a line with the lowest overall error.\n",
">\n",
14 changes: 7 additions & 7 deletions 02 - Regression.ipynb
@@ -14,7 +14,7 @@
"\n",
"$$y = f([x_1, x_2, x_3, ...])$$\n",
"\n",
"The goal of training the model is to find a function that performs some kind of calcuation to the *x* values that produces the result *y*. We do this by applying a machine learning *algorithm* that tries to fit the *x* values to a calculation that produces *y* reasonably accurately for all of the cases in the training dataset.\n",
"The goal of training the model is to find a function that performs some kind of calculation to the *x* values that produces the result *y*. We do this by applying a machine learning *algorithm* that tries to fit the *x* values to a calculation that produces *y* reasonably accurately for all of the cases in the training dataset.\n",
"\n",
"There are lots of machine learning algorithms for supervised learning, and they can be broadly divided into two types:\n",
"\n",
@@ -221,7 +221,7 @@
"\n",
"- **holiday**: There are many fewer days that are holidays than days that aren't.\n",
"- **workingday**: There are more working days than non-working days.\n",
"- **weathersit**: Most days are category *1* (clear), with category *2* (mist and cloud) the next most common. There are comparitively few category *3* (light rain or snow) days, and no category *4* (heavy rain, hail, or fog) days at all.\n",
"- **weathersit**: Most days are category *1* (clear), with category *2* (mist and cloud) the next most common. There are comparatively few category *3* (light rain or snow) days, and no category *4* (heavy rain, hail, or fog) days at all.\n",
"\n",
"Now that we know something about the distribution of the data in our columns, we can start to look for relationships between the features and the **rentals** label we want to be able to predict.\n",
"\n",
@@ -278,7 +278,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The plots show some variance in the relationship between some category values and rentals. For example, there's a clear difference in the distribution of rentals on weekends (**weekday** 0 or 6) and those during the working week (**weekday** 1 to 5). Similarly, there are notable differences for **holiday** and **workingday** categories. There's a noticable trend that shows different rental distributions in summer and fall months compared to spring and winter months. The **weathersit** category also seems to make a difference in rental distribution. The **day** feature we created for the day of the month shows little variation, indicating that it's probably not predictive of the number of rentals."
"The plots show some variance in the relationship between some category values and rentals. For example, there's a clear difference in the distribution of rentals on weekends (**weekday** 0 or 6) and those during the working week (**weekday** 1 to 5). Similarly, there are notable differences for **holiday** and **workingday** categories. There's a noticeable trend that shows different rental distributions in summer and fall months compared to spring and winter months. The **weathersit** category also seems to make a difference in rental distribution. The **day** feature we created for the day of the month shows little variation, indicating that it's probably not predictive of the number of rentals."
]
},
{
@@ -307,7 +307,7 @@
"source": [
"After separating the dataset, we now have numpy arrays named **X** containing the features, and **y** containing the labels.\n",
"\n",
"We *could* train a model using all of the data; but it's common practice in supervised learning to split the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to validate the trained model. This enables us to evaluate how well the model performs when used with the validation dataset by comparing the predicted labels to the known labels. It's important to split the data *randomly* (rather than say, taking the first 70% of the data for training and keeping the rest for validation). This helps ensure that the two subsets of data are statistically comparable (so we validate the model with data that has a similar statistical distibution to the data on which it was trained).\n",
"We *could* train a model using all of the data; but it's common practice in supervised learning to split the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to validate the trained model. This enables us to evaluate how well the model performs when used with the validation dataset by comparing the predicted labels to the known labels. It's important to split the data *randomly* (rather than say, taking the first 70% of the data for training and keeping the rest for validation). This helps ensure that the two subsets of data are statistically comparable (so we validate the model with data that has a similar statistical distribution to the data on which it was trained).\n",
"\n",
"To randomly split the data, we'll use the **train_test_split** function in the **scikit-learn** library. This library is one of the most widely used machine learning packages for Python."
]
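A minimal sketch of the split this cell describes (assuming **X** and **y** are already defined as NumPy arrays; the 70/30 ratio and fixed seed are illustrative choices):

```python
from sklearn.model_selection import train_test_split

# Hold back 30% of the data for validation; random_state makes
# the random split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

print('Training cases:', X_train.shape[0])
print('Validation cases:', X_test.shape[0])
```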
@@ -567,7 +567,7 @@
"\n",
"### Try an Ensemble Algorithm\n",
"\n",
"Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by appying an aggregate function to a collection of base models (sometimes referred to a *bagging*) or by building a sequence of models that build on one another to improve predictive performance (referred to as *boosting*).\n",
"Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to a *bagging*) or by building a sequence of models that build on one another to improve predictive performance (referred to as *boosting*).\n",
"\n",
"For example, let's try a Random Forest model, which applies an averaging function to multiple Decision Tree models for a better overall model."
]
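A sketch of what fitting such a bagging-style ensemble might look like (assuming the train/test split from the earlier sketch; the hyperparameters are illustrative, not the notebook's actual settings):

```python
from sklearn.ensemble import RandomForestRegressor

# A random forest averages the predictions of many decision trees.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict rentals for the held-back validation cases.
predictions = model.predict(X_test)
```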
@@ -722,7 +722,7 @@
"\n",
"We trained a model with data that was loaded straight from a source file, with only moderately successful results.\n",
"\n",
"In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing trasformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n",
"In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n",
"\n",
"### Scaling numeric features\n",
"\n",
@@ -738,7 +738,7 @@
"| -- | --- | --- |\n",
"| 0.3 | 0.48| 0.65|\n",
"\n",
"There are multiple ways you can scale numeric data, such as calculating the minimum and maximum values for each column and assigning a proportional value between 0 and 1, or by using the mean and standard deviation of a normally distributed variable to mainatain the same *spread* of values on a different scale.\n",
"There are multiple ways you can scale numeric data, such as calculating the minimum and maximum values for each column and assigning a proportional value between 0 and 1, or by using the mean and standard deviation of a normally distributed variable to maintain the same *spread* of values on a different scale.\n",
"\n",
"### Encoding categorical variables\n",
"\n",

