diff --git a/AA2324/course/09_decision_trees/09_decision_trees.ipynb b/AA2324/course/09_decision_trees/09_decision_trees.ipynb index 856176d..8231f63 100644 --- a/AA2324/course/09_decision_trees/09_decision_trees.ipynb +++ b/AA2324/course/09_decision_trees/09_decision_trees.ipynb @@ -204,7 +204,7 @@ "source": [ "# This lecture material is taken from\n", "- Information Theory part - (Entropy etc) is taken from __Chapter 1 - Bishop__.\n", - "- Decision Trees are very briefly covered in __Bishop at page 663__.\n", + "- Decision Trees are very briefly covered in __Bishop on page 663__.\n", "- [Cimi Book - Chapter 01](http://ciml.info/dl/v0_99/ciml-v0_99-ch01.pdf)\n", "- [CSC411: Introduction to Machine Learning](https://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/06_trees_handout.pdf)\n", "- [CSC411: Introduction to Machine Learning - Tutorial](https://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/tutorial3.pdf)\n", @@ -299,7 +299,7 @@ } }, "source": [ - "# What is the the training error of $k$-NN? 🤔" + "# What is the training error of $k$-NN? 🤔" ] }, { @@ -310,7 +310,7 @@ } }, "source": [ - "- In $k$-NN there there is no explicit cost/loss, how can we measure the training error? \n" + "- In $k$-NN there is no explicit cost/loss, how can we measure the training error? \n" ] }, { @@ -630,7 +630,7 @@ "source": [ "# When $k=1$ we perfectly classify the training set! 100% accuracy!\n", "\n", - "It is easy to show that this follow by definition **(each point is neighbour to itself).**\n", + "It is easy to show that this follows by definition **(each point is neighbour to itself).**\n", "\n", "but will this hold for $K \\gt 1$?" ] }, { @@ -643,7 +643,7 @@ } }, "source": [ - "# We record the training accuracy in function of increasing $k$" + "# We record the training accuracy as a function of increasing $k$" ] }, { @@ -912,7 +912,7 @@ "source": [ "# Remember to estimate scaling on the training set only!\n", "\n", - "- In theory this is part below is an error.\n", + "- In theory this part below is an error.\n", "- I took the code from sklearn documentation but in practice you have to estimate the scale parameters **ONLY** in the training set.\n", "- Then applying it directly to the test set. \n", "- If you work in inductive settings, you cannot do it jointly like the code above.\n", @@ -1268,7 +1268,7 @@ } }, "source": [ - "# Plot Miclassification function for binary case\n", + "# Plot Misclassification function for binary case\n", "\n", "```python\n", "pk = np.arange(0, 1.1, 0.1)\n", @@ -1772,7 +1772,7 @@ "source": [ "# This lecture material is taken from\n", "- Information Theory part - (Entropy etc) is taken from __Chapter 1 - Bishop__.\n", - "- Decision Trees are very briefly covered in __Bishop at page 663__.\n", + "- Decision Trees are very briefly covered in __Bishop on page 663__.\n", "- [Cimi Book - Chapter 01](http://ciml.info/dl/v0_99/ciml-v0_99-ch01.pdf)\n", "- [CSC411: Introduction to Machine Learning](https://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/06_trees_handout.pdf)\n", "- [CSC411: Introduction to Machine Learning - Tutorial](https://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/tutorial3.pdf)\n", @@ -2010,8 +2010,8 @@ "\n", "\n", " \n", "- **Termination**:\n", - " 1. if no examples – return **majority** from parent (Voting such as in k-NN).\n", - " 2. else if all examples in same class – return the class **(pure node)**.\n", + " 1. if no examples – return **majority** from the parent (Voting such as in k-NN).\n", + " 2. 
else if all examples are in the same class – return the class **(pure node)**.\n", " 3. else we are not in a termination node (keep recursing)\n", " 4. **[Optional]** we could also terminate for some **regularization** parameters" ] @@ -3276,7 +3276,7 @@ "\n", "$$G(Q, \\theta) = \\frac{N^{L}}{N} H(Q^{L}(\\theta)) + \\frac{N^{R}}{N} H(Q^{R}(\\theta))\n", "$$\n", - "Select the parameters that minimises the impurity\n", + "Select the parameters that minimize the impurity\n", "\n", "$$\n", "\\boldsymbol{\\theta}^* = \\operatorname{argmin}_\\boldsymbol{\\theta} G(Q_m, \\theta)\n", @@ -3483,7 +3483,7 @@ "source": [ "# Quick Remedies\n", "\n", - "However, even if we have these **on-hand weapon to avoid overfitting**, it is **still hard to train a single decision tree to perform well generally**. Thus, we will use another useful training technique called **ensemble methods or bagging**, which leads to random-forest." + "However, even if we have these **on-hand weapons to avoid overfitting**, it is **still hard to train a single decision tree to perform well generally**. Thus, we will use another useful training technique called **ensemble methods or bagging**, which leads to random-forest." ] }, { @@ -3725,7 +3725,7 @@ "- $K=\\sqrt{D}$ so it is a fixed hyper-param.\n", "- You have to tune $M$ but in general it needs to be large.\n", "- DT are **very interpretable**; DT/RF could be used for **feature selection**\n", - " - To answer the question: __which feature contribute more to the label?__\n", + " - To answer the question: __which feature contributes more to the label?__\n", "- You can evaluate them **without a validation split** (Out of Bag Generalization - OOB)" ] }, @@ -4052,7 +4052,7 @@ "\n", "[Link to the Microsoft paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/BodyPartRecognition.pdf)\n", "\n", - "_To keep the training times down we employ a distributed implementation. Training 3 trees to depth 20 from 1 million images takes about a day on a 1000 core cluster._" + "_To keep the training times down we employ a distributed implementation. Training 3 trees to depth 20 from 1 million images takes about a day on a 1000-core cluster._" ] }, {
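As a companion to the split-selection formulas quoted above, $G(Q, \theta)$ and $\theta^* = \operatorname{argmin}_\theta G(Q_m, \theta)$, here is a minimal NumPy sketch of how a single node's threshold could be chosen on one feature. The function names are illustrative and not taken from the notebook; scikit-learn's `DecisionTreeClassifier` performs an equivalent (but optimized) search over all features internally.

```python
import numpy as np

def entropy(y):
    """H(Q): entropy (in bits) of the labels reaching a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_impurity(y_left, y_right):
    """G(Q, theta) = N_L/N * H(Q_L(theta)) + N_R/N * H(Q_R(theta))."""
    n_l, n_r = len(y_left), len(y_right)
    n = n_l + n_r
    return (n_l / n) * entropy(y_left) + (n_r / n) * entropy(y_right)

def best_threshold(x, y):
    """theta* = argmin_theta G(Q, theta), scanning the feature's observed
    values as candidate thresholds (x <= theta goes left, x > theta goes right)."""
    best_theta, best_g = None, np.inf
    for theta in np.unique(x):
        left = x <= theta
        if left.all():          # nothing would go right: not a valid split
            continue
        g = split_impurity(y[left], y[~left])
        if g < best_g:
            best_theta, best_g = theta, g
    return best_theta, best_g

# Toy 1-D example: the two classes separate perfectly at x = 3.5
x = np.array([2.0, 3.5, 1.0, 6.0, 7.5, 8.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))     # -> (3.5, 0.0): a pure split, G = 0
```

Swapping `entropy` for a Gini index only changes the impurity term; the argmin structure of the split search stays the same.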