The aim of this project is to implement your own decision tree and then apply it to a dataset.
About the Iris dataset:
The dataset contains a set of four-dimensional feature vectors, each labelled with one of three classes:
- $0$: Iris Setosa
- $1$: Iris Versicolour
- $2$: Iris Virginica
The features correspond to different attributes that are shared across these flowers:
- $x_{i1}$ is sepal length in centimeters
- $x_{i2}$ is sepal width in centimeters
- $x_{i3}$ is petal length in centimeters
- $x_{i4}$ is petal width in centimeters
The complete dataset contains 150 samples of flowers, 50 of each type.
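We can calculate the impurity of a dataset from the prior probability of each class.
So let's start by calculating an estimate of the prior of class $c$ as the fraction of the $N$ targets $t_i$ that belong to it,
$$P(c) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[t_i = c],$$
where the indicator function $\mathbb{1}[t_i = c]$ is 1 if $t_i = c$ and 0 otherwise.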
Write the function `prior(targets, classes)` that calculates the prior probability of each class type given a list of all targets and all class types.
Example inputs and outputs:
prior([0, 0, 1], [0, 1])
-> [2/3, 1/3]
prior([0, 2, 3, 3], [0, 1, 2, 3])
-> [1/4, 0, 1/4, 2/4]
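The relative-frequency estimate can be sketched as follows. This is a minimal sketch, not a reference solution: it uses plain Python and `collections.Counter`, and the handling of an empty target list (all priors set to 0) is an assumption, since the text does not specify it.

```python
from collections import Counter


def prior(targets, classes):
    """Estimate the prior of each class as its relative frequency in targets."""
    targets = list(targets)
    n = len(targets)
    if n == 0:
        # Assumption: with no samples, report probability 0 for every class.
        return [0.0 for _ in classes]
    counts = Counter(targets)  # missing classes count as 0
    return [counts[c] / n for c in classes]


print(prior([0, 0, 1], [0, 1]))
print(prior([0, 2, 3, 3], [0, 1, 2, 3]))
```

A class that never appears in `targets` simply gets a prior of 0, which matches the second example above.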
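Let's assume that we have made a split and created two datasets for the descendant nodes of the root node. For the sake of argument, let's say that we split on feature dimension $k$ with threshold $\theta$, so that samples with $x_{ik} < \theta$ go to the first dataset and all remaining samples go to the second.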
Write a function `split_data(features, targets, split_feature_index, theta)` that returns two tuples: `(features_1, targets_1), (features_2, targets_2)` as explained above. Here, `split_feature_index` corresponds to $k$.
Note: We will be using the Iris dataset from now on and the `tools.load_iris()` function. You can take a look at `examples.using_iris()` for more information.
Example inputs and outputs:
features, targets, classes = load_iris()
(f_1, t_1), (f_2, t_2) = split_data(features, targets, 2, 4.65)
`f_1` should contain 90 samples and `f_2` should contain 60 samples.
Note: In further examples in section 1, we will use this dataset split, i.e. `split_feature_index=2` and `theta=4.65`.
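A threshold split of this kind might look like the sketch below. It assumes that samples whose selected feature is strictly below `theta` go to the first branch and all others to the second. It works on any sequence of rows and returns plain lists; if your features are NumPy arrays, you would probably prefer boolean masks instead. The three rows are a toy dataset used only for illustration.

```python
def split_data(features, targets, split_feature_index, theta):
    """Split samples into two branches by comparing one feature to theta."""
    features_1, targets_1 = [], []
    features_2, targets_2 = [], []
    for x, t in zip(features, targets):
        if x[split_feature_index] < theta:
            # Assumption: strictly smaller values go to the first branch.
            features_1.append(x)
            targets_1.append(t)
        else:
            features_2.append(x)
            targets_2.append(t)
    return (features_1, targets_1), (features_2, targets_2)


# Toy data: three samples with four features each.
toy_features = [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4], [6.3, 3.3, 6.0, 2.5]]
toy_targets = [0, 1, 2]
(f_a, t_a), (f_b, t_b) = split_data(toy_features, toy_targets, 2, 4.65)
print(len(f_a), len(f_b))
```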
Now we can calculate entropy, Gini or misclassification impurity. Let's go with Gini impurity.
Write a function `gini_impurity(targets, classes)` that calculates the Gini impurity of a single branch.
Example inputs and outputs:
gini_impurity(t_1, classes)
-> 0.2517283950617284
gini_impurity(t_2, classes)
-> 0.1497222222222222
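The impurity formula is not shown in the text, so the following is a sketch under a stated assumption: it computes $\frac{1}{2}\left(1 - \sum_c P(c)^2\right)$, where the factor $\frac{1}{2}$ is inferred from the example outputs above (the more common definition of Gini impurity omits it). The class counts 50, 39 and 1 in the usage lines are also an assumption, chosen because they give approximately the first example value.

```python
from collections import Counter


def gini_impurity(targets, classes):
    """Gini impurity of one branch, scaled by 1/2 (assumed convention)."""
    targets = list(targets)
    n = len(targets)
    if n == 0:
        # Assumption: an empty branch is treated as pure.
        return 0.0
    counts = Counter(targets)
    sum_of_squares = sum((counts[c] / n) ** 2 for c in classes)
    return 0.5 * (1.0 - sum_of_squares)


# Assumed branch composition: 50, 39 and 1 samples of classes 0, 1 and 2.
branch = [0] * 50 + [1] * 39 + [2] * 1
print(gini_impurity(branch, [0, 1, 2]))  # approximately 0.25173
```

A branch that contains a single class has impurity 0, and the value grows as the classes become more evenly mixed.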
Using this formulation we can calculate the impurity of each branch, but the question remains: what is the overall impurity of the split?
We could simply take the average of the two branch impurities $i(S_1)$ and $i(S_2)$: $\frac{1}{2}\left(i(S_1) + i(S_2)\right)$.
A better overall impurity measure weights each descendant node's impurity by the number of data points that belong to that branch, $n_1$ and $n_2$: $\frac{n_1\, i(S_1) + n_2\, i(S_2)}{n_1 + n_2}$.
This is the value that we want to minimize when we make a split in the tree.
Write a function `weighted_impurity(t1, t2, classes)` where `t1` are the targets that belong to the first branch and `t2` are the targets that belong to the second.
Example inputs and outputs:
weighted_impurity(t_1, t_2, classes)
-> 0.2109259259259259
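Weighting each branch by its size can be sketched as below. The `gini_impurity` helper is repeated so the block runs on its own, and it keeps the assumed factor of $\frac{1}{2}$ that reproduces the example outputs. The branch compositions in the usage lines (50/39/1 and 0/11/49) are assumptions chosen because they give approximately the example values.

```python
from collections import Counter


def gini_impurity(targets, classes):
    """Gini impurity of one branch, scaled by 1/2 (assumed convention)."""
    n = len(targets)
    if n == 0:
        return 0.0
    counts = Counter(targets)
    return 0.5 * (1.0 - sum((counts[c] / n) ** 2 for c in classes))


def weighted_impurity(t1, t2, classes):
    """Average the two branch impurities, weighted by branch size."""
    n1, n2 = len(t1), len(t2)
    if n1 + n2 == 0:
        return 0.0  # assumption: no samples means no impurity
    g1 = gini_impurity(t1, classes)
    g2 = gini_impurity(t2, classes)
    return (n1 * g1 + n2 * g2) / (n1 + n2)


first = [0] * 50 + [1] * 39 + [2] * 1   # assumed composition of branch 1
second = [1] * 11 + [2] * 49            # assumed composition of branch 2
print(weighted_impurity(first, second, [0, 1, 2]))  # approximately 0.21093
```

Compared with the plain average, this version stops a tiny but pure branch from making a poor split look good.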
Write a function that calculates the weighted impurity of a split for a given dataset. It should have the form `total_gini_impurity(features, targets, classes, split_feature_index, theta)` and return the weighted impurity given the dataset and the threshold to split on.
This function should use your `split_data` from earlier and `weighted_impurity`.
Example inputs and outputs:
total_gini_impurity(features, targets, classes, 2, 4.65)
Output: 0.2109259259259259
The best threshold can now be found by searching through all dimensions. Brute force search can be used to find the best dimension and the threshold value because we have an objective function and a very simple dataset.
Create a function `brute_best_split(features, targets, classes, num_tries)` where `num_tries` corresponds to how many different thresholds to try for each feature dimension. This function should return, in the order shown in the example below:
- The lowest Gini impurity value found
- The dimension where that impurity is found
- The threshold value that gives that impurity
Example inputs and outputs:
brute_best_split(features, targets, classes, 30)
Output: (0.16666666666666666, 2, 1.9516129032258065)
To determine what interval of values to test, take a look at `example.exclusive_interval`.
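The brute-force search could be sketched as follows, in plain Python. Several parts are assumptions: `exclusive_interval` here is a hypothetical stand-in for `example.exclusive_interval` (not shown in this text) and is assumed to return `num_tries` evenly spaced thresholds strictly between a feature's minimum and maximum. The Gini impurity keeps the factor of $\frac{1}{2}$ inferred from the example outputs. Ties are resolved in favour of the first candidate found.

```python
from collections import Counter


def gini_impurity(targets, classes):
    """Gini impurity of one branch, scaled by 1/2 (assumed convention)."""
    n = len(targets)
    if n == 0:
        return 0.0
    counts = Counter(targets)
    return 0.5 * (1.0 - sum((counts[c] / n) ** 2 for c in classes))


def total_gini_impurity(features, targets, classes, split_feature_index, theta):
    """Size-weighted impurity of splitting one feature at theta."""
    t1 = [t for x, t in zip(features, targets) if x[split_feature_index] < theta]
    t2 = [t for x, t in zip(features, targets) if x[split_feature_index] >= theta]
    g1, g2 = gini_impurity(t1, classes), gini_impurity(t2, classes)
    return (len(t1) * g1 + len(t2) * g2) / len(targets)


def exclusive_interval(values, num_tries):
    """Hypothetical helper: num_tries points strictly inside (min, max)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (num_tries + 1)
    return [lo + step * j for j in range(1, num_tries + 1)]


def brute_best_split(features, targets, classes, num_tries):
    """Try num_tries thresholds per feature; return (gini, dimension, theta)."""
    best_gini, best_dim, best_theta = float("inf"), None, None
    for dim in range(len(features[0])):
        column = [x[dim] for x in features]
        for theta in exclusive_interval(column, num_tries):
            gini = total_gini_impurity(features, targets, classes, dim, theta)
            if gini < best_gini:  # strict: keep the first of equal candidates
                best_gini, best_dim, best_theta = gini, dim, theta
    return best_gini, best_dim, best_theta


# Toy data: feature 0 separates the two classes, feature 1 does not.
toy_features = [[1.0, 5.0], [1.2, 1.0], [3.0, 5.2], [3.2, 1.1]]
toy_targets = [0, 0, 1, 1]
print(brute_best_split(toy_features, toy_targets, [0, 1], 3))
```

The cost is `num_tries` impurity evaluations per feature, which is fine for a dataset of this size but would be wasteful on larger ones.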
We have everything we need to implement the standard ID3 algorithm for growing decision trees, but we will use scikit-learn instead to continue.
The scikit-learn package contains utilities to train and analyse decision trees. The aim here is to set up a decision tree for the Iris dataset classification problem and to see a few variations on how to deploy decision trees.
We will create a class `IrisTreeTrainer` that wraps a decision tree from `sklearn.tree`. The class should implement the following methods:
class IrisTreeTrainer:
    def __init__(self, ...):
        ...
    def train(self, ...):
        ...
    def accuracy(self, ...):
        ...
    def guess(self, ...):
        ...
    def plot(self, ...):
        ...
    def confusion_matrix(self, ...):
        ...
An example usage of this class would be:
features, targets, classes = load_iris()
dt = IrisTreeTrainer(features, targets, classes=classes)
dt.train()
print(f'The accuracy is: {dt.accuracy()}')
dt.plot()
print(f'I guessed: {dt.guess()}')
print(f'The true targets are: {dt.test_targets}')
print(dt.confusion_matrix())
The `__init__` method is supplied to you in the template. See `example.using_classes()` for more information on Python classes.
Let's now work through the methods one at a time.
Implement `IrisTreeTrainer.train(self)` which should fit `self.tree` to the training data. To fit a `sklearn.tree` classifier to some data (`features`, `targets`), we do `tree.fit(features, targets)`.
Implement `IrisTreeTrainer.accuracy(self)` which returns the accuracy of the decision tree on the test data.
Implement `IrisTreeTrainer.plot(self)` that uses `sklearn.tree.plot_tree` and `plt.show()` to display your decision tree.
Turn in the plot you get as `2_3_1.png`.
Implement `IrisTreeTrainer.guess(self)` that returns predictions on the test data. To predict using a `sklearn.tree` classifier and some data `features`, we do `tree.predict(features)`.
Implement `IrisTreeTrainer.confusion_matrix(self)` that returns the confusion matrix on the test data. You should implement this metric yourself!
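A hand-written confusion matrix can be sketched as a standalone function. The free-function form and its name are hypothetical (the assignment asks for a method on the class), and the layout is an assumption: rows index the true class and columns the predicted class. Swap the two indices if your course material uses the opposite convention.

```python
def confusion_matrix(predictions, targets, classes):
    """Count (true class, predicted class) pairs in a square matrix."""
    index = {c: i for i, c in enumerate(classes)}
    size = len(classes)
    matrix = [[0] * size for _ in range(size)]
    for predicted, actual in zip(predictions, targets):
        # Assumption: row = true class, column = predicted class.
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Toy example: one sample of class 2 is misclassified as class 1.
for row in confusion_matrix([0, 1, 1, 2], [0, 1, 2, 2], [0, 1, 2]):
    print(row)
```

The diagonal holds the correctly classified samples, so the accuracy is the sum of the diagonal divided by the number of test samples.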
Note: This is a pre-formulated independent question. In future assignments you will be asked to demonstrate your capability to add relevant insight to your assignment. The work suggested below is exemplary of the type of insight you might add in future assignments. To get full marks on this assignment you must complete this independent part.
Add a method `IrisTreeTrainer.plot_progress()` to the class that plots the accuracy on the test set as a function of the number of training samples. You should start by training on only one sample and end by training on all the training samples.
Turn in the plot you get as bonus_1.png
Running the following code should produce a graph similar to the one below:
features, targets, classes = load_iris()
dt = IrisTreeTrainer(features, targets, classes=classes, train_ratio=0.6)
dt.plot_progress()