From 1537dd372a3e64251923792d9cc911dff12ed85f Mon Sep 17 00:00:00 2001
From: Manish Amde
Date: Sat, 12 Apr 2014 22:32:24 -0700
Subject: [PATCH] added placeholders and some doc

---
 docs/mllib-classification-regression.md | 48 +++++++++++++++++--------
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
index f27d692ae6e18..f73df9ad03f54 100644
--- a/docs/mllib-classification-regression.md
+++ b/docs/mllib-classification-regression.md
@@ -244,38 +244,51 @@ Also, note that `$A_{i:} \in \R^d$` is a row-vector, but the gradient is a colum
 
 Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical variables, extend to the multi-class classification setting, do not require feature scaling and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as decision forest and boosting are among the top performers for classification and regression tasks.
 
-### Mathematical Formulation
+### Basic Algorithm
+
+The decision tree is a greedy algorithm that performs recursive binary partitioning of the feature space, selecting at each node the *split* that maximizes the information gain.
+
+### Node Impurity and Information Gain
+
+The node impurity is a measure of the homogeneity of the labels at the node. The current implementation provides two impurity measures for classification and one impurity measure for regression.
-### Information Gain
+1. Gini index (classification): `$\sum_{i=1}^{C} f_i (1 - f_i)$`, where `$f_i$` is the frequency of label `$i$` at the node and `$C$` is the number of unique labels.
+1. Entropy (classification): `$\sum_{i=1}^{C} -f_i \log(f_i)$`, with `$f_i$` and `$C$` as above.
+1. Variance (regression): `$\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$`, where `$y_i$` is the label of an instance, `$N$` is the number of instances at the node and `$\mu$` is the mean label `$\frac{1}{N} \sum_{i=1}^{N} y_i$`.
-#### Classification
+The information gain is the difference between the parent node impurity and the weighted sum of the two child node impurities. For a split that partitions the data `$D$` of size `$N$` into `$D_{left}$` of size `$N_{left}$` and `$D_{right}$` of size `$N_{right}$`:
-#### Regression
+`$IG(D) = Impurity(D) - \frac{N_{left}}{N} Impurity(D_{left}) - \frac{N_{right}}{N} Impurity(D_{right})$`
 ### Feature Binning
-#### Classfication
+**Continuous Features**
-#### Regression
+**Categorical Features**
-### Implementation
+### Stopping Rule
-#### Code Optimizations
+The recursive tree construction is stopped at a node when the tree has reached the maximum depth specified by the `maxDepth` training parameter.
-#### Experimental Results
+### Experimental Results
+
+### Current Limitations
 ### Training Parameters
-### Upcoming features
+`maxBins`: maximum number of bins used when discretizing continuous features
+
+`maxDepth`: maximum depth of the tree
-#### Multiclass Classification
+`impurity`: impurity measure used to pick the best split (Gini index or entropy for classification, variance for regression)
-#### Decision Forest
+`categoricalFeaturesInfo`: map from the index of each categorical feature to its number of categories
-#### AdaBoost
+`quantileCalculationStrategy`: algorithm used to calculate the quantiles (split candidates) for continuous features
-#### Gradient Boosting
+`algo`: classification or regression
+`strategy`: configuration object that bundles the parameters above and is passed to the training method
 
 ## Implementation in MLlib

@@ -404,6 +417,13 @@ println("training Mean Squared Error = " + MSE)
 
 Similarly you can use RidgeRegressionWithSGD and LassoWithSGD and compare training [Mean Squared Errors](http://en.wikipedia.org/wiki/Mean_squared_error).
 
+## Decision Tree
+
+1. Classification: **TODO: Write code and explain** (a classification sketch appears below)
+2. Classification with Categorical Features: **TODO: Write code and explain** (see the `Strategy` sketch below)
+3. Regression: **TODO: Write code and explain** (a regression sketch appears below)
+4. Regression with Categorical Features: **TODO: Write code and explain** (see the `Strategy` sketch below)
+
 # Usage in Java
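
To make the impurity and information gain formulas above concrete, here is a small self-contained Scala sketch (illustrative code, not part of MLlib; `ImpurityExample` and its helpers are made-up names) that computes the Gini index, the entropy, and the information gain of a candidate split from per-label counts:

```scala
object ImpurityExample {

  // Gini index: sum over labels of f_i * (1 - f_i), where f_i is the
  // frequency of label i among the examples at the node.
  def gini(counts: Array[Double]): Double = {
    val total = counts.sum
    counts.map(_ / total).map(f => f * (1.0 - f)).sum
  }

  // Entropy: sum over labels of -f_i * log(f_i), treating 0 * log(0) as 0.
  def entropy(counts: Array[Double]): Double = {
    val total = counts.sum
    counts.map(_ / total).filter(_ > 0.0).map(f => -f * math.log(f)).sum
  }

  // Information gain of a binary split: parent impurity minus the
  // size-weighted impurities of the left and right children.
  def informationGain(left: Array[Double], right: Array[Double],
                      impurity: Array[Double] => Double): Double = {
    val parent = left.zip(right).map { case (l, r) => l + r }
    val n = parent.sum
    impurity(parent) -
      (left.sum / n) * impurity(left) -
      (right.sum / n) * impurity(right)
  }

  def main(args: Array[String]): Unit = {
    // A candidate split of 100 binary-labeled examples: per-label counts
    // (negatives, positives) that land in each child node.
    val left = Array(35.0, 5.0)
    val right = Array(5.0, 55.0)
    println("Gini gain    = " + informationGain(left, right, gini))
    println("Entropy gain = " + informationGain(left, right, entropy))
  }
}
```

A perfectly pure split (each child containing a single label) maximizes the gain, while a split that reproduces the parent's label distribution in both children yields zero gain.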
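
For the classification and regression items in the `## Decision Tree` section above, a minimal Scala sketch, assuming the `DecisionTree.train(input, algo, impurity, maxDepth)` entry point and the array-based `LabeledPoint` used in the earlier examples; the input path is a placeholder:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo.{Classification, Regression}
import org.apache.spark.mllib.tree.impurity.{Gini, Variance}

// Load and parse the data as in the linear methods examples above
// (placeholder path; label first, comma-separated features after).
val data = sc.textFile("mllib/data/sample_tree_data.csv")
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), parts.tail)
}

// Classification: Gini impurity, trees of depth at most 5.
val maxDepth = 5
val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)

// Evaluate on the training set.
val labelAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
println("Training Error = " + trainErr)

// Regression: the same call with the Regression algo and Variance impurity.
// Training error for regression would be reported as mean squared error.
val regressionModel = DecisionTree.train(parsedData, Regression, Variance, maxDepth)
```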
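
For the categorical-feature items, a sketch built around the `Strategy` object from `### Training Parameters`; the named constructor parameters and the feature arities below are assumptions for illustration:

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo.Classification
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.impurity.Gini

// Assumed arities: feature 0 takes 3 categories, feature 4 takes 8.
val categoricalFeaturesInfo = Map(0 -> 3, 4 -> 8)

val strategy = new Strategy(
  algo = Classification,
  impurity = Gini,
  maxDepth = 5,
  maxBins = 100,
  categoricalFeaturesInfo = categoricalFeaturesInfo)

// parsedData is the RDD[LabeledPoint] from the previous sketch; for the
// regression variant, substitute Regression and Variance in the Strategy.
val model = DecisionTree.train(parsedData, strategy)
```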