
added placeholders and some doc
manishamde committed Apr 13, 2014
1 parent d06511d commit 1537dd3
Showing 1 changed file with 34 additions and 14 deletions.
48 changes: 34 additions & 14 deletions docs/mllib-classification-regression.md
@@ -244,38 +244,51 @@ Also, note that `$A_{i:} \in \R^d$` is a row-vector, but the gradient is a column vector

Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical variables, extend to the multi-class classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as decision forests and boosting are among the top performers for classification and regression tasks.

### Basic Algorithm

The decision tree is a greedy algorithm that performs a recursive binary partitioning of the feature space by finding the best *split* that maximizes the information gain at each node.

### Node Impurity and Information Gain

The node impurity is a measure of the homogeneity of the labels at the node. The current implementation provides two impurity measures for classification and one impurity measure for regression.

1. Gini index (classification): `$\sum_{i=1}^{M} f_i (1 - f_i)$`, where `$f_i$` is the frequency of label `$i$` at the node and `$M$` is the number of unique labels.
1. Entropy (classification): `$\sum_{i=1}^{M} - f_i \log(f_i)$`, with `$f_i$` and `$M$` as above.
1. Variance (regression): `$\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$`, where `$y_i$` is the label of an instance, `$N$` is the number of instances at the node, and `$\mu = \frac{1}{N} \sum_{i=1}^{N} y_i$` is the mean label.
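For concreteness, the three impurity measures above can be sketched as plain helper methods; this is a hypothetical illustration, not the MLlib implementation:

```java
import java.util.Arrays;

public class Impurity {
    // Gini index: sum_i f_i * (1 - f_i) over the label frequencies f_i at the node.
    static double gini(double[] labelFrequencies) {
        double g = 0.0;
        for (double f : labelFrequencies) g += f * (1.0 - f);
        return g;
    }

    // Entropy: sum_i -f_i * log(f_i), treating 0 * log(0) as 0.
    static double entropy(double[] labelFrequencies) {
        double e = 0.0;
        for (double f : labelFrequencies) if (f > 0.0) e -= f * Math.log(f);
        return e;
    }

    // Variance: mean squared deviation of the labels from their mean.
    static double variance(double[] labels) {
        double mu = Arrays.stream(labels).average().orElse(0.0);
        return Arrays.stream(labels).map(y -> (y - mu) * (y - mu)).average().orElse(0.0);
    }

    public static void main(String[] args) {
        System.out.println(gini(new double[]{0.5, 0.5}));     // 0.5 for a 50/50 node
        System.out.println(entropy(new double[]{1.0, 0.0}));  // 0.0 for a pure node
        System.out.println(variance(new double[]{1.0, 3.0})); // 1.0
    }
}
```

Note that the classification measures take label *frequencies* (which sum to 1), while variance takes the raw labels.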

The information gain is the difference between the parent node impurity and the weighted sum of the two child node impurities.
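In symbols, if a split `$s$` partitions a dataset `$D$` of size `$N$` into `$D_{left}$` (of size `$N_{left}$`) and `$D_{right}$` (of size `$N_{right}$`), the information gain is

`$IG(D, s) = Impurity(D) - \frac{N_{left}}{N} Impurity(D_{left}) - \frac{N_{right}}{N} Impurity(D_{right})$`

where `$Impurity(\cdot)$` is any of the impurity measures listed above.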


### Feature Binning

**Continuous Features**
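One common way to bin a continuous feature (used by many tree learners) is to sort its values and take approximate quantile boundaries as candidate split thresholds, so that at most `maxBins - 1` thresholds are considered per feature. A minimal sketch of this idea, with hypothetical helper names rather than MLlib code:

```java
import java.util.Arrays;

public class ContinuousBins {
    // Return up to maxBins - 1 candidate split thresholds taken at quantile
    // boundaries of the sorted feature values.
    static double[] candidateThresholds(double[] values, int maxBins) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int numThresholds = Math.min(maxBins - 1, sorted.length - 1);
        double[] thresholds = new double[numThresholds];
        for (int t = 0; t < numThresholds; t++) {
            // Index of the (t+1)-th quantile boundary in the sorted values.
            int idx = (int) ((long) (t + 1) * sorted.length / (numThresholds + 1));
            thresholds[t] = sorted[Math.min(idx, sorted.length - 1)];
        }
        return thresholds;
    }

    public static void main(String[] args) {
        double[] feature = {5.0, 1.0, 4.0, 2.0, 3.0, 6.0, 8.0, 7.0};
        // Quartile boundaries of 1..8 with maxBins = 4.
        System.out.println(Arrays.toString(candidateThresholds(feature, 4))); // [3.0, 5.0, 7.0]
    }
}
```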

**Categorical Features**
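For a categorical feature with `$M$` categories there are `$2^{M-1} - 1$` possible subset splits. A classical trick for binary classification is to order the categories by their average label; it then suffices to consider the `$M - 1$` prefix splits of this ordering. A hypothetical sketch of the ordering step (not MLlib code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CategoricalSplits {
    // Order category ids by the mean of their labels; the M - 1 candidate
    // splits are then the prefixes of this ordering.
    static List<Integer> orderByAverageLabel(int[] categories, double[] labels) {
        Map<Integer, double[]> sumCount = new HashMap<>(); // id -> {label sum, count}
        for (int i = 0; i < categories.length; i++) {
            double[] sc = sumCount.computeIfAbsent(categories[i], k -> new double[2]);
            sc[0] += labels[i];
            sc[1] += 1.0;
        }
        List<Integer> ids = new ArrayList<>(sumCount.keySet());
        ids.sort(Comparator.comparingDouble(id -> sumCount.get(id)[0] / sumCount.get(id)[1]));
        return ids;
    }

    public static void main(String[] args) {
        int[] categories = {0, 0, 1, 1, 2, 2};
        double[] labels  = {1.0, 1.0, 0.0, 0.0, 1.0, 0.0};
        // Mean labels: category 0 -> 1.0, category 1 -> 0.0, category 2 -> 0.5.
        System.out.println(orderByAverageLabel(categories, labels)); // [1, 2, 0]
    }
}
```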

### Stopping Rule

The recursive tree construction is stopped at a node when one of the following conditions is met:

1. The node depth is equal to the `maxDepth` training parameter.
1. No split candidate leads to an information gain at the node.

### Experimental Results

### Current Limitations

### Training Parameters

`maxBins`: maximum number of bins used when discretizing continuous features, which bounds the number of split candidates considered per feature.

`maxDepth`: maximum depth of the tree; deeper trees are more expressive but are more expensive to train and more prone to overfitting.

`impurity`: impurity measure used to choose between split candidates (Gini index or entropy for classification, variance for regression).

`categoricalFeaturesInfo`: map from the index of a categorical feature to its number of categories (arity); features not in the map are treated as continuous.

`quantileCalculationStrategy`: strategy used to compute the quantiles (bin boundaries) of continuous features.

`algo`: the learning task, classification or regression.

`strategy`: the configuration object that aggregates the parameters above and is passed to the tree-training routine.


## Implementation in MLlib
@@ -404,6 +417,13 @@ println("training Mean Squared Error = " + MSE)
Similarly, you can use `RidgeRegressionWithSGD` and `LassoWithSGD` and compare the training
[Mean Squared Errors](http://en.wikipedia.org/wiki/Mean_squared_error).

## Decision Tree

1. Classification: **TODO Write code and explain**
2. Classification with Categorical Features: **TODO Write code and explain**
3. Regression: **TODO Write code and explain**
4. Regression with Categorical Features: **TODO Write code and explain**


# Usage in Java

