Commit 022485a: more documentation

manishamde committed Apr 14, 2014 (1 parent: 865826e)

Showing 1 changed file: docs/mllib-classification-regression.md (12 additions, 5 deletions)
@@ -264,7 +264,7 @@ The *node impurity* is a measure of the homogeneity of the labels at the node.
<td>Entropy</td><td>Classification</td><td>$\sum_{i=1}^{M} -f_i \log(f_i)$</td><td>$f_i$ is the frequency of label $i$ at a node and $M$ is the number of unique labels.</td>
</tr>
<tr>
<td>Variance</td><td>Regression</td><td>$\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$</td><td>$y_i$ is the label for an instance, $N$ is the number of instances and $\mu$ is the mean given by $\frac{1}{N} \sum_{i=1}^{N} y_i$.</td>
</tr>
</tbody>
</table>
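
As a hedged illustration (plain Scala helpers, not part of the MLlib API; the function names and signatures are invented for this sketch), the three measures could be computed from per-node label statistics as follows:

{% highlight scala %}
// Gini impurity: 1 - sum_i f_i^2, from per-class counts at a node.
def gini(counts: Seq[Double]): Double = {
  val n = counts.sum
  1.0 - counts.map { c => val f = c / n; f * f }.sum
}

// Entropy: sum_i -f_i * log(f_i), from per-class counts at a node.
def entropy(counts: Seq[Double]): Double = {
  val n = counts.sum
  counts.filter(_ > 0).map { c => val f = c / n; -f * math.log(f) }.sum
}

// Variance: (1/N) * sum_i (y_i - mu)^2, from the labels y_i at a node.
def variance(labels: Seq[Double]): Double = {
  val mu = labels.sum / labels.size
  labels.map(y => (y - mu) * (y - mu)).sum / labels.size
}
{% endhighlight %}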
@@ -296,7 +296,7 @@ The recursive tree construction is stopped at a node when one of the two conditions is met:

### Practical Limitations

The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)* in memory for aggregating histograms over partitions. The current implementation might not scale to very deep trees since the memory requirement grows exponentially with tree depth.
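
As a concrete (hypothetical) illustration: with 100 features, 32 candidate splits per feature, and a maximum depth of 10, the aggregate array holds about *100 \* 32 \* 2^10 ≈ 3.3 million* doubles, i.e. roughly 26 MB, and each additional level of depth doubles that figure.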

Please drop us a line if you encounter any issues. We plan to solve this problem in the near future, and real-world examples would be greatly appreciated.

@@ -435,6 +435,8 @@ Similarly you can use RidgeRegressionWithSGD and LassoWithSGD and compare training

#### Classification

The example below demonstrates how to load a CSV file, parse it as an RDD of LabeledPoint, and then perform classification with a decision tree, using the Gini index as the impurity measure and a maximum tree depth of 5. The training error is calculated to measure the algorithm's accuracy.

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.DecisionTree
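// [The diff view collapses the unchanged middle of this example. The lines
// below are a hedged sketch of how it plausibly continues, assuming a
// Spark 1.0-style DecisionTree.train(input, algo, impurity, maxDepth) API,
// a SparkContext available as sc, and a hypothetical CSV file whose first
// column is the label.]
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Gini

// Load and parse the data file.
val data = sc.textFile("mllib/data/sample_tree_data.csv")
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

// Train a decision tree classifier with Gini impurity and maximum depth 5.
val maxDepth = 5
val model = DecisionTree.train(parsedData, Algo.Classification, Gini, maxDepth)

// Evaluate on the training set: the fraction of misclassified examples.
val labelAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val trainErr = labelAndPreds.filter { case (l, p) => l != p }.count.toDouble / parsedData.count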
@@ -465,6 +467,9 @@
println("Training Error = " + trainErr)
{% endhighlight %}

#### Regression

The example below demonstrates how to load a CSV file, parse it as an RDD of LabeledPoint, and then perform regression with a decision tree, using variance as the impurity measure and a maximum tree depth of 5. The Mean Squared Error is computed at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.DecisionTree
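// [The diff view collapses the rest of this example. The lines below are a
// hedged sketch of how it plausibly continues, assuming a Spark 1.0-style
// DecisionTree.train(input, algo, impurity, maxDepth) API, a SparkContext
// available as sc, and a hypothetical CSV file whose first column is the
// label.]
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Variance

// Load and parse the data file.
val data = sc.textFile("mllib/data/sample_tree_data.csv")
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

// Train a regression tree with variance impurity and maximum depth 5.
val maxDepth = 5
val model = DecisionTree.train(parsedData, Algo.Regression, Variance, maxDepth)

// Evaluate on the training set and compute the Mean Squared Error.
val valuesAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("Mean Squared Error = " + MSE)
{% endhighlight %}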
@@ -505,7 +510,9 @@ calling `.rdd()` on your `JavaRDD` object.

The following examples can be tested in the PySpark shell.

## Linear Methods

### Binary Classification
The following example shows how to load a sample dataset, build a Logistic Regression model,
and make predictions with the resulting model to compute the training error.
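
The head of this code block is collapsed in the diff view; the lines below are a hedged sketch of how it plausibly begins, assuming the pre-1.0 pyspark.mllib API (the label stored as the first element of each feature array) and a hypothetical data path:

{% highlight python %}
from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array

# Load and parse the data: the label is the first value on each line.
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model.
model = LogisticRegressionWithSGD.train(parsedData)

# Evaluate the model on training data.
labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
        model.predict(point.take(range(1, point.size)))))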

@@ -527,7 +534,7 @@
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
{% endhighlight %}

### Linear Regression
The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint.
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
values. We compute the Mean Squared Error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
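
The head of this code block is also collapsed in the diff view; the lines below are a hedged sketch of how it plausibly begins, again assuming the pre-1.0 pyspark.mllib API and a hypothetical data path:

{% highlight python %}
from pyspark.mllib.regression import LinearRegressionWithSGD
from numpy import array

# Load and parse the data: the label is the first value on each line.
data = sc.textFile("mllib/data/ridge-data/lpsa.data")
parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))

# Build the model.
model = LinearRegressionWithSGD.train(parsedData)

# Evaluate the model on training data.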
@@ -549,4 +556,4 @@
valuesAndPreds = parsedData.map(lambda point: (point.item(0),
        model.predict(point.take(range(1, point.size)))))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y)/valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
{% endhighlight %}
