This is an extension of Spark MLlib, implementing logistic regression with dropout regularization.
Dropout regularization usually works better than L2-regularization, as it emphasis the contribution of rarely occurring, but discriminative, features during classification [2]. This makes it well suited for application like NLP, where the data is sparse.
Having said that, it might actually act as a detriment when the data is extremely sparse, as dropping off some of the features in already sparse space might not leave sufficient information for the model to learn at all [4].
This repo is written in Scala with sbt, using Spark 1.3.0.
Use the following to run a simple example.
sbt
run-main dropout.example
To check performnce of NewsGroup-20 dataset (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html), run the following
sbt
run-main dropout.news20
-
Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
-
Wager, Stefan, Sida Wang, and Percy S. Liang. "Dropout training as adaptive regularization." Advances in Neural Information Processing Systems. 2013.
-
http://www-nlp.stanford.edu/~sidaw/home/_media/papers:fastdropout.pdf
-
McMahan, H. Brendan, et al. "Ad click prediction: a view from the trenches." Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.
The repo is not thoroughly tested. Performs might not be as expected. I will add more testing and examples along the way. Any comments or contribution are welcome.