Logistic Regression
This tool is a distributed implementation of logistic regression with (asynchronous) stochastic gradient descent and the FTRL-Proximal algorithm, built on top of Multiverso.
We tested the tool on a Bing Ads click-prediction dataset at Microsoft. The dataset is about 4 TB and contains more than 5 billion samples. The experiment ran on a cluster of 24 machines, each with 20 physical cores and 256 GB of RAM, connected by InfiniBand. One epoch of training finishes in about 18 minutes.
Follow the build guide to download and install the tool first.
For single-machine training, run
LogisticRegression config_file_path
Here is a simple example of training the MNIST dataset on a single machine without the parameter server.
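A minimal config for such a run might look like the following; the paths and hyper-parameter values are illustrative assumptions (MNIST digits have 784 dense features and 10 classes), not the example file shipped with the tool:
input_size=784
output_size=10
objective_type=softmax
updater_type=sgd
train_epoch=10
sparse=false
use_ps=false
minibatch_size=20
learning_rate=0.1
train_file=D:/mnist/train.data
test_file=D:/mnist/test.data
output_file=D:/mnist/result.out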
To run in a distributed environment, run with MPI
mpirun -m $machine_file LogisticRegression config_file_path
The config file specifies the training settings. It is a text file in which each line is formatted as `key=value`. Below is a simple example showing the format of the config file. Suppose we are going to train a linear model on 100000-dimensional features with FTRL:
input_size=100000
output_size=2
objective_type=ftrl
train_epoch=1
sparse=true
use_ps=true
pipeline=true
minibatch_size=20
sync_frequency=5
train_file=D:/ftrl/part-1;D:/ftrl/part-2
test_file=D:/ftrl/test.data
reader_type=bsparse
output_file=D:/LogReg/ftrl.out
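Save these settings to a file and pass its path on the command line as shown above; for example, assuming the config above is saved as ftrl.config (an illustrative file name):
LogisticRegression ftrl.config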
- `regular_type`, the regularization type; by default no regularization is used. Can also be [L1 / L2].
- `objective_type`, [default / sigmoid / softmax / ftrl].
- `updater_type`, used when the parameter server is not used. [default / sgd / ftrl].
- `input_size`, the dimension of the features; used when training dense data.
- `output_size`, the dimension of the output.
- `sparse`, [true] for sparse data, [false] for dense data.
- `train_epoch`, the number of training epochs.
- `minibatch_size`, LogReg uses mini-batch SGD for optimization; this is the mini-batch size.
- `use_ps`, whether to use the parameter server. [true] will use the DMTK framework.
- `learning_rate`, the initial learning rate for the sgd updater.
- `learning_rate_coef`, the learning rate is updated as `max(1e-3, initial - (update_count - learning_rate_coef * minibatch_size))` (see the sketch after this list).
- `regular_coef`, the coefficient of the regularization term.
- `pipeline`, whether to pipeline computation and communication.
- `sync_frequency`, if pipelining is not used, the worker pulls the model after every sync_frequency mini-batches.
- `alpha`, `beta`, `lambda1`, `lambda2`, hyper-parameters of the FTRL-Proximal updater: alpha and beta control the per-coordinate learning rate, while lambda1 and lambda2 set the L1 and L2 regularization strengths (see the FTRL-Proximal sketch below).
- `init_model_file`, when provided, the model is loaded from this file at initialization.
- `output_model_file`, the path to save the binary model.
- `train_file`, the training data.
- `test_file`, the testing data; the tool prints the test error after each epoch if test_file is provided.
- `output_file`, the path to save the test results.
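To make `minibatch_size`, `learning_rate`, and `learning_rate_coef` concrete, here is a minimal sketch of one mini-batch SGD step for binary logistic regression. The decay schedule transcribes the option description above literally; the function names and structure are illustrative, not the tool's internal code:

```python
import numpy as np

def decayed_learning_rate(initial_lr, update_count, learning_rate_coef, minibatch_size):
    # Literal transcription of the schedule in the option list above;
    # the actual implementation may differ. The rate is floored at 1e-3.
    return max(1e-3, initial_lr - (update_count - learning_rate_coef * minibatch_size))

def sgd_minibatch_step(w, X, y, initial_lr, update_count, learning_rate_coef):
    # One mini-batch SGD step for binary logistic regression.
    # X: (minibatch_size, input_size) dense features; y: labels in {0, 1}.
    minibatch_size = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-X @ w))         # predicted probabilities
    grad = X.T @ (p - y) / minibatch_size    # average log-loss gradient
    lr = decayed_learning_rate(initial_lr, update_count, learning_rate_coef, minibatch_size)
    return w - lr * grad
```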
`train_file` and `test_file` can use semicolons to separate multiple files.
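`alpha`, `beta`, `lambda1`, and `lambda2` are the standard FTRL-Proximal hyper-parameters (McMahan et al., 2013). As a reference for what they control, here is a compact per-coordinate FTRL-Proximal sketch for binary logistic regression; it illustrates the published algorithm, not this tool's actual implementation:

```python
import math

class FtrlProximal:
    """Per-coordinate FTRL-Proximal for binary logistic regression (sketch)."""

    def __init__(self, dim, alpha=0.1, beta=1.0, lambda1=1.0, lambda2=1.0):
        self.alpha, self.beta = alpha, beta
        self.lambda1, self.lambda2 = lambda1, lambda2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        # Closed-form proximal step: coordinates with |z_i| <= lambda1
        # are shrunk exactly to zero, which keeps the model sparse.
        if abs(self.z[i]) <= self.lambda1:
            return 0.0
        eta_inv = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.lambda2
        return -(self.z[i] - math.copysign(self.lambda1, self.z[i])) / eta_inv

    def update(self, features, label):
        # features: {index: value} sparse sample; label in {0, 1}.
        p = 1.0 / (1.0 + math.exp(-sum(v * self.weight(i) for i, v in features.items())))
        for i, v in features.items():
            g = (p - label) * v  # gradient of the log-loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self.weight(i)
            self.n[i] += g * g
```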
Input files can be in different formats. Use `reader_type` to specify the reader type:
- `default`, for text files. Each line is formatted as:
# for sparse data use `libsvm` data format,
label key:value key:value ...
# for dense data,
label value value value ...
- `weight`, for text files. Some datasets have a weight (double type) for each sample; each line is formatted as:
# for sparse data use `libsvm` data format,
label:weight key:value key:value ...
# for dense data,
label:weight value value value ...
- `bsparse`, for binary files, only for sparse data. Each sample is laid out as follows (see the sketch below):
count(size_t)label(int)weight(double)key(size_t)key(size_t)...
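To illustrate the bsparse record layout, here is a small sketch that writes and reads one sample with Python's struct module. The byte widths (8-byte size_t, 4-byte int, 8-byte double, little-endian) are platform assumptions, not documented guarantees of the format:

```python
import struct

def write_bsparse_sample(f, label, weight, keys):
    # Layout: count(size_t) label(int) weight(double) key(size_t)...
    f.write(struct.pack("<Q", len(keys)))            # count: 8-byte size_t (assumed)
    f.write(struct.pack("<i", label))                # label: 4-byte int
    f.write(struct.pack("<d", weight))               # weight: double
    f.write(struct.pack("<%dQ" % len(keys), *keys))  # feature keys

def read_bsparse_sample(f):
    header = f.read(8)
    if len(header) < 8:
        return None  # end of file
    (count,) = struct.unpack("<Q", header)
    (label,) = struct.unpack("<i", f.read(4))
    (weight,) = struct.unpack("<d", f.read(8))
    keys = struct.unpack("<%dQ" % count, f.read(8 * count))
    return label, weight, list(keys)
```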
- `read_buffer_size`, used by the reader to preload data. It should be larger than minibatch_size * sync_frequency.
- `show_time_per_sample`, print timing statistics, including computation and communication, after every show_time_per_sample samples are processed.