
Tutorial


A brief and straightforward tutorial on running icsiboost

Download and build the latest version

You must have git and common build tools ready on your system to install icsiboost from the latest commit. Then, you need to configure the source, build icsiboost, and optionally install it in a binary directory. Use --prefix=<dir> to install the program in <dir>/bin. Don't forget to add this directory to your PATH.

git clone https://github.com/benob/icsiboost.git
cd icsiboost
./configure --prefix=$HOME CFLAGS=-O3
make
make install
export PATH=$PATH:$HOME/bin
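To check that the build succeeded and that the shell can find the binary, assuming your PATH was updated as above:

which icsiboost

Running icsiboost without arguments should print a short usage message.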

icsiboost has been reported to work on Linux, Mac OS X (you will need to install PCRE from MacPorts, Fink, or from source), Windows, and OpenSolaris.

A simple example

Let's first download example files from the UCI repository using wget:

wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
wget http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
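To verify the downloads, you can count the lines in each file (each of the two data files should contain tens of thousands of examples):

wc -l adult.names adult.data adult.test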

The downloaded database is the adult database: a classification problem where you have to determine the income of a person knowing various facts about him/her. The database consists of a file describing the different classes and features (adult.names), a file containing training examples (adult.data), and a file containing test examples (adult.test).

The names file: adult.names

>50K, <=50K.
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
...
sex: Female, Male.
capital-gain: continuous.

The first line of the .names file defines a comma-separated list of classes ended by a period (here >50K and <=50K, the income classes that we try to predict).

Each following line describes one feature column. The line consists of a feature name (age, workclass, ...) followed by a colon and information about the values allowed for that feature. Features can be real-valued (continuous, as for age), space-separated words (text), or a set of nominal values (the values themselves, as for sex). icsiboost implements decision stumps that partition the training examples into two sets (above or below a threshold for continuous values; present or absent for text and nominal values).
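For illustration, here is a minimal hypothetical .names file mixing the three feature types (the task, feature names, and values are made up for this sketch):

spam, ham.
message-length: continuous.
subject: text.
priority: low, normal, high.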

The training examples: adult.data

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K.
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K.
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K.

The training examples come in a single file containing one instance per line. The feature columns are comma-separated and must appear in the same order, and follow the same constraints, as described in the .names file. The last column contains the true class of the example, followed by a period.

Some feature values can be unknown because of privacy concerns or other restrictions on training data collection. Such values are written "?" (question mark) and receive special processing in the training and testing stages.
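For example, a hypothetical instance whose workclass is unknown would be written:

28, ?, 123456, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K.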

Training a classifier

icsiboost -S adult -n 100
rnd 1: wh-err= 0.724256 th-err= 0.724256 dev= nan test= 0.236226 train= 0.240810
rnd 2: wh-err= 0.908697 th-err= 0.658130 dev= nan test= 0.196917 train= 0.199073
rnd 3: wh-err= 0.928621 th-err= 0.611153 dev= nan test= 0.157791 train= 0.158472
rnd 4: wh-err= 0.960223 th-err= 0.586843 dev= nan test= 0.155764 train= 0.157335
rnd 5: wh-err= 0.980548 th-err= 0.575428 dev= nan test= 0.151711 train= 0.152053
rnd 6: wh-err= 0.982552 th-err= 0.565388 dev= nan test= 0.151711 train= 0.152053
rnd 7: wh-err= 0.988624 th-err= 0.558956 dev= nan test= 0.151035 train= 0.151807
rnd 8: wh-err= 0.991000 th-err= 0.553925 dev= nan test= 0.149438 train= 0.149780
rnd 9: wh-err= 0.993974 th-err= 0.550587 dev= nan test= 0.146797 train= 0.148583
rnd 10: wh-err= 0.993322 th-err= 0.546911 dev= nan test= 0.146367 train= 0.148337
...

You may want to read the papers about AdaBoost before training a classifier to learn more about the whole process. When you invoke icsiboost, you have to provide a stem for the names and training files (adult for adult.names and adult.data) and a number of iterations to perform. At each iteration a new weak classifier is trained and added to the final decision.

icsiboost outputs the iteration number (rnd), the weighted error (wh-err), which is the objective function minimized when selecting a classifier (Z() in the papers), the theoretical error (th-err, see the papers), and the test and train classification errors (ratio of misclassified examples to the number of examples). This last value is the most interesting one, as you want to reduce it as much as possible. Adding more iterations will reduce the training error further, but there is a risk of over-training, where the test error increases while the training error still decreases. You should stop iterating before that phenomenon occurs.
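Assuming the column layout shown above (the round number in column 2 and the test error in column 10), a small awk filter can report the round with the lowest test error, which is a reasonable stopping point:

icsiboost -S adult -n 100 | awk '$1 == "rnd" { sub(/:$/, "", $2); if (best == "" || $10 + 0 < best + 0) { best = $10; round = $2 } } END { print "lowest test error " best " at round " round }'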

The model resulting from training is written to the stem.shyp file. It contains all the information needed to rebuild the ensemble of weak classifiers at testing time.
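The file is plain text, so you can inspect the first weak classifiers of the trained model directly:

head adult.shyp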

Testing the performance of the classifier

icsiboost -S adult -C < adult.test
0 1 -0.321597856090 0.321597856090
0 1 -0.008942643370 0.008942643370
1 0 -0.047642775890 0.047642775890
1 0 0.229640678420 -0.229640678420
0 1 -0.313200595760 0.313200595760
0 1 -0.277185452430 0.277185452430
0 1 -0.169981706080 0.169981706080
1 0 0.036429657930 -0.036429657930
0 1 -0.263797352750 0.263797352750
0 1 -0.154846522240 0.154846522240
...

icsiboost loads the previously trained model and predicts the classes of the examples read from standard input. For each example, it first outputs a binary flag for each class marking the true class (if available), then the prediction score for each class, in the order of the class definitions in the .names file. A positive score means that the class has been predicted by the classifier.
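With the two classes of this task, each output line therefore holds two flags ($1, $2) followed by two scores ($3, $4), so a short awk script can recompute the classification error by checking whether the highest-scoring class carries the true-class flag (a sketch assuming exactly two classes):

icsiboost -S adult -C < adult.test | awk '{ pred = ($3 >= $4) ? 1 : 2; if ($(pred) != 1) errors++ } END { printf "test error = %.4f\n", errors / NR }'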

Getting the learning curve in gnuplot

You can plot the error curves of the training and test sets using gnuplot and the following commands:

icsiboost -n 100 -S adult | tee adult.iter
gnuplot
plot 'adult.iter' using 2:12 with lines title 'training error', '' using 2:10 with lines title 'test error'
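To save the plot to a file instead of opening an interactive window, the same plot command can be passed to gnuplot on the command line (PNG terminal availability depends on your gnuplot build):

gnuplot -e "set terminal png; set output 'adult.png'; plot 'adult.iter' using 2:12 with lines title 'training error', '' using 2:10 with lines title 'test error'"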