Protocol
The protocol for communication between view objects and learning algorithms is defined in base.py. The key components are:
- Task
- View
- Protocol
- Learning algorithm
The following text explains how these pieces fit together, and then enumerates the different task semantics used in skdata so far.
As a researcher using skdata, you will probably be writing your own learning algorithm implementation, in order to collect the statistics you care about for your work. Writing your learning algorithm in the form described here will ease the process of adapting your experiment code to get results from different data sets that you can compare directly to other results from the relevant literature.
A task represents some data together with brief meta-data describing what kind of data it is and what to do with it. It is not meant to encapsulate behaviour; it is a container object. Using the Iris view.py as an example, we see at line 30 there is a method for creating tasks. The Iris task method binds together (a) an input feature matrix, (b) an output label vector, and (c) the semantic meta-data tag "vector_classification".
In the course of running an experiment, it is typical to create many task objects (e.g. one task for the training set and another for the testing set), but they will typically all have the same semantic meta-data descriptor.
The meta-data is the only mandatory task attribute, and its purpose is to
tell a learning algorithm what other attributes to expect in the task.
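The container idea can be pictured with a short sketch. The Task name and the semantics attribute follow the description above, but this exact constructor is an assumption for illustration, not skdata's real code:

```python
# Hedged sketch of a task container: a semantics tag plus arbitrary
# data attributes, with no behaviour of its own.
class Task(object):
    """A plain container; semantics is the only mandatory attribute."""
    def __init__(self, semantics, **kwargs):
        self.semantics = semantics   # e.g. "vector_classification"
        self.__dict__.update(kwargs)

# A task carrying "vector_classification" data:
train = Task("vector_classification",
             x=[[5.1, 3.5], [4.9, 3.0]],  # feature matrix (rows = examples)
             y=[0, 0],                    # integer labels
             n_classes=3)
```

A learning algorithm that receives this object reads train.semantics first, and from that tag knows to look for x, y, and n_classes.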
A view represents an interpretation of a data set (which is generally not standardized in any way) as a standard type of learning problem, and often specifies particular train/test splits, feature representations, and metrics for judging the success of models. Technically, a view draws on a data set to define several tasks and sequences them into a protocol.
The K-fold cross-validation evaluation protocol implemented by the Iris example creates a train task and a test task for each evaluation fold. The tasks here are all labeled with the same "vector_classification" meta-data, which indicates to the learning algorithm that each task has a .x feature matrix and a .y label vector, and that a model should be a classifier that predicts y from x while minimizing the number of classification errors.
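The fold construction can be sketched as follows. The kfold_tasks helper and the minimal Task container are illustrative stand-ins, not skdata's actual implementation (which lives in skdata.iris.view):

```python
# Illustrative sketch: carve one data set into per-fold train/test tasks.
# Task and kfold_tasks are hypothetical names, not skdata's real API.
class Task(object):
    def __init__(self, semantics, **kwargs):
        self.semantics = semantics
        self.__dict__.update(kwargs)

def kfold_tasks(x, y, n_classes, K):
    """Yield (train_task, test_task) pairs over K contiguous folds."""
    fold = len(x) // K
    for k in range(K):
        lo, hi = k * fold, (k + 1) * fold
        yield (Task("vector_classification",
                    x=x[:lo] + x[hi:], y=y[:lo] + y[hi:], n_classes=n_classes),
               Task("vector_classification",
                    x=x[lo:hi], y=y[lo:hi], n_classes=n_classes))

splits = list(kfold_tasks([[0.], [1.], [2.], [3.]], [0, 0, 1, 1],
                          n_classes=2, K=2))
```

Note that every task produced here carries the same semantics tag, as described above; only the data bound into each container differs.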
A protocol is generally implemented by a view method called protocol that takes a learning algorithm as an argument. For example, the view implementation skdata.iris.view.KfoldClassification has a protocol method that creates K pairs of train and test tasks, and then for each split calls model = algo.best_model(train) to train a model on the training data, and algo.loss(model, test) to tell the learning algorithm to measure generalization error on the corresponding test data. The protocol method works entirely by side-effect on the learning algorithm; it does not typically modify the view object itself in any way, and it returns its algo argument as the method return value.
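In outline, such a protocol method might look like the sketch below. KfoldView and CountingAlgo are hypothetical stand-ins; only the best_model, loss, and protocol names come from the description above:

```python
# Sketch of a protocol method that drives a learning algorithm by side-effect.
class KfoldView(object):
    def __init__(self, splits):
        # splits: a list of (train_task, test_task) pairs prepared elsewhere
        self.splits = splits

    def protocol(self, algo):
        for train_task, test_task in self.splits:
            model = algo.best_model(train_task)  # fit on the training task
            algo.loss(model, test_task)          # record test error in algo
        return algo  # the algo, mutated by side-effect, is the return value

class CountingAlgo(object):
    """A stub learning algorithm that just records each loss measurement."""
    def __init__(self):
        self.losses = []
    def best_model(self, train_task):
        return ("model_for", train_task)
    def loss(self, model, test_task):
        self.losses.append(0.0)

view = KfoldView([("train0", "test0"), ("train1", "test1")])
algo = view.protocol(CountingAlgo())
```

After protocol returns, all the experiment's results live in algo; the view is unchanged.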
A learning algorithm is an object that provides the methods called by the protocol. The idea is that algo.best_model, to continue our example, will inspect the meta-data of the training task and produce an appropriate model for the data. The best_model implementation may also log statistics of the learning process to internal variables, output files, etc. When the protocol later tells the learning algorithm to measure loss, the idea is that it will inspect the meta-data and measure loss in a way appropriate to a model it previously produced.
The learning algorithm object works mainly by side-effect, storing internally
any kind of interesting logs or statistics about the model-fitting process or
the generalization error.
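A concrete learning algorithm in this style might look like the following sketch. The nearest-centroid classifier is purely illustrative (skdata does not prescribe any particular model); what matters is the shape: dispatch on task.semantics, and accumulate results internally by side-effect:

```python
# Sketch of a learning algorithm object for "vector_classification" tasks.
# The nearest-centroid model is an illustrative choice, not skdata's.
class NearestCentroidAlgo(object):
    def __init__(self):
        self.results = []  # statistics accumulated across protocol calls

    def best_model(self, task):
        assert task.semantics == "vector_classification"
        # Compute one centroid per class from task.x / task.y.
        centroids = {}
        for label in range(task.n_classes):
            rows = [xi for xi, yi in zip(task.x, task.y) if yi == label]
            if rows:
                dim = len(rows[0])
                centroids[label] = [sum(r[d] for r in rows) / len(rows)
                                    for d in range(dim)]
        return centroids

    def loss(self, model, task):
        assert task.semantics == "vector_classification"
        errors = 0
        for xi, yi in zip(task.x, task.y):
            pred = min(model, key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(xi, model[c])))
            errors += (pred != yi)
        err_rate = errors / float(len(task.y))
        self.results.append(err_rate)  # log by side-effect
        return err_rate

class _Task(object):  # minimal stand-in for a task container
    def __init__(self, **kw):
        self.__dict__.update(kw)

train = _Task(semantics="vector_classification",
              x=[[0.0], [0.1], [1.0], [1.1]], y=[0, 0, 1, 1], n_classes=2)
algo = NearestCentroidAlgo()
model = algo.best_model(train)
algo.loss(model, train)
```

Because both methods check the semantics tag before touching any attributes, the same object can be extended to handle other task semantics by branching on task.semantics.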
When the protocol function call returns, the machine learning experiment is done. The various results of the experiment should be stored in the learning algorithm object, and the data set view object should be in the same state that it was before the experiment began.
The data sets in skdata (those that have been written or upgraded to use this design) use the following task semantics. If you want to define new semantics for your own work, go ahead: skdata does not need to be modified or notified in any way.
vector_classification
Task objects with these semantics must have:
- x - a matrix whose rows are feature vectors (of floats)
- y - a vector whose entries are integer labels in {0, 1, ..., n_classes - 1}
- n_classes - the number of possible classes
The x and y attributes will have the same length.
indexed_vector_classification
Task objects with these semantics must have:
- all_vectors - a matrix with shape: (examples, features)
- all_labels - a vector with shape (examples,) whose entries are integer labels in {0, 1, ..., n_classes - 1}
- idxs - a vector of non-negative "active" elements for advanced-indexing into all_vectors, and all_labels.
- n_classes - the number of possible classes
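With plain Python lists, "advanced indexing" through idxs amounts to gathering the listed rows (with NumPy arrays, all_vectors[task.idxs] does the same in one step). A sketch with a hypothetical minimal container:

```python
# Sketch: idxs names the "active" subset of the full data arrays.
class Task(object):  # hypothetical minimal container
    def __init__(self, semantics, **kwargs):
        self.semantics = semantics
        self.__dict__.update(kwargs)

task = Task("indexed_vector_classification",
            all_vectors=[[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]],
            all_labels=[0, 1, 1, 0],
            idxs=[1, 3],   # only these examples are "active" in this task
            n_classes=2)

active_x = [task.all_vectors[i] for i in task.idxs]
active_y = [task.all_labels[i] for i in task.idxs]
```

Two tasks (say a train task and a test task) can share the same all_vectors and all_labels and differ only in their idxs, which avoids copying the underlying data.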
indexed_image_classification
Task objects with these semantics must have:
- all_images - a 4-tensor with shape: (examples, height, width, channels)
- all_labels - a vector whose entries are integer labels in {0, 1, ..., n_classes - 1}
- idxs - a vector of non-negative "active" elements for advanced-indexing into all_images, and all_labels.
- n_classes - the number of possible classes
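For the image case, NumPy advanced indexing selects the active examples from the 4-tensor directly. The array shapes below are illustrative:

```python
import numpy as np

# Sketch: all examples live in one 4-tensor, and idxs selects the
# active subset via NumPy advanced (integer) indexing.
all_images = np.zeros((5, 32, 32, 3), dtype="float32")  # (examples, height, width, channels)
all_labels = np.array([0, 1, 2, 1, 0])
idxs = np.array([0, 2, 4])            # the active examples

active_images = all_images[idxs]      # shape (3, 32, 32, 3)
active_labels = all_labels[idxs]
```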