Analysis Configuration

David López-García edited this page May 17, 2023 · 26 revisions

Defining a configuration file

For the sake of clarity and code organization, we recommend including all the configuration code for a specific decoding analysis in an external configuration .m file. This file should be executed before the multivariate decoding analysis is computed. This recommendation is not mandatory, however, and more experienced users can design their own scripts according to their needs and preferences.

In both scenarios, all the configuration parameters available in MVPAlab Toolbox are described in detail in this section.

Participants and data directories

The first required information that should be specified by the user is the working directory and the location of the dataset to be imported and analyzed. This includes, for each class or condition, the name of each individual subject data file and the complete path of the class folder. These parameters can be defined in the configuration file as follows:

% Working directory:
cfg.location = pwd;

% Conditions data paths:
cfg.study.dataPaths{1,1} = 'C:\...\class_a\'; % Condition A
cfg.study.dataPaths{1,2} = 'C:\...\class_b\'; % Condition B

% Subjects data files:
cfg.study.dataFiles{1,1} = { % Condition A
     'subject_01.mat'
     'subject_02.mat'
     'subject_03.mat'
};

cfg.study.dataFiles{1,2} = { % Condition B
     'subject_01.mat'
     'subject_02.mat'
     'subject_03.mat'
};
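
Before running an analysis it can help to check that every listed file actually exists. The following is a minimal sketch, not part of MVPAlab: the folder names are hypothetical placeholders, and it assumes each subject contributes one file per condition.

```matlab
% Hypothetical configuration; folder and file names are placeholders.
cfg.study.dataPaths{1,1} = fullfile('data', 'class_a');
cfg.study.dataPaths{1,2} = fullfile('data', 'class_b');
cfg.study.dataFiles{1,1} = {'subject_01.mat'; 'subject_02.mat'};
cfg.study.dataFiles{1,2} = {'subject_01.mat'; 'subject_02.mat'};

nCond = numel(cfg.study.dataPaths);
nSubj = cellfun(@numel, cfg.study.dataFiles);   % files listed per condition

% Assumption: the same subjects appear in every condition.
if any(nSubj ~= nSubj(1))
    warning('Conditions list a different number of subject files.');
end

% Build the full path of every file and report the missing ones.
for c = 1:nCond
    for s = 1:nSubj(c)
        f = fullfile(cfg.study.dataPaths{1,c}, cfg.study.dataFiles{1,c}{s});
        if ~exist(f, 'file')
            fprintf('Missing: %s\n', f);
        end
    end
end
```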

Trial average

If enabled, this approach randomly or sequentially averages a certain number of trials belonging to the same condition for each participant. This procedure creates supertrials and usually increases the signal-to-noise ratio (SNR), which improves the overall decoding performance and also reduces the computational load. However, reducing the number of trials per condition typically increases the variance of the decoding performance, so this procedure imposes a trade-off between higher accuracy and higher variance. It should be noted that increasing the number of averaged trials does not increase the decoding performance linearly.

The default parameters for this procedure can be modified in the MVPAlab configuration file as follows:

cfg.trialaver.flag      = true;
cfg.trialaver.ntrials   = 5;
cfg.trialaver.order     = 'rand';

Trial averaging can be enabled or disabled by setting the configuration variable .flag to true or false. The number of trials to average can be modified in .ntrials. Finally, the order in which the trials are selected for averaging can be modified by setting the variable .order to 'rand' or 'sequential'.
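
As an illustration of what supertrials are, here is a minimal sketch on toy data. The [trials x features] layout, the variable names and the decision to discard leftover trials are assumptions made for this example, not a description of MVPAlab's internal code.

```matlab
% Hypothetical toy data for one condition: 20 trials x 4 features.
X = reshape(1:80, 20, 4);
ntrials = 5;             % trials per supertrial (cf. cfg.trialaver.ntrials)
order   = 'sequential';  % or 'rand' (cf. cfg.trialaver.order)

idx = 1:size(X, 1);
if strcmp(order, 'rand')
    idx = idx(randperm(numel(idx)));   % shuffle trials before grouping
end

nsuper = floor(numel(idx) / ntrials);  % assumption: leftover trials are dropped
S = zeros(nsuper, size(X, 2));
for k = 1:nsuper
    block   = idx((k-1)*ntrials + 1 : k*ntrials);
    S(k, :) = mean(X(block, :), 1);    % one supertrial per block of trials
end
```

With these settings the 20 original trials become 4 supertrials, which is where both the SNR gain and the loss of observations come from.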

Balanced datasets

Unbalanced datasets can lead to skewed classification results. To avoid this, the number of trials per condition should be the same. MVPAlab can define strictly balanced datasets by downsampling the majority class to match the size of the minority one (cfg.classsize.match). In addition, each class size can be set to a multiple of k, the total number of folds in the cross-validation (CV) procedure (cfg.classsize.matchkfold). Thus, during CV each fold is composed of exactly the same number of observations, avoiding this source of bias in the results.

These features are disabled by default but can be enabled in the MVPAlab configuration structure as follows:

cfg.classsize.match       = true;
cfg.classsize.matchkfold  = true;
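
The arithmetic behind both options can be sketched as follows. The trial counts are hypothetical, and trimming down to the nearest multiple of k is one plausible reading of .matchkfold rather than a confirmed implementation detail.

```matlab
nA = 53;  nB = 47;   % hypothetical trial counts for conditions A and B
k  = 5;              % number of CV folds

n = min(nA, nB);     % match: downsample the majority class (47 trials each)
n = n - mod(n, k);   % matchkfold: largest multiple of k not above n (45)

keepA = randperm(nA, n);   % random subset of trials kept from condition A
keepB = randperm(nB, n);   % random subset of trials kept from condition B
```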

Data normalization

In machine learning, data normalization refers to the process of adjusting the range of the M/EEG raw data to a common scale without distorting differences in the ranges of values. Although classification algorithms can work with raw values, normalization usually improves the efficiency and the performance of the classifiers. Four different (mutually exclusive) data normalization methods are implemented in MVPAlab.

A commonly used normalization approach is computed within the cross-validation loop. Hence, the training and test sets are standardized as follows:

X_train = (X_train - μ_train) / σ_train
X_test = (X_test - μ_train) / σ_train

where μ_train and σ_train denote the mean and the standard deviation of each feature (column) of the training set. Other normalization methods implemented in MVPAlab are: z-score (μ=0 ; σ=1) across time, trial or features.

The data normalization method, which is disabled by default, can be modified as follows:

cfg.normdata = 4;  % 0 – Disabled
                   % 1 – ZSCORE across features
                   % 2 – ZSCORE across time
                   % 3 – ZSCORE across trials
                   % 4 – Nested in CV loop
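
The standardization nested in the CV loop (option 4) can be sketched on toy data as follows. The matrices are hypothetical; the key point is that the test set is scaled with the mean and standard deviation of the training set only. The subtraction relies on implicit expansion (MATLAB R2016b or later).

```matlab
% Hypothetical fold: rows are trials, columns are features.
Xtrain = [1 10; 2 20; 3 30];
Xtest  = [2 40];

mu    = mean(Xtrain, 1);     % per-feature mean of the training set
sigma = std(Xtrain, 0, 1);   % per-feature standard deviation of the training set

Xtrain_z = (Xtrain - mu) ./ sigma;
Xtest_z  = (Xtest  - mu) ./ sigma;   % training statistics, never test statistics
```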

Data smoothing

Data smoothing is a procedure employed in recent M/EEG studies to attenuate unwanted noise. MVPAlab implements an optional data smoothing step that can be computed before multivariate analyses.

Moving average:

This procedure is based on the MATLAB function smooth (part of the Curve Fitting Toolbox), which smooths M/EEG data points using a moving average filter.

The length of the smoothing window can be specified in the variable cfg.smoothdata.window and should be an odd number. For a window length of 5 time points, the smoothed version of the original signal is computed as follows:

y_smoothed(1) = y(1)
y_smoothed(2) = (y(1) + y(2) + y(3))/3
y_smoothed(3) = (y(1) + y(2) + y(3) + y(4) + y(5))/5
y_smoothed(4) = (y(2) + y(3) + y(4) + y(5) + y(6))/5
...
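
The formulas above can be reproduced with base MATLAB only. This is a minimal sketch that mirrors the shrinking window at the borders; it is written for illustration and is not the code of smooth itself.

```matlab
y = [1 4 9 16 25 36 49];   % hypothetical signal
w = 5;                     % odd window length (cf. cfg.smoothdata.window)

half = (w - 1) / 2;
n    = numel(y);
ys   = zeros(size(y));
for i = 1:n
    h     = min([half, i - 1, n - i]);   % shrink the window near the edges
    ys(i) = mean(y(i-h : i+h));          % centered moving average
end
```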

Gaussian kernel:

This procedure is based on the MATLAB built-in function smoothdata. If this function is available in your MATLAB distribution, it smooths M/EEG data points with a Gaussian-weighted moving average filter.

Data smoothing is disabled by default (.method = 'none') and can be enabled and configured in the MVPAlab configuration file as follows:

cfg.smoothdata.method  = 'moving';      % Moving average
% cfg.smoothdata.method  = 'gaussian';  % Gaussian kernel (alternative)
cfg.smoothdata.window  = 5;

Analysis timing

By default, MVPAlab computes the time-resolved decoding analysis for each time point across the entire M/EEG epoch. However, the user can define a specific region of interest (time window) and a different step size as follows:

cfg.tm.tpstart   = -200;
cfg.tm.tpend     = 1500;
cfg.tm.tpsteps   = 3;

This way, the temporal decoding analysis will be computed from -200 ms (.tpstart) to 1500 ms (.tpend), not for each timepoint but for every third timepoint (.tpsteps = 3). Note that increasing the step size decreases the processing time but also reduces the temporal resolution of the decoding results.
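
To see how these three values translate into sample indices, consider the following sketch. The time axis (250 Hz, -500 ms to 1996 ms) is hypothetical, and the nearest-sample lookup is an assumption about how such a window could be resolved, not MVPAlab's documented behavior.

```matlab
times   = -500:4:1996;   % hypothetical epoch time axis in ms (250 Hz)
tpstart = -200;          % cf. cfg.tm.tpstart
tpend   = 1500;          % cf. cfg.tm.tpend
tpsteps = 3;             % cf. cfg.tm.tpsteps

[~, i0] = min(abs(times - tpstart));   % sample closest to the window start
[~, i1] = min(abs(times - tpend));     % sample closest to the window end
tp = i0:tpsteps:i1;                    % indices of the analyzed timepoints
```

Here 426 samples inside the window are reduced to 142 analyzed timepoints, spaced 12 ms apart instead of 4 ms.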

Channel selection

By default, MVPAlab computes the selected decoding analysis using all the available channels (electrodes) in the dataset. However, a specific set of channels of interest can be defined, which allows you to focus on the channels that are most relevant to your analysis. This set of electrodes can be defined as follows:

cfg.channels.selected = [3, 13, 23, 33, 43, 53, 63];

Note that each number corresponds to the index of the selected electrode. In order to use the entire set of electrodes, the list of selected electrodes should be empty:

cfg.channels.selected = [];
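
In practice, channel selection amounts to indexing the channel dimension of the data. A minimal sketch, assuming a hypothetical [channels x timepoints x trials] array:

```matlab
data = zeros(64, 100, 10);             % hypothetical: 64 channels, 100 timepoints, 10 trials
sel  = [3, 13, 23, 33, 43, 53, 63];    % cf. cfg.channels.selected

if isempty(sel)
    sel = 1:size(data, 1);             % empty list: use every channel
end
subset = data(sel, :, :);              % only the selected channels are used as features
```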

Dimensionality reduction

In machine learning, dimension reduction techniques are a common practice to reduce the number of variables in high-dimensional datasets. During this process, the features contributing more significantly to the variance of the original dataset are automatically selected. In other words, most of the information contained in the original dataset can be represented using only the most discriminative features. As a result, dimensionality reduction facilitates, among others, classification, visualization, and compression of high-dimensional data.

There are different dimensionality reduction approaches, but Principal Component Analysis (PCA) is probably the most popular multivariate statistical technique, used in almost all scientific disciplines, including neuroscience. PCA is a linear transformation of the original dataset into an orthogonal coordinate system in which the axes (principal components) correspond to the directions of highest variance, sorted by importance.

To keep the estimate of the model's performance as fair and unbiased as possible, PCA is computed only on the training set X_training, independently for each fold inside the cross-validation procedure. Once PCA for the corresponding training set is computed and the model is trained, the exact same transformation is applied to the test set X_test (including centering with μ_training). In other words, the test set is projected onto the reduced feature space obtained during the training stage.

However, dimensionality reduction techniques such as PCA entail a trade-off between the benefits of dimension reduction (reduced training time, reduced redundant data and improved accuracy) and the interpretability of the results when electrodes are used as features. When PCA is computed, the data is projected from the sensor space onto the reduced PCA feature space. This linear transformation implies an intrinsic loss of spatial information, which means that, for example, we cannot directly analyze which electrodes contribute more to decoding performance.

The default parameters for this procedure can be modified in the MVPAlab configuration file as follows:

cfg.dimred.flag    = true;
cfg.dimred.method  = 'pca'; 
cfg.dimred.ncomp   = 5;
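
The fold-wise PCA described in this section can be sketched with base MATLAB through a singular value decomposition. The data are random and hypothetical, and this is an illustration of the principle (fit on the training set, reuse the same centering and projection on the test set), not MVPAlab's own implementation.

```matlab
Xtrain = randn(20, 8);   % hypothetical fold: 20 training trials x 8 features
Xtest  = randn(5, 8);    % 5 test trials
ncomp  = 5;              % cf. cfg.dimred.ncomp

mu = mean(Xtrain, 1);                    % centering learned on the training set
[~, ~, V] = svd(Xtrain - mu, 'econ');    % columns of V: principal directions
W = V(:, 1:ncomp);                       % keep the ncomp leading components

Ztrain = (Xtrain - mu) * W;              % training set in the reduced space
Ztest  = (Xtest  - mu) * W;              % same transformation applied to the test set
```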

Classification model

Classification algorithms are the cornerstone of decoding analyses. These mathematical models play the central role in multivariate analyses: detecting subtle changes in patterns in the data that are usually missed by less sensitive approaches. Different classification algorithms have been used to achieve this goal, from probabilistic models such as Discriminant Analysis (DA), Logistic Regression (LR) or Naïve Bayes (NB) to supervised learning algorithms such as Support Vector Machines (SVM).

For the time being, MVPAlab Toolbox implements two of the most commonly employed models in the neuroscience literature, Support Vector Machines and Discriminant Analysis in their linear and non-linear variants.


[Figure: MVPAlab-models]


The classification model employed for the decoding analysis can be specified in the configuration file as follows:

cfg.classmodel.method = 'svm';    % Support Vector Machine
% cfg.classmodel.method = 'da';   % Discriminant Analysis (alternative)

Both classification approaches are based on MATLAB built-in libraries for support vector machines and discriminant analyses. Please see MATLAB documentation of fitcsvm and fitcdiscr functions for further details.

MVPAlab uses linear SVM classifiers for decoding analysis by default, but other kernel functions for non-linear classification can be specified in the MVPAlab configuration file as follows:

cfg.classmodel.kernel = 'linear';        % Default
% cfg.classmodel.kernel = 'gaussian';    % Alternatives:
% cfg.classmodel.kernel = 'rbf';
% cfg.classmodel.kernel = 'polynomial';

When discriminant analysis is selected, MVPAlab Toolbox uses Linear Discriminant Analysis by default but, as for SVM, this setting can be modified in the configuration file as follows:

cfg.classmodel.kernel = 'quadratic';
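
For orientation, the sketch below calls fitcsvm and fitcdiscr directly on toy data (Statistics and Machine Learning Toolbox required). How MVPAlab maps cfg.classmodel.method and cfg.classmodel.kernel onto these calls is an assumption here; only the MATLAB function names and options themselves are standard.

```matlab
% Hypothetical, well-separated two-class data: rows are trials, columns are features.
A = [0 0; 1 0; 0 1; 1 1; 0.5 0.2];
X = [A; A + 10];
y = [ones(5, 1); 2 * ones(5, 1)];

svmLinear = fitcsvm(X, y, 'KernelFunction', 'linear');      % 'svm' + 'linear'
svmRbf    = fitcsvm(X, y, 'KernelFunction', 'rbf');         % 'svm' + 'rbf' / 'gaussian'
ldaModel  = fitcdiscr(X, y, 'DiscrimType', 'linear');       % 'da'  + 'linear'
qdaModel  = fitcdiscr(X, y, 'DiscrimType', 'quadratic');    % 'da'  + 'quadratic'

Xnew      = [0.5 0.5; 10.5 10.5];    % one unseen point near each class
labelsSvm = predict(svmLinear, Xnew);
labelsLda = predict(ldaModel, Xnew);
```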

Cross-validation

In prediction models, cross-validation techniques are used to estimate how well the classification algorithm generalizes to unseen data. Two popular approaches for evaluating the performance of a classification model on a specific dataset are k-fold and leave-one-out cross-validation.

In general, these techniques randomly split the original dataset into two different subsets: the training set X_training, containing a fraction 1 − 1/K of the exemplars, and the test set X_test, containing the remaining 1/K. This procedure is repeated K times (folds), selecting a different, disjoint test subset for each iteration. Thus, for each fold, the classification model is trained on the training set and evaluated using exemplars belonging to the test set. The final classification performance value for a single timepoint is the mean performance value across all iterations.

When K and the total number of exemplars (instances) are equal, this procedure is called leave-one-out cross-validation. Here, the classification model is trained with all but one of the exemplars and evaluated with the remaining exemplar. By definition, this approach is computationally demanding and time consuming for large datasets, and for that reason is usually employed only with small sets of data.

The cross-validation procedure can be tuned in the MVPAlab configuration file as follows:

cfg.cv.method  = 'kfold';
cfg.cv.nfolds  = 5;

If .method = 'loo' the number of folds is automatically updated to match the total number of exemplars for each participant.
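
The partitioning logic of k-fold cross-validation can be sketched with base MATLAB as follows. The number of exemplars is hypothetical and the training step is left as a comment; this illustrates the splitting scheme only, not MVPAlab's internal routine.

```matlab
n = 20;  k = 5;               % hypothetical: 20 exemplars, cf. cfg.cv.nfolds = 5
perm = randperm(n);           % random order of the exemplars
fold = mod(0:n-1, k) + 1;     % fold label (1..k) for each shuffled position

allTest = [];
for f = 1:k
    testIdx  = perm(fold == f);    % 1/K of the exemplars
    trainIdx = perm(fold ~= f);    % remaining 1 - 1/K
    % ... train on trainIdx, evaluate on testIdx, store the fold performance ...
    allTest = [allTest, testIdx];  %#ok<AGROW>
end
% Leave-one-out is the special case k = n (one exemplar per test set).
```

Every exemplar ends up in exactly one test set, so the mean over folds uses each observation once for evaluation.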

Performance metrics

(1) Mean accuracy is usually employed to evaluate decoding models' performance in neuroscience studies. This metric is fast, easy to compute and is defined as the number of hits over the total number of evaluated trials. By default, MVPAlab Toolbox returns the mean accuracy value as a measure of decoding performance. Nevertheless, in situations with very skewed sample distributions, this metric may generate systematic and undesired biases in the results. Other performance metrics, such as the balanced accuracy have been proposed to mitigate this problem.

Accuracy values can be complemented with (2) confusion matrices, which are very useful for binary classification but even more so for multiclass scenarios. In machine learning, a confusion matrix allows the visualization of the performance of an algorithm, reporting false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN). To this end, a confusion matrix reflects the predicted versus the actual classes. Rows correspond to true classes and columns to predicted classes. Thus, the element CM(i,j) indicates the number (or the proportion) of exemplars of class i classified as class j.

Other interesting and more informative performance metrics available in MVPAlab are derivations of the confusion matrix:

  • (3) Precision (PR = TP/(TP+FP)): proportion of trials labeled as positive that actually belong to the positive class.
  • (4) Recall, also known as sensitivity (R = TP/(TP+FN)): proportion of positive trials that are retrieved by the classifier.
  • (5) F1-score (F1 = 2TP/(2TP+FP+FN)): combination of precision and recall in a single score through the harmonic mean.
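
A worked example may make these definitions concrete. The labels below are hypothetical (1 = positive class, 0 = negative class), and the variable names are chosen for this sketch only.

```matlab
ytrue = [1 1 1 1 0 0 0 0 0 0];   % hypothetical true labels
ypred = [1 1 1 0 1 0 0 0 0 0];   % hypothetical classifier output

TP = sum(ytrue == 1 & ypred == 1);   % 3
FN = sum(ytrue == 1 & ypred == 0);   % 1
FP = sum(ytrue == 0 & ypred == 1);   % 1
TN = sum(ytrue == 0 & ypred == 0);   % 5

CM  = [TP FN; FP TN];                % rows: true class, columns: predicted class
acc = (TP + TN) / numel(ytrue);      % mean accuracy: 0.8
PR  = TP / (TP + FP);                % precision: 0.75
R   = TP / (TP + FN);                % recall: 0.75
F1  = 2*TP / (2*TP + FP + FN);       % F1-score: 0.75
bal = (TP/(TP+FN) + TN/(TN+FP)) / 2; % balanced accuracy: mean of per-class recall
```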

Nonetheless, nonparametric, criterion-free estimates, such as the Area Under the ROC Curve (AUC), have been shown to be a better measure of generalization for imbalanced datasets. The ROC curve is used for a more rigorous examination of a model's performance, and the AUC summarizes it in a single value: the larger the area, the more accurate the classification model is. This metric is one of the most suitable evaluation criteria, as it shows how well the model distinguishes between conditions by plotting the sensitivity (True Positive Rate, TPR) against 1-specificity (False Positive Rate, FPR).
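
The AUC also has a simple rank interpretation: it equals the probability that a randomly chosen positive trial receives a higher classifier score than a randomly chosen negative one. The sketch below computes it that way on hypothetical scores; it is an illustration of the metric, not the routine MVPAlab uses. The pairwise comparison relies on implicit expansion (MATLAB R2016b or later).

```matlab
scores = [0.9 0.8 0.7 0.6 0.55 0.4 0.3 0.2];   % hypothetical classifier scores
labels = [1   1   0   1   0    1   0   0  ];   % 1 = positive, 0 = negative

pos = scores(labels == 1);
neg = scores(labels == 0);

wins = pos' >  neg;    % positive scored above negative, for every pair
ties = pos' == neg;    % ties count as half a win
auc  = (sum(wins(:)) + 0.5 * sum(ties(:))) / (numel(pos) * numel(neg));
```

Here 13 of the 16 positive/negative pairs are ordered correctly, giving an AUC of 0.8125.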

By default, MVPAlab only returns the mean accuracy, although other performance metrics can be enabled in the configuration file as follows:

cfg.classmodel.roc       = false;
cfg.classmodel.auc       = false;
cfg.classmodel.confmat   = false;
cfg.classmodel.precision = false;
cfg.classmodel.recall    = false;
cfg.classmodel.f1score   = true;

Users should be aware that enabling several performance metrics will significantly increase the computation time and memory requirements to store the results.

Parallel computation

The MVPAlab Toolbox is adapted and optimized for parallel computation. If the Parallel Computing Toolbox (MATLAB) is installed and available, MVPAlab can compute several timepoints simultaneously. The computational load is thus distributed among the different CPU cores, significantly decreasing the processing time. This feature becomes critical especially when the user is dealing with large datasets and needs to compute several thousand permutation-based analyses. Parallel computation is disabled by default but can be enabled in the MVPAlab configuration file as follows:

cfg.classmodel.parcomp  = true;
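
Conceptually, distributing timepoints across cores looks like the parfor loop below. The loop body is a placeholder standing in for "train and test the classifier at this timepoint"; it is a hypothetical sketch, not MVPAlab's code. Without the Parallel Computing Toolbox, MATLAB runs a parfor loop serially.

```matlab
ntp = 50;                 % hypothetical number of timepoints
acc = zeros(1, ntp);      % one result per timepoint

parfor tp = 1:ntp
    % Placeholder computation; each iteration is independent of the others,
    % which is what makes the time-resolved analysis easy to parallelize.
    acc(tp) = mod(tp, 2);
end
```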