
Universal rank-order method

Highly vectorized Python implementation of a universal rank-order method to analyze noisy data. The method is valuable for

  • nonlinear regression with heavy-tailed noise, and for
  • distinguishing between deterministic chaos and stochastic noise.

The method is briefly outlined in the section nonlinear regression and detailed in the publication

G. Ierley and A. Kostinski. Universal rank-order transform to extract signals from noisy data. Phys. Rev. X 9, 031039 (2019) (arXiv version)

Nonlinear regression

The aim of nonlinear regression is to extract parameters from data through a mathematical model. Concretely, the regression using the rank-order method requires

  • data points (xk, yik), with ns sampling locations xk and nr repeated observations yik at each location, and
  • a model function f(x, ξ) with adjustable parameters ξ = (ξ1, ..., ξnp).

These ingredients are illustrated in Fig. 1 for certain values of the parameters ξ. From the residuals rik = yik - f(xk, ξ) the method estimates the remaining signal content by a universal transform. The optimal parameters ξ* are then found by numerically minimizing the signal content in the residuals.

Figure 1: Ingredients for nonlinear regression

Universal transform

The universal transform consists of the following steps as exemplified in Fig. 2.

  1. The transform starts with the residual matrix rik. The residuals in each row are ranked separately in increasing order (blue shading), and the ranks form the entries of the rank matrix Rik. In other words, the element rik has rank Rik within the ith row of the residual matrix.

  2. From the rank matrix Rik, the occurrences of each rank value are counted for each column separately (orange shading) and summarized in the population matrix Prk. That is, the rank r appears Prk times in the kth column of the rank matrix.

  3. The population matrix Prk is partitioned once vertically and once horizontally, between the jth and (j+1)th rows and between the kth and (k+1)th columns, respectively (green cross). From the partition the element

Qjk = ⟨P⟩upper left + ⟨P⟩lower right − ⟨P⟩upper right − ⟨P⟩lower left

follows, where the angle brackets ⟨·⟩ denote the mean over the matrix elements of the respective highlighted block.
  4. Repeating step 3 for all partitions yields the Q matrix. Finally, the signal content in the residuals is measured by the root mean square (rms) of the elements Qjk:

Qrms = sqrt(⟨Qjk²⟩), where the mean runs over all partitions j, k = 1, ..., ns − 1.

The ranking makes the method robust to outliers, as ranks are insensitive to the precise residual values. The universal regression method is therefore valuable for data with heavy-tailed noise, where it outperforms the least-squares method.

Figure 2: Steps of the universal transform
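
For concreteness, the following unoptimized numpy sketch implements the four steps for a residual matrix of shape (nr, ns). It assumes ordinal ranking (ties broken by position) and the alternating block-mean combination for Qjk given above; the package itself uses a vectorized and algorithmically faster implementation (see the section implementation).

import numpy as np

def q_rms(residuals):
    """Signal content of a residual matrix of shape (nr, ns), as rms of Q."""
    nr, ns = residuals.shape

    # step 1: rank residuals within each row in increasing order (ranks 1..ns);
    # stable sorting breaks ties by position, i.e. ordinal ranking
    ranks = np.argsort(np.argsort(residuals, axis=1, kind='stable'), axis=1) + 1

    # step 2: population matrix; pop[r - 1, k] counts how often
    # rank r occurs in column k of the rank matrix
    pop = np.zeros((ns, ns))
    for k in range(ns):
        for r in ranks[:, k]:
            pop[r - 1, k] += 1

    # step 3: Q matrix from all partitions between rows j/j+1 and columns k/k+1
    q = np.empty((ns - 1, ns - 1))
    for j in range(1, ns):
        for k in range(1, ns):
            q[j - 1, k - 1] = (pop[:j, :k].mean() + pop[j:, k:].mean()
                               - pop[:j, k:].mean() - pop[j:, :k].mean())

    # step 4: signal content as root mean square of the Q matrix
    return np.sqrt(np.mean(q ** 2))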

Implementation

As every regression analysis requires many universal transforms, the speed of the transform is crucial. It is improved by using vectorized functions from numpy and an algorithm with better time complexity, as presented in the publication

D. Kestner, G. Ierley and A. Kostinski. A fast algorithm for computing a matrix transform used to detect trends in noisy data. Comput. Phys. Commun. 254, 107382 (2020) (arXiv version)

Installation

The package can be installed and updated via the package manager pip with either of the following commands.

# install package
pip install git+https://github.com/samuehae/rankorder.git

# install package with additional dependencies for running examples
pip install git+https://github.com/samuehae/rankorder.git#egg=rankorder[examples]

The usage of the package is demonstrated with the Python scripts included in the folder examples.

The authors of the publication Comput. Phys. Commun. 254, 107382 provide implementations for Julia and Matlab.

Practical challenges and solutions

This section discusses practical challenges of the universal regression method and suggests potential solutions. Most of them are related to the discrete nature of the method.

Additive offsets

The universal method ignores additive offsets in the data, since the ranking step keeps only the order of the values and discards the values themselves. As a consequence, the model function can and should neglect additive offsets (see examples/hyperbola_fit.py).

If desired, additive offsets can be retrieved from the residuals after fitting, for example as the mean ⟨rik⟩. This estimate follows from an unweighted least-squares fit of a constant model function f(x, ξ) = ξ.
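
As a minimal sketch with toy numbers, the offset estimate is a one-liner; numpy.nanmean also skips missing observations (see the next section):

import numpy as np

# toy residuals after a fit, shape (nr, ns); nan marks a missing observation
r = np.array([[0.4, 0.6, 0.5],
              [0.5, np.nan, 0.6]])

# additive offset estimated as the mean residual
offset = np.nanmean(r)  # 0.52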

Missing data

In practice, data sets might have missing data points, which should be excluded from the regression analysis. Such points (xk, yik) can simply be masked by setting the observations yik to numpy.nan (see examples/hyperbola_fit.py and the sketch below).
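
A small masking sketch with toy observations (illustrative values):

import numpy as np

# toy observations, shape (nr, ns) = (2, 3)
y = np.array([[1.0, 2.1, 2.9],
              [1.2, 2.0, 3.1]])

# exclude a missing or corrupted observation from the regression
y[1, 1] = np.nan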

Numerical minimization

The rough structure of the cost function Qrms versus the parameters ξ may trap the optimization routine in local minima (see examples/biexponential_fit.py). This challenge may be addressed with the following suggestions.

  • Try derivative-free minimization routines, for example Nelder-Mead (scipy.optimize.minimize) or basin-hopping (scipy.optimize.basinhopping); see the sketch after this list.

  • Tune the arguments of the minimization routines, such as the initial guess and the tolerances.

  • Plot the cost function Qrms versus the parameters ξ or a subset thereof. This may help to shift the initial guess closer to the global minimum.

  • Reduce the number of free parameters np if possible, for example by fixing parameters through separate measurements or by using a simpler model function.

  • To smooth the cost function, increase the number of repetitions nr by collecting or simulating observations. To simulate observations, copy the measured ones and add weak normal noise, with a standard deviation that is a fraction of the uncertainty inherent to the data.
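
As a rough end-to-end sketch of these suggestions (illustrative code, not the package's API): toy data with heavy-tailed noise are fitted by minimizing the q_rms function from the sketch in the section universal transform with the derivative-free Nelder-Mead routine.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=0)

# toy data: 20 sampling locations, 50 repetitions, Cauchy (heavy-tailed) noise
x = np.linspace(1.0, 5.0, 20)
y = 2.5 / x + 0.1 * rng.standard_cauchy(size=(50, 20))

# model function without additive offset (see section additive offsets)
def f(x, xi):
    return xi[0] / x

# cost: signal content remaining in the residuals
# (q_rms as sketched in the section universal transform)
def cost(xi):
    return q_rms(y - f(x, xi))

# derivative-free minimization; tune the initial guess if it gets trapped
res = minimize(cost, x0=[1.0], method='Nelder-Mead')
print(res.x)  # parameter estimate, expected near 2.5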

Random noise

Data with less random noise exhibits a sharper minimum in the root mean square Qrms versus the fit parameters ξ around their optimal values ξ* (see examples/hyperbola_fit.py).

Accordingly, it is challenging to find the minimum for data with a small noise level. Pre-whitening the data can thus help to find the global minimum. A suitable choice is to add normal noise, created for example with the function numpy.random.normal, whose standard deviation is a fraction of the uncertainty inherent to the data, as sketched below.
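
A minimal pre-whitening sketch; sigma_data is an illustrative name for the uncertainty inherent to the data, and the newer Generator API stands in for the legacy numpy.random.normal:

import numpy as np

rng = np.random.default_rng(seed=0)

y = np.array([[1.0, 2.1, 2.9],
              [1.2, 2.0, 3.1]])  # toy observations

# add weak normal noise with a standard deviation well below
# the uncertainty inherent to the data
sigma_data = 0.1
y_white = y + rng.normal(scale=0.2 * sigma_data, size=y.shape)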

Ties in ranking

During the ranking step, the same residual value may appear several times, especially for data with little noise, where the residuals can become close to zero. Such ties can be treated in different ways (see the function rankorder.fitting.q_rms_fit and the comparison sketch after this list):

  • Ordinal ranking breaks ties according to the location of the elements in the residual matrix. This method might introduce biases, which can be mitigated as follows.

  • Pre-whitening the data with random noise avoids ties in the first place and randomizes any remaining ties during ordinal ranking. Pre-whitening is detailed in the section random noise.

  • Random ranking resolves ties by shuffling tied elements randomly. A side effect is that the minimization needs to deal with a fluctuating cost function (root mean square Qrms).
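
A small comparison sketch (not the package's internal code) of ordinal and random tie-breaking on a single row of residuals:

import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(seed=0)

row = np.array([0.5, 0.2, 0.5, 0.1])  # one row of residuals with a tie

# ordinal ranking: ties broken by their position in the row (possible bias)
print(rankdata(row, method='ordinal'))  # [3 2 4 1]

# random ranking: rank a randomly permuted row and map the ranks back,
# so tied elements receive their ranks in random order
perm = rng.permutation(len(row))
ranks = np.empty(len(row), dtype=int)
ranks[perm] = rankdata(row[perm], method='ordinal')
print(ranks)  # e.g. [4 2 3 1]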