This is the official repository for the FastMapSVM algorithm associated with "Classifying seismograms using the FastMap algorithm and support-vector machines" (White et al., 2023). If you make use of this code, please cite White et al. (2023) where appropriate (full reference at bottom of page).
>$ pip install .
This tutorial is available as a IPython notebook in tutorials/tutorial.ipynb
.
import fastmap
import h5py
import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.svm
This tutorial demonstrates how to train and deploy the FastMapSVM classification model. As the name implies, FastMapSVM comprises two critical components: (1) the FastMap algorithm, and (2) an SVM classifier. The FastMap algorithm is implemented as a data transformation compatible with the sklearn.pipeline.Pipeline
API. This allows the FastMapSVM model to be implemented as a simple Pipeline
with optional intermediate transformations, such as data re-scaling.
To deploy the FastMapSVM algorithm, the user must define the distance function that quantifies the dissimilarity between any pair of objects in the train/test data. The distance function must adhere to NumPy's broadcasting rules: Given input arrays a
and b
with shapes
def correlation_distance(a, b, axis=-1):
'''
Compute the pair-wise correlation distance matrix.
'''
xcorr = correlate(a, b, axis=axis)
xcorr = np.abs(xcorr)
xcorr = np.nanmean(xcorr, axis=-2)
xcorr = np.max(xcorr, axis=-1)
xcorr = np.clip(xcorr, 0, 1)
return 1 - xcorr
def correlate(a, b, axis=-1):
A = np.fft.rfft(a, axis=axis)
B = np.fft.rfft(b, axis=axis)
X = A * np.conj(B)
x = np.fft.irfft(X)
x = np.fft.fftshift(x, axes=axis)
norm = np.sqrt(
a.shape[-1] * np.var(a, axis=axis)
*
b.shape[-1] * np.var(b, axis=axis)
)
norm = norm[..., np.newaxis]
return np.nan_to_num(x / norm, neginf=0, posinf=0)
Note: If your distance cannot be easily vectorized, the code below implements a generic loop that applies the necessary broadcasting rules and calls the distance function in singleton fashion (i.e., on individual pairs of objects).
def generic_distance(a, b, axis=-1):
# Build the output array with broadcasting rules applied.
shape = np.broadcast_shapes(a.shape, b.shape)
axis = axis if axis > -1 else len(shape) + axis
shape = shape[:axis] + shape[axis+1:]
output = np.empty(shape)
n_dim = output.ndim
# Loop over elements and compute distances serially.
for ijk in np.ndindex(*output.shape):
ijk_a = tuple([ijk[i] if a.shape[i] != 1 else 0 for i in range(len(ijk))])
ijk_b = tuple([ijk[i] if b.shape[i] != 1 else 0 for i in range(len(ijk))])
output[ijk] = dist(a[ijk_a], b[ijk_b])
return output
def dist(a, b, axis=-1):
'''
Return the distance between objects a and b.
'''
return np.linalg.norm(a - b, axis=axis)
The fastmap
module provides an abstract base class fastmap.FastMapABC
that is not intended to be used directly. The user should define a child class that adds a _distance_function
attribute to the abstract base class. Implementing the model this way supports models persistence.
class FastMap(fastmap.FastMapABC):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._distance_func = correlation_distance
with h5py.File('data/ridgecrest.hdf5', mode='r') as in_file:
X_train = in_file['/X/train'][:]
y_train = in_file['/y/train'][:]
X_test = in_file['/X/test'][:]
y_test = in_file['/y/test'][:]
The FastMapSVM model benefits from rescaling the extracted features before SVM classification.
n_dim = 2 # The number of dimensions for the Euclidean embedding.
fastmapsvm = sklearn.pipeline.Pipeline([
('fastmap', FastMap(n_dim)),
('scaler', sklearn.preprocessing.StandardScaler()),
('svc', sklearn.svm.SVC())
])
fastmapsvm.fit(X_train, y_train);
fastmapsvm.score(X_train, y_train)
Sub-components of the Pipeline can be extracted via indexing.
# Compute the embedding of the train data
W = fastmapsvm[0].transform(X_train)
plt.close('all')
fig, ax = plt.subplots()
for label in range(2):
idxs = np.argwhere(y_train == label).flatten()
ax.scatter(W[idxs, 0], W[idxs, 1])
ax.set_aspect(1)
ax.set_xlabel('$w_0$')
ax.set_ylabel('$w_1$')
# For bigger data sets, it is helpful to have a progress bar
fastmapsvm['fastmap'].show_progress = True
fastmapsvm.score(X_test, y_test)
y_hat = fastmapsvm.predict(X_test)
sklearn.metrics.ConfusionMatrixDisplay.from_predictions(
y_test,
y_hat
)
import joblib
joblib.dump(fastmapsvm, 'fastmapsvm.joblib')
del(fastmapsvm)
Note: The distance function must be defined within the scope in which you load the model. So, if you train the model in one script, and then try to reload it in another script in which the distance function is not defined, it will not work. For instance, if you restart your Python kernel and immediately run the cell below, before running anything else, you will get AttributeError: Can't get attribute 'FastMap' on <module '__main__'>
. There is, to my knowledge, no clean way of storing user-defined code and its dependencies alongside the model. The abstract base class paradigm is used to force users to write code in a way that will make it easier for them to reload the model later.
If you restart your Python kernel, you need to run the code in sections 2.1 and 2.2 (along with the necessary imports) before running the code in this section.
import joblib
fastmapsvm = joblib.load('fastmapsvm.joblib')
FastMap was originally designed for unsupervised cluster analysis, and can be trained in unsupervised mode by omitting the labels on training.
fm = FastMap(2)
fm.fit(X_train)
W = fm.transform(X_train)
plt.close('all')
fig, ax = plt.subplots()
ax.scatter(W[:, 0], W[:, 1])
- Faloutsos, C. & Lin, K.-I. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. Sigmod Rec 24, 163–174 (1995).
- White, M. C. A., Sharma, K., Li, A., Kumar, T. K. S. & Nakata, N. Classifying seismograms using the FastMap algorithm and support-vector machines. Commun. Eng. 2, 46 (2023).