🚧 Work in progress.
Python implementation of Concentration Free Outlier Factor (CFOF) [1].
- Concentration free
- Does not suffer from the hubness problem
- Semi-locality
- The fast-CFOF algorithm reliably computes CFOF scores at a cost linear in both dataset size and dimensionality
To install the latest release:
$ pip install cfof
Import CFOF and FastCFOF.
>>> from cfof import CFOF, FastCFOF
>>> import numpy as np
Load data.
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Instantiate CFOF or FastCFOF, then call .compute(X) to calculate the scores. .compute(X) returns sc, where sc[i, l] is the score of object i for ϱ_l (rhos[l]).
You can also calculate CFOF scores from a precomputed distance matrix using .compute_from_distance_matrix().
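Higher CFOF scores indicate more outlying objects [1], so a common next step is to rank objects by score for one value of ϱ. The following is a minimal NumPy sketch, not part of this library: it assumes sc is the array returned by .compute(X) (the values here are copied from the CFOF example output in this README), and the names l, ranking and top_k are illustrative.

```python
import numpy as np

# Scores as returned by .compute(X); values copied from the CFOF example
# in this README (rows = objects, columns = rhos).
rhos = [0.5, 0.6]
sc = np.array([[0.5, 0.66666667],
               [0.33333333, 0.83333333],
               [0.5, 1.0],
               [0.5, 0.66666667],
               [0.33333333, 0.83333333],
               [0.5, 1.0]])

l = 1  # column index, i.e. scores for rhos[1] = 0.6

# Sort object indices by decreasing score; the stable sort keeps the
# original index order among objects with equal scores.
ranking = np.argsort(-sc[:, l], kind="stable")

# Keep the two highest-scoring objects (hypothetical cut-off).
top_k = ranking[:2]
print(top_k)
```

The cut-off of two objects is arbitrary; in practice it would depend on the expected fraction of outliers.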
With CFOF, use compute to compute CFOF scores directly from data.
>>> cfof_clf = CFOF(metric='euclidean', rhos=[0.5, 0.6], n_jobs=1)
>>> cfof_clf.compute(X)
array([[0.5       , 0.66666667],
       [0.33333333, 0.83333333],
       [0.5       , 1.        ],
       [0.5       , 0.66666667],
       [0.33333333, 0.83333333],
       [0.5       , 1.        ]])
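For intuition, the scores above can be reproduced from the definition in [1]: the CFOF score of an object is the smallest k/n such that the object appears among the k nearest neighbors of at least nϱ objects, with each object counted as its own first neighbor. The function below is an illustrative brute-force sketch under that reading, not this library's implementation; naive_cfof is a hypothetical name, and it needs O(n²) memory.

```python
import numpy as np

def naive_cfof(X, rhos):
    """Brute-force CFOF scores (illustrative sketch, not the cfof package)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # ranks[y, x] = position (1-based) of x in y's neighbor list.
    # Each object is normally its own first neighbor (distance 0).
    order = np.argsort(D, axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)[None, :]

    # For each object x (column), sort the ranks it obtains in all lists.
    sorted_ranks = np.sort(ranks, axis=0)

    scores = np.empty((n, len(rhos)))
    for l, rho in enumerate(rhos):
        # Number of reverse neighbors required, clipped to [1, n].
        needed = min(max(int(np.ceil(n * rho)), 1), n)
        # Smallest k for which x is in the k-NN of `needed` objects.
        scores[:, l] = sorted_ranks[needed - 1, :] / n
    return scores

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
scores = naive_cfof(X, rhos=[0.5, 0.6])
print(scores)
```

On this six-point dataset the sketch agrees with the output shown above; tie-breaking between equidistant neighbors is an assumption here and may differ from the library.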
Use compute_from_distance_matrix to compute CFOF scores from a precomputed distance matrix.
>>> from sklearn.metrics import pairwise_distances
>>> distance_matrix = pairwise_distances(X, metric='euclidean')
>>> cfof_clf.compute_from_distance_matrix(distance_matrix)
array([[0.5       , 0.66666667],
       [0.33333333, 0.83333333],
       [0.5       , 1.        ],
       [0.5       , 0.66666667],
       [0.33333333, 0.83333333],
       [0.5       , 1.        ]])
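Because compute_from_distance_matrix only needs pairwise distances, other dissimilarities can be precomputed as well. The following is a minimal sketch using NumPy only, with Manhattan distance chosen as an arbitrary example; it assumes the method expects a square n×n array like the one pairwise_distances produces above.

```python
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Manhattan (L1) distance between every pair of rows, via broadcasting:
# (n, 1, d) - (1, n, d) -> (n, n, d), then sum over the feature axis.
distance_matrix = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)

# Sanity checks that any precomputed distance matrix should pass.
assert distance_matrix.shape == (X.shape[0], X.shape[0])
assert np.allclose(distance_matrix, distance_matrix.T)  # symmetric
assert np.all(np.diag(distance_matrix) == 0)            # zero self-distance
```

The resulting array could then be passed to cfof_clf.compute_from_distance_matrix(distance_matrix) in place of the Euclidean matrix.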
With FastCFOF, use compute to estimate CFOF scores directly from data.
>>> np.random.seed(10)
>>> X = np.random.randint(0, 100, size=(1000, 3))
>>>
>>> fast_cfof_clf = FastCFOF(metric='euclidean',
...                          rhos=[0.001, 0.005, 0.01, 0.05, 0.1],
...                          epsilon=0.1, delta=0.1, n_bins=50, n_jobs=1)
>>> fast_cfof_clf.compute(X)
array([[0.00954095, 0.00954095, 0.01930698, 0.05963623, 0.10481131],
       [0.00954095, 0.00954095, 0.01930698, 0.06866488, 0.10481131],
       [0.00954095, 0.00954095, 0.02559548, 0.06866488, 0.10481131],
       ...,
       [0.00954095, 0.00954095, 0.01930698, 0.05963623, 0.10481131],
       [0.00954095, 0.00954095, 0.03393222, 0.15998587, 0.24420531],
       [0.00954095, 0.00954095, 0.02559548, 0.0390694 , 0.09102982]])
Use compute_from_distance_matrix to compute CFOF scores from a precomputed distance matrix.
>>> from sklearn.metrics import pairwise_distances
>>> distance_matrix = pairwise_distances(X, metric='euclidean')
>>> fast_cfof_clf.compute_from_distance_matrix(distance_matrix)
array([[0.00954095, 0.00954095, 0.01930698, 0.05963623, 0.10481131],
       [0.00954095, 0.00954095, 0.01930698, 0.06866488, 0.10481131],
       [0.00954095, 0.00954095, 0.02559548, 0.06866488, 0.10481131],
       ...,
       [0.00954095, 0.00954095, 0.01930698, 0.05963623, 0.10481131],
       [0.00954095, 0.00954095, 0.03393222, 0.15998587, 0.24420531],
       [0.00954095, 0.00954095, 0.02559548, 0.0390694 , 0.09102982]])
This library also provides a wrapper for pyCFOFiSAX [2]:
>>> from cfof.cfof_isax import CFOFiSAXWrapper
Refer to the pyCFOFiSAX documentation for more details.
- Add support for faiss (GPU).
- Parallelize FastCFOF.
- Add unit tests.
- Add benchmarks.
- Wrap pyCFOFiSAX.
[1] ANGIULLI, Fabrizio. CFOF: a concentration free measure for anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 2020, vol. 14, no 1, p. 1-53.
[2] FOULON, Lucas, FENET, Serge, RIGOTTI, Christophe, et al. Scoring Message Stream Anomalies in Railway Communication Systems. In : 2019 International Conference on Data Mining Workshops (ICDMW). IEEE, 2019. p. 769-776.