diff --git a/components/outlier-detection/README.md b/components/outlier-detection/README.md index 9cf68d476a..987d98a553 100644 --- a/components/outlier-detection/README.md +++ b/components/outlier-detection/README.md @@ -2,8 +2,7 @@ ## Description -[Anomaly or outlier detection](https://en.wikipedia.org/wiki/Anomaly_detection) has many applications, ranging from preventing credit card fraud to detecting computer network intrusions. Seldon Core provides a number of outlier detectors suitable for different use cases. The detectors can be run as a model which is one of the pre-defined types of [predictive units](../../docs/reference/seldon-deployment.md#proto-buffer-definition) in Seldon Core. It is a microservice that makes predictions and can receive feedback rewards. The REST and gRPC internal APIs that the model components must conform to are covered in the [internal API](../../docs/reference/internal-api.md#model) reference. - +[Anomaly or outlier detection](https://en.wikipedia.org/wiki/Anomaly_detection) has many applications, ranging from preventing credit card fraud to detecting computer network intrusions. Seldon Core provides a number of outlier detectors suitable for different use cases. The detectors can be run as models or transformers which are part of the pre-defined types of [predictive units](../../docs/reference/seldon-deployment.md#proto-buffer-definition) in Seldon Core. Models are microservices that make predictions and can receive feedback rewards while the input transformers add the anomaly predictions to the metadata of the underlying model. The REST and gRPC internal APIs that the model and transformer components must conform to are covered in the [internal API](../../docs/reference/internal-api.md) reference. 
## Implementations @@ -15,10 +14,20 @@ The following types of outlier detectors are implemented and showcased with demo The Sequence-to-Sequence LSTM algorithm can be used to detect outliers in time series data, while the other algorithms spot anomalies in tabular data. The Mahalanobis detector works online and does not need to be trained first. The other algorithms are ideally trained on a batch of normal data or data with a low fraction of outliers. +## Implementing custom outlier detectors + +An outlier detection component can be implemented either as a model or input transformer component. If the component is defined as a model, a ```predict``` method needs to be implemented to return the detected anomalies. Optionally, a ```send_feedback``` method can return additional information about the performance of the algorithm. When the component is used as a transformer, the anomaly predictions will occur in the ```transform_input``` method which returns the unchanged input features. The anomaly predictions will then be added to the underlying model's metadata via the ```tags``` method. Both models and transformers can make use of custom metrics defined by the ```metrics``` function. + +The required methods to use the outlier detection algorithms as models or transformers are implemented in the Python files with the ```Core``` prefix. The demos contain clear instructions on how to run your component as a model or transformer. 
+ ## Language specific templates -A reference template for custom model components written in several languages are available: -* [Python](../../wrappers/s2i/python/test/model-template-app/MyModel.py) -* [R](../../wrappers/s2i/R/test/model-template-app/MyModel.R) +Reference templates for custom model and input transformer components written in several languages are available: +* Python + * [model](../../wrappers/s2i/python/test/model-template-app/MyModel.py) + * [transformer](../../wrappers/s2i/python/test/transformer-template-app/MyTransformer.py) +* R + * [model](../../wrappers/s2i/R/test/model-template-app/MyModel.R) + * [transformer](../../wrappers/s2i/R/test/transformer-template-app/MyTransformer.R) Additionally, the [wrappers](../../wrappers/s2i) provide guidelines for implementing the model component in other languages. \ No newline at end of file diff --git a/components/outlier-detection/isolation-forest/CoreIsolationForest.py b/components/outlier-detection/isolation-forest/CoreIsolationForest.py new file mode 100644 index 0000000000..0db0fca41c --- /dev/null +++ b/components/outlier-detection/isolation-forest/CoreIsolationForest.py @@ -0,0 +1,117 @@ +import logging +import numpy as np +import pickle +from sklearn.ensemble import IsolationForest + +logger = logging.getLogger(__name__) + +class CoreIsolationForest(object): + """ Outlier detection using Isolation Forests. 
+ + Parameters + ---------- + threshold (float) : anomaly score threshold; scores below threshold are outliers + + Functions + ---------- + predict : detect and return outliers + transform_input : detect outliers and return input features + send_feedback : add target labels as part of the feedback loop + tags : add metadata for input transformer + metrics : return custom metrics + """ + + def __init__(self,threshold=0.,model_name='if',load_path='./models/'): + + logger.info("Initializing model") + self.threshold = threshold + self.N = 0 # total sample count up until now + self.nb_outliers = 0 + + # load pre-trained model + with open(load_path + model_name + '.pickle', 'rb') as f: + self.clf = pickle.load(f) + + + def predict(self, X, feature_names): + """ Return outlier predictions. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as a model") + return self._get_preds(X) + + + def transform_input(self, X, feature_names): + """ Transform the input. + Used when the outlier detector sits on top of another model. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as an outlier-detector transformer") + self.prediction_meta = self._get_preds(X) + return X + + + def _get_preds(self,X): + """ Detect outliers below the anomaly score threshold. + + Parameters + ---------- + X : array-like + """ + self.decision_val = self.clf.decision_function(X) # anomaly scores + + # make prediction + self.prediction = (self.decision_val < self.threshold).astype(int) # scores below threshold are outliers + + self.N+=self.prediction.shape[0] # update counter + + return self.prediction + + + def send_feedback(self,X,feature_names,reward,truth): + """ Return additional data as part of the feedback loop. 
+ + Parameters + ---------- + X : array of the features sent in the original predict request + feature_names : array of feature names. May be None if not available. + reward (float): the reward + truth : array with correct value (optional) + """ + logger.info("Send feedback called") + return [] + + + def tags(self): + """ + Use predictions made within transform to add these as metadata + to the response. Tags will only be collected if the component is + used as an input-transformer. + """ + try: + return {"outlier-predictions": self.prediction_meta.tolist()} + except AttributeError: + logger.info("No metadata about outliers") + + + def metrics(self): + """ Return custom metrics averaged over the prediction batch. + """ + self.nb_outliers += np.sum(self.prediction) + + is_outlier = {"type":"GAUGE","key":"is_outlier","value":np.mean(self.prediction)} + anomaly_score = {"type":"GAUGE","key":"anomaly_score","value":np.mean(self.decision_val)} + nb_outliers = {"type":"GAUGE","key":"nb_outliers","value":int(self.nb_outliers)} + fraction_outliers = {"type":"GAUGE","key":"fraction_outliers","value":int(self.nb_outliers)/self.N} + obs = {"type":"GAUGE","key":"observation","value":self.N} + threshold = {"type":"GAUGE","key":"threshold","value":self.threshold} + + return [is_outlier,anomaly_score,nb_outliers,fraction_outliers,obs,threshold] \ No newline at end of file diff --git a/components/outlier-detection/isolation-forest/OutlierIsolationForest.py b/components/outlier-detection/isolation-forest/OutlierIsolationForest.py index 1c66f53810..a56ba32085 100644 --- a/components/outlier-detection/isolation-forest/OutlierIsolationForest.py +++ b/components/outlier-detection/isolation-forest/OutlierIsolationForest.py @@ -1,28 +1,23 @@ import numpy as np -import pickle -from sklearn.ensemble import IsolationForest +from CoreIsolationForest import CoreIsolationForest from utils import flatten, performance, outlier_stats -class OutlierIsolationForest(object): +class 
OutlierIsolationForest(CoreIsolationForest): """ Outlier detection using Isolation Forests. - Arguments: - - threshold (float): anomaly score threshold; scores below threshold are outliers + Parameters + ---------- + threshold (float) : anomaly score threshold; scores below threshold are outliers - Functions: - - predict: detect and return outliers - - send_feedback: add target labels as part of the feedback loop - - metrics: return custom metrics + Functions + ---------- + send_feedback : add target labels as part of the feedback loop + metrics : return custom metrics """ - def __init__(self,threshold=0.,load_path='./models/'): + def __init__(self,threshold=0.,model_name='if',load_path='./models/'): - self.threshold = threshold - self.N = 0 # total sample count up until now - - # load pre-trained model - with open(load_path + 'model.pickle', 'rb') as f: - self.clf = pickle.load(f) + super().__init__(threshold=threshold, model_name=model_name, load_path=load_path) self._predictions = [] self._labels = [] @@ -30,39 +25,31 @@ def __init__(self,threshold=0.,load_path='./models/'): self.roll_window = 100 self.metric = [float('nan') for i in range(18)] - def predict(self,X,feature_names): - """ Detect outliers from mse using the threshold. + + def send_feedback(self,X,feature_names,reward,truth): + """ Return outlier labels as part of the feedback loop. - Arguments: - - X: input data - - feature_names + Parameters + ---------- + X : array of the features sent in the original predict request + feature_names : array of feature names. May be None if not available. 
+ reward (float): the reward + truth : array with correct value (optional) """ - self.decision_val = self.clf.decision_function(X) # anomaly scores + _ = super().send_feedback(X,feature_names,reward,truth) + + # historical reconstruction errors and predictions self._anomaly_score.append(self.decision_val) self._anomaly_score = flatten(self._anomaly_score) - - # make prediction - self.prediction = (self.decision_val < self.threshold).astype(int) # scores below threshold are outliers self._predictions.append(self.prediction) self._predictions = flatten(self._predictions) - self.N+=self.prediction.shape[0] # update counter - - return self.prediction - - def send_feedback(self,X,feature_names,reward,truth): - """ Return outlier labels as part of the feedback loop. - - Arguments: - - X: input data - - feature_names - - reward - - truth: outlier labels - """ + # target labels self.label = truth self._labels.append(self.label) self._labels = flatten(self._labels) + # performance metrics scores = performance(self._labels,self._predictions,roll_window=self.roll_window) stats = outlier_stats(self._labels,self._predictions,roll_window=self.roll_window) @@ -71,8 +58,9 @@ def send_feedback(self,X,feature_names,reward,truth): for c in convert: # convert from np to native python type to jsonify metric.append(np.asscalar(np.asarray(c))) self.metric = metric - - return + + return [] + def metrics(self): """ Return custom metrics. 
@@ -87,8 +75,8 @@ def metrics(self): dec_val = float('nan') y_true = float('nan') else: - pred = int(self._predictions[-2]) - dec_val = self._anomaly_score[-2] + pred = int(self._predictions[-1]) + dec_val = self._anomaly_score[-1] y_true = int(self.label[0]) is_outlier = {"type":"GAUGE","key":"is_outlier","value":pred} diff --git a/components/outlier-detection/isolation-forest/README.md b/components/outlier-detection/isolation-forest/README.md index b57767233a..0ec8c10f10 100644 --- a/components/outlier-detection/isolation-forest/README.md +++ b/components/outlier-detection/isolation-forest/README.md @@ -6,10 +6,17 @@ ## Implementation -The Isolation Forest is trained by running the ```train.py``` script. The ```OutlierIsolationForest``` class loads a pre-trained model and makes predictions on new data. +The Isolation Forest is trained by running the ```train.py``` script. The ```OutlierIsolationForest``` class inherits from ```CoreIsolationForest``` which loads a pre-trained model and can make predictions on new data. -A detailed explanation of the implementation and usage of Isolation Forests as outlier detectors can be found in the [isolation_forest_doc](./isolation_forest_doc.ipynb) notebook. +A detailed explanation of the implementation and usage of Isolation Forests as outlier detectors can be found in the [isolation forest doc](./doc.md). ## Running on Seldon -An end-to-end example running an Isolation Forest outlier detector on GCP or Minikube using Seldon to identify computer network intrusions is available [here](./isolation_forest.ipynb). \ No newline at end of file +An end-to-end example running an Isolation Forest outlier detector on GCP or Minikube using Seldon to identify computer network intrusions is available [here](./isolation_forest.ipynb). 
+ +Docker images to use the generic Isolation Forest outlier detector as a model or transformer can be found on Docker Hub: +* [seldonio/outlier-if-model](https://hub.docker.com/r/seldonio/outlier-if-model) +* [seldonio/outlier-if-transformer](https://hub.docker.com/r/seldonio/outlier-if-transformer) + +A model docker image specific for the demo is also available: +* [seldonio/outlier-if-model-demo](https://hub.docker.com/r/seldonio/outlier-if-model-demo) \ No newline at end of file diff --git a/components/outlier-detection/isolation-forest/doc.md b/components/outlier-detection/isolation-forest/doc.md new file mode 100644 index 0000000000..111341ed7f --- /dev/null +++ b/components/outlier-detection/isolation-forest/doc.md @@ -0,0 +1,130 @@ +# Isolation Forest (IF) Algorithm Documentation + +The aim of this document is to explain the Isolation Forest algorithm in Seldon's outlier detection framework. + +First, we provide a high-level overview of the algorithm and the use case, then we give a detailed explanation of the implementation. + +## Overview + +Outlier detection has many applications, ranging from preventing credit card fraud to detecting computer network intrusions. The available data is typically unlabeled and detection needs to be done in real-time. The outlier detector can be used as a standalone algorithm, or to detect anomalies in the input data of another predictive model. + +The IF outlier detection algorithm predicts whether the input features are an outlier or not, depending on a threshold level set by the user. The algorithm needs to be pretrained first on a representative batch of data. + +As observations arrive, the algorithm will: +- calculate an anomaly score for the observation +- predict that the observation is an outlier if the anomaly score is below the threshold level + +## Why Isolation Forests? + +Isolation forests are tree-based models specifically used for outlier detection. 
The IF isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. The number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of random trees, is a measure of normality and is used to define an anomaly score. Outliers can typically be isolated more quickly, leading to shorter paths. In the scikit-learn implementation, lower anomaly scores indicate that the probability of an observation being an outlier is higher. + +## Implementation + +### 1. Defining and training the IF model + +The model takes 4 hyperparameters: + +- contamination: the fraction of expected outliers in the data set +- number of estimators: the number of base estimators; number of trees in the forest +- max samples: fraction of samples used for each base estimator +- max features: fraction of features used for each base estimator + +``` python +!python train.py \ +--dataset 'kddcup99' \ +--samples 50000 \ +--keep_cols "$cols_str" \ +--contamination .1 \ +--n_estimators 100 \ +--max_samples .8 \ +--max_features 1. \ +--save_path './models/' +``` + +The model is saved in the folder specified by "save_path". + +### 2. Making predictions + +In order to make predictions, which can then be served by Seldon Core, the pre-trained model is loaded when defining an OutlierIsolationForest object. The "threshold" argument sets the anomaly score below which a sample is classified as an outlier. The threshold is a key hyperparameter and needs to be picked carefully for each application. The OutlierIsolationForest class inherits from the CoreIsolationForest class in ```CoreIsolationForest.py```. + +``` python +class CoreIsolationForest(object): + """ Outlier detection using Isolation Forests. 
+ + Parameters + ---------- + threshold (float) : anomaly score threshold; scores below threshold are outliers + + Functions + ---------- + predict : detect and return outliers + transform_input : detect outliers and return input features + send_feedback : add target labels as part of the feedback loop + tags : add metadata for input transformer + metrics : return custom metrics + """ + + def __init__(self,threshold=0.,model_name='if',load_path='./models/'): + + logger.info("Initializing model") + self.threshold = threshold + self.N = 0 # total sample count up until now + self.nb_outliers = 0 + + # load pre-trained model + with open(load_path + model_name + '.pickle', 'rb') as f: + self.clf = pickle.load(f) +``` + +```python +class OutlierIsolationForest(CoreIsolationForest): + """ Outlier detection using Isolation Forests. + + Parameters + ---------- + threshold (float) : anomaly score threshold; scores below threshold are outliers + + Functions + ---------- + send_feedback : add target labels as part of the feedback loop + metrics : return custom metrics + """ + def __init__(self,threshold=0.,model_name='if',load_path='./models/'): + + super().__init__(threshold=threshold, model_name=model_name, load_path=load_path) +``` + +The actual outlier detection is done by the ```_get_preds``` method, which is invoked by ```predict``` or ```transform_input``` depending on whether the detector is defined as a model or a transformer, respectively. + +``` python +def predict(self, X, feature_names): + """ Return outlier predictions. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as a model") + return self._get_preds(X) +``` + +```python +def transform_input(self, X, feature_names): + """ Transform the input. + Used when the outlier detector sits on top of another model. 
+ + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as an outlier-detector transformer") + self.prediction_meta = self._get_preds(X) + return X +``` + +## References + +Scikit-learn Isolation Forest: +- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html \ No newline at end of file diff --git a/components/outlier-detection/isolation-forest/isolation_forest.ipynb b/components/outlier-detection/isolation-forest/isolation_forest.ipynb index ed80c52b06..24eb0c67a7 100644 --- a/components/outlier-detection/isolation-forest/isolation_forest.ipynb +++ b/components/outlier-detection/isolation-forest/isolation_forest.ipynb @@ -73,7 +73,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -93,24 +93,9 @@ }, { "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Load dataset\n", - "\n", - "Generate training batch\n", - "\n", - "Train outlier detector\n", - "\n", - "Training done!\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!python train.py \\\n", "--dataset 'kddcup99' \\\n", @@ -130,6 +115,37 @@ "## Test using Kubernetes cluster on GCP or Minikube" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the outlier detector as a model or a transformer. 
If you want to run the anomaly detector as a transformer, change the SERVICE_TYPE variable from MODEL to TRANSFORMER [here](./.s2i/environment), set MODEL = False and change ```OutlierIsolationForest.py``` to:\n", + "\n", + "```python\n", + "from CoreIsolationForest import CoreIsolationForest\n", + "\n", + "class OutlierIsolationForest(CoreIsolationForest):\n", + " \"\"\" Outlier detection using Isolation Forests.\n", + "\n", + " Parameters\n", + " ----------\n", + " threshold (float) : anomaly score threshold; scores below threshold are outliers\n", + " \"\"\"\n", + " def __init__(self,threshold=0.,load_path='./models/'):\n", + "\n", + " super().__init__(threshold=threshold, load_path=load_path)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MODEL = True" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -139,29 +155,20 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "minikube = False" + "MINIKUBE = True" ] }, { "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Fetching cluster endpoint and auth data.\n", - "kubeconfig entry generated for standard-cluster-1.\n" - ] - } - ], - "source": [ - "if minikube:\n", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if MINIKUBE:\n", " !minikube start --memory 4096 --feature-gates=CustomResourceValidation=true \\\n", " --extra-config=apiserver.Authorization.Mode=RBAC\n", "else:\n", @@ -177,17 +184,9 @@ }, { "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "clusterrolebinding.rbac.authorization.k8s.io/kube-system-cluster-admin created\r\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!kubectl create 
clusterrolebinding kube-system-cluster-admin --clusterrole=cluster-admin \\\n", "--serviceaccount=kube-system:default" @@ -195,17 +194,9 @@ }, { "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "namespace/seldon created\r\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!kubectl create namespace seldon" ] @@ -219,17 +210,9 @@ }, { "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Context \"gke_seldon-demos_europe-west1-b_standard-cluster-1\" modified.\r\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!kubectl config set-context $(kubectl config current-context) --namespace=seldon" ] @@ -243,22 +226,9 @@ }, { "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "serviceaccount/tiller created\n", - "clusterrolebinding.rbac.authorization.k8s.io/tiller created\n", - "$HELM_HOME has been configured at /home/arnaud/.helm.\n", - "\n", - "Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.\n", - "Happy Helming!\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!kubectl -n kube-system create sa tiller\n", "!kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller\n", @@ -274,69 +244,20 @@ }, { "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Waiting for deployment \"tiller-deploy\" rollout to finish: 0 of 1 updated replicas are available...\n", - "deployment \"tiller-deploy\" successfully rolled out\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!kubectl 
rollout status deploy/tiller-deploy -n kube-system" ] }, { "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NAME: seldon-core-crd\n", - "LAST DEPLOYED: Tue Dec 4 14:53:17 2018\n", - "NAMESPACE: seldon\n", - "STATUS: DEPLOYED\n", - "\n", - "RESOURCES:\n", - "==> v1beta1/Deployment\n", - "NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE\n", - "seldon-spartakus-volunteer 1 0 0 0 1s\n", - "\n", - "==> v1beta1/CustomResourceDefinition\n", - "NAME KIND\n", - "seldondeployments.machinelearning.seldon.io CustomResourceDefinition.v1beta1.apiextensions.k8s.io\n", - "\n", - "==> v1/ServiceAccount\n", - "NAME SECRETS AGE\n", - "seldon-spartakus-volunteer 1 0s\n", - "\n", - "==> v1beta1/ClusterRole\n", - "NAME AGE\n", - "seldon-spartakus-volunteer 0s\n", - "\n", - "==> v1beta1/ClusterRoleBinding\n", - "NAME AGE\n", - "seldon-spartakus-volunteer 0s\n", - "\n", - "==> v1/ConfigMap\n", - "NAME DATA AGE\n", - "seldon-spartakus-config 3 1s\n", - "\n", - "\n", - "NOTES:\n", - "NOTES: TODO\n", - "\n", - "\n" - ] - } - ], + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], "source": [ "!helm install ../../../helm-charts/seldon-core-crd --name seldon-core-crd \\\n", " --set usage_metrics.enabled=true" @@ -344,70 +265,11 @@ }, { "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NAME: seldon-core\n", - "LAST DEPLOYED: Tue Dec 4 14:53:19 2018\n", - "NAMESPACE: seldon\n", - "STATUS: DEPLOYED\n", - "\n", - "RESOURCES:\n", - "==> v1/ClusterRoleBinding\n", - "NAME KIND SUBJECTS\n", - "seldon-seldon ClusterRoleBinding.v1.rbac.authorization.k8s.io 1 item(s)\n", - "\n", - "==> v1beta1/Role\n", - "NAME AGE\n", - "ambassador 1s\n", - "seldon-local 1s\n", - "\n", - "==> v1beta1/RoleBinding\n", - "NAME AGE\n", - "ambassador 1s\n", - "\n", - "==> v1/RoleBinding\n", - "NAME KIND 
SUBJECTS\n", - "seldon RoleBinding.v1.rbac.authorization.k8s.io 1 item(s)\n", - "\n", - "==> v1/Service\n", - "NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE\n", - "seldon-core-ambassador-admin 10.7.243.27 8877:30877/TCP 1s\n", - "seldon-core-ambassador 10.7.244.21 80:30645/TCP,443:32092/TCP 1s\n", - "seldon-core-seldon-apiserver 10.7.244.66 8080:32315/TCP,5000:32435/TCP 1s\n", - "seldon-core-redis 10.7.242.105 6379/TCP 1s\n", - "\n", - "==> v1beta1/Deployment\n", - "NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE\n", - "seldon-core-ambassador 1 1 1 0 1s\n", - "seldon-core-seldon-apiserver 1 1 1 0 1s\n", - "seldon-core-seldon-cluster-manager 1 1 1 0 1s\n", - "seldon-core-redis 1 1 1 0 1s\n", - "\n", - "==> v1/ServiceAccount\n", - "NAME SECRETS AGE\n", - "seldon 1 1s\n", - "\n", - "==> v1beta1/ClusterRole\n", - "NAME AGE\n", - "seldon-crd-seldon 1s\n", - "\n", - "\n", - "NOTES:\n", - "Thank you for installing Seldon Core.\n", - "\n", - "Documentation can be found at https://github.com/SeldonIO/seldon-core\n", - "\n", - "\n", - "\n", - "\n" - ] - } - ], + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], "source": [ "!helm install ../../../helm-charts/seldon-core --name seldon-core \\\n", " --namespace seldon \\\n", @@ -423,19 +285,9 @@ }, { "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Waiting for deployment \"seldon-core-seldon-cluster-manager\" rollout to finish: 0 of 1 updated replicas are available...\n", - "deployment \"seldon-core-seldon-cluster-manager\" successfully rolled out\n", - "deployment \"seldon-core-seldon-apiserver\" successfully rolled out\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!kubectl rollout status deploy/seldon-core-seldon-cluster-manager -n seldon\n", "!kubectl rollout status deploy/seldon-core-seldon-apiserver -n seldon" @@ -445,56 +297,60 @@ "cell_type": "markdown", 
"metadata": {}, "source": [ - "If Minikube used: create docker image for outlier detector inside Minikube using s2i." + "If Minikube used: create docker image for outlier detector inside Minikube using s2i. Besides the transformer image and the demo specific model image, the general model image for the Isolation Forest outlier detector is also available from Docker Hub as ***seldonio/outlier-if-model:0.1***." ] }, { "cell_type": "code", - "execution_count": 12, - "metadata": {}, + "execution_count": null, + "metadata": { + "scrolled": true + }, "outputs": [], "source": [ - "if minikube:\n", - " !eval $(minikube docker-env) && s2i build . seldonio/seldon-core-s2i-python3:0.4-SNAPSHOT seldonio/outlier-if:0.1" + "if MINIKUBE & MODEL:\n", + " !eval $(minikube docker-env) && \\\n", + " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-if-model-demo:0.1\n", + "elif MINIKUBE:\n", + " !eval $(minikube docker-env) && \\\n", + " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-if-transformer:0.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Install outlier detector helm charts and set \"threshold\" hyperparameter value." + "Install outlier detector helm charts either as a model or transformer and set *threshold* hyperparameter value." 
] }, { "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NAME: outlier-detector\n", - "LAST DEPLOYED: Tue Dec 4 14:54:28 2018\n", - "NAMESPACE: seldon\n", - "STATUS: DEPLOYED\n", - "\n", - "RESOURCES:\n", - "==> v1alpha2/SeldonDeployment\n", - "NAME KIND\n", - "outlier-detector SeldonDeployment.v1alpha2.machinelearning.seldon.io\n", - "\n", - "\n" - ] - } - ], - "source": [ - "!helm install ../../../helm-charts/seldon-od-if \\\n", - " --set model.image.name=seldonio/outlier-if:0.1 \\\n", - " --set model.threshold=0.04 \\\n", - " --name outlier-detector --set oauth.key=oauth-key \\\n", - " --set oauth.secret=oauth-secret \\\n", - " --namespace=seldon" + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if MODEL:\n", + " !helm install ../../../helm-charts/seldon-od-model \\\n", + " --name outlier-detector \\\n", + " --namespace=seldon \\\n", + " --set model.type=isolationforest \\\n", + " --set model.isolationforest.image.name=seldonio/outlier-if-model-demo:0.1 \\\n", + " --set model.isolationforest.threshold=0 \\\n", + " --set oauth.key=oauth-key \\\n", + " --set oauth.secret=oauth-secret \\\n", + " --set replicas=1\n", + "else:\n", + " !helm install ../../../helm-charts/seldon-od-transformer \\\n", + " --name outlier-detector \\\n", + " --namespace=seldon \\\n", + " --set outlierDetection.enabled=true \\\n", + " --set outlierDetection.name=outlier-if \\\n", + " --set outlierDetection.type=isolationforest \\\n", + " --set outlierDetection.isolationforest.image.name=seldonio/outlier-if-transformer:0.1 \\\n", + " --set outlierDetection.isolationforest.threshold=0 \\\n", + " --set oauth.key=oauth-key \\\n", + " --set oauth.secret=oauth-secret \\\n", + " --set model.image.name=seldonio/mock_classifier:1.0" ] }, { @@ -524,21 +380,13 @@ }, { "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - 
"output_type": "stream", - "text": [ - "(4898431, 53)\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from utils import get_payload, rest_request_ambassador, send_feedback_rest, get_kdd_data, generate_batch\n", "\n", - "data = get_kdd_data(keep_cols=cols) # load dataset\n", + "data = get_kdd_data(keep_cols=cols,percent10=True) # load dataset\n", "print(data.shape)" ] }, @@ -551,18 +399,9 @@ }, { "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(1, 52)\n", - "(1,)\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import numpy as np\n", "\n", @@ -582,7 +421,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -591,183 +430,30 @@ }, { "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "200\n", - "{\n", - " \"meta\": {\n", - " \"puid\": \"1qlvfh1d89upve1707bd9co8d5\",\n", - " \"tags\": {\n", - " },\n", - " \"routing\": {\n", - " },\n", - " \"requestPath\": {\n", - " \"outlier-if\": \"seldonio/outlier-if:0.1\"\n", - " },\n", - " \"metrics\": [{\n", - " \"key\": \"is_outlier\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"anomaly_score\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"observation\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": 0.0,\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"threshold\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": 0.04,\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"label\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"accuracy_tot\",\n", - 
" \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"precision_tot\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"recall_tot\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"f1_tot\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"f2_tot\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"accuracy_roll\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"precision_roll\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"recall_roll\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"f1_roll\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"f2_roll\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"true_negative\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"false_positive\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"false_negative\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"true_positive\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"nb_outliers_roll\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": 
\"nb_labels_roll\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"nb_outliers_tot\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }, {\n", - " \"key\": \"nb_labels_tot\",\n", - " \"type\": \"GAUGE\",\n", - " \"value\": \"NaN\",\n", - " \"tags\": {\n", - " }\n", - " }]\n", - " },\n", - " \"data\": {\n", - " \"names\": [],\n", - " \"ndarray\": [0.0]\n", - " }\n", - "}\n" - ] - } - ], + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], "source": [ "response = rest_request_ambassador(\"outlier-detector\",request,endpoint=\"localhost:8003\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the outlier detector is used as a transformer, the output of the anomaly detection is added as part of the metadata. If it is used as a model, we send model feedback to retrieve custom performance metrics." + ] + }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")" + "if MODEL:\n", + " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")" ] }, { @@ -786,71 +472,11 @@ }, { "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NAME: seldon-core-analytics\n", - "LAST DEPLOYED: Tue Dec 4 14:57:24 2018\n", - "NAMESPACE: seldon\n", - "STATUS: DEPLOYED\n", - "\n", - "RESOURCES:\n", - "==> v1/ServiceAccount\n", - "NAME SECRETS AGE\n", - "prometheus 1 2s\n", - "\n", - "==> v1beta1/ClusterRole\n", - "NAME AGE\n", - "prometheus 2s\n", - "\n", - "==> v1/Job\n", - "NAME DESIRED SUCCESSFUL AGE\n", - "grafana-prom-import-dashboards 1 0 2s\n", - "\n", - "==> v1beta1/Deployment\n", - "NAME DESIRED CURRENT UP-TO-DATE AVAILABLE 
AGE\n", - "alertmanager-deployment 1 1 1 0 2s\n", - "grafana-prom-deployment 1 1 1 0 2s\n", - "prometheus-deployment 1 1 1 0 2s\n", - "\n", - "==> v1beta1/DaemonSet\n", - "NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE-SELECTOR AGE\n", - "prometheus-node-exporter 1 1 0 1 0 2s\n", - "\n", - "==> v1/ConfigMap\n", - "NAME DATA AGE\n", - "alertmanager-server-conf 1 2s\n", - "grafana-import-dashboards 9 2s\n", - "prometheus-rules 0 2s\n", - "prometheus-server-conf 1 2s\n", - "\n", - "==> v1beta1/ClusterRoleBinding\n", - "NAME AGE\n", - "prometheus 2s\n", - "\n", - "==> v1/Service\n", - "NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE\n", - "alertmanager 10.7.248.9 80/TCP 2s\n", - "grafana-prom 10.7.255.122 80:31668/TCP 2s\n", - "prometheus-node-exporter None 9100/TCP 2s\n", - "prometheus-seldon 10.7.254.117 80/TCP 2s\n", - "\n", - "==> v1/Secret\n", - "NAME TYPE DATA AGE\n", - "grafana-prom-secret Opaque 1 2s\n", - "\n", - "\n", - "NOTES:\n", - "NOTES: TODO\n", - "\n", - "\n" - ] - } - ], + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], "source": [ "!helm install ../../../helm-charts/seldon-core-analytics --name seldon-core-analytics \\\n", " --set grafana_prom_admin_password=password \\\n", @@ -903,16 +529,16 @@ "- Sample random network intrusion data with a certain outlier probability.\n", "- Get payload for the observation.\n", "- Make a prediction.\n", - "- Send the \"true\" label with the feedback.\n", + "- Send the \"true\" label with the feedback if the detector is run as a model.\n", "\n", "It is important that the prediction-feedback order is maintained. Otherwise there will be a mismatch between the predicted and \"true\" labels.\n", "\n", - "View the progress on the grafana \"Outlier Detection\" dashboard." + "View the progress on the grafana \"Outlier Detection\" dashboard. Most metrics need the outlier detector to be run as a model since they need model feedback." 
] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -924,8 +550,19 @@ " X, labels = generate_batch(data,samples,fraction_outlier)\n", " request = get_payload(X)\n", " response = rest_request_ambassador(\"outlier-detector\",request,endpoint=\"localhost:8003\")\n", - " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")\n", - " #time.sleep(1)" + " if MODEL:\n", + " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")\n", + " time.sleep(1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if MINIKUBE:\n", + " !minikube delete" ] }, { @@ -952,7 +589,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.6.6" } }, "nbformat": 4, diff --git a/components/outlier-detection/isolation-forest/isolation_forest_doc.ipynb b/components/outlier-detection/isolation-forest/isolation_forest_doc.ipynb deleted file mode 100644 index c523924c96..0000000000 --- a/components/outlier-detection/isolation-forest/isolation_forest_doc.ipynb +++ /dev/null @@ -1,210 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Isolation Forest (IF) Algorithm Documentation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The aim of this document is to explain the Isolation Forest algorithm in Seldon's outlier detection framework.\n", - "\n", - "First, we provide a high level overview of the algorithm and the use case, then we will give a detailed explanation of the implementation." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Outlier detection has many applications, ranging from preventing credit card fraud to detecting computer network intrusions. The available data is typically unlabeled and detection needs to be done in real-time. The outlier detector can be used as a standalone algorithm, or to detect anomalies in the input data of another predictive model.\n", - "\n", - "The IF outlier detection algorithm predicts whether the input features are an outlier or not, dependent on a threshold level set by the user. The algorithm needs to be pretrained first on a representable batch of data.\n", - "\n", - "As observations arrive, the algorithm will:\n", - "- calculate an anomaly score for the observation\n", - "- predict that the observation is an outlier if the anomaly score is below the threshold level" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Why Isolation Forests?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Isolation forests are tree based models specifically used for outlier detection. The IF isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. The number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of random trees, is a measure of normality and is used to define an anomaly score. Outliers can typically be isolated quicker, leading to shorter paths. In the scikit-learn implementation, lower anomaly scores indicate that the probability of an observation being an outlier is higher." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Implementation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Defining and training the IF model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The model takes 4 hyperparameters:\n", - "\n", - "- contamination: the fraction of expected outliers in the data set\n", - "- number of estimators: the number of base estimators; number of trees in the forest\n", - "- max samples: fraction of samples used for each base estimator\n", - "- max features: fraction of features used for each base estimator" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - "!python train.py \\\n", - "--dataset 'kddcup99' \\\n", - "--samples 50000 \\\n", - "--keep_cols \"$cols_str\" \\\n", - "--contamination .1 \\\n", - "--n_estimators 100 \\\n", - "--max_samples .8 \\\n", - "--max_features 1. \\\n", - "--save_path './models/'\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The model is saved in the folder specified by \"save_path\"." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Making predictions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In order to make predictions, which can then be served by Seldon Core, the pre-trained model is loaded when defining an OutlierIsolationForest object. The \"threshold\" argument defines below which anomaly score a sample is classified as an outlier. The threshold is a key hyperparameter and needs to be picked carefully for each application." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - "class OutlierIsolationForest(object):\n", - " \"\"\" Outlier detection using Isolation Forests.\n", - " \n", - " Arguments:\n", - " - threshold (float): anomaly score threshold; scores below threshold are outliers\n", - " \n", - " Functions:\n", - " - predict: detect and return outliers\n", - " - send_feedback: add target labels as part of the feedback loop\n", - " - metrics: return custom metrics\n", - " \"\"\"\n", - " def __init__(self,threshold=0.,load_path='./models/'):\n", - " \n", - " self.threshold = threshold\n", - " self.N = 0 # total sample count up until now\n", - " \n", - " # load pre-trained model\n", - " with open(load_path + 'model.pickle', 'rb') as f:\n", - " self.clf = pickle.load(f)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The predict method does the actual outlier detection." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - " def predict(self,X,feature_names):\n", - " \"\"\" Detect outliers from mse using the threshold. 
\n", - " \n", - " Arguments:\n", - " - X: input data\n", - " - feature_names\n", - " \"\"\"\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## References" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Scikit-learn Isolation Forest:\n", - "- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/components/outlier-detection/isolation-forest/requirements.txt b/components/outlier-detection/isolation-forest/requirements.txt index 3d6ca9e456..772f39f27b 100644 --- a/components/outlier-detection/isolation-forest/requirements.txt +++ b/components/outlier-detection/isolation-forest/requirements.txt @@ -1,6 +1,6 @@ numpy==1.14.5 argparse==1.1 pandas==0.23.4 -scikit-learn==0.19.1 +scikit-learn==0.20.1 scipy==1.1.0 requests>=2.20.0 \ No newline at end of file diff --git a/components/outlier-detection/isolation-forest/train.py b/components/outlier-detection/isolation-forest/train.py index be0e71f685..e799a37120 100644 --- a/components/outlier-detection/isolation-forest/train.py +++ b/components/outlier-detection/isolation-forest/train.py @@ -13,13 +13,6 @@ # default args DATASET = 'kddcup99' SAMPLES = 50000 -SAVE = True -SAVE_PATH = './models/' -# Isolation Forest hyperparameters -CONTAMINATION = .1 -N_ESTIMATORS = 50 -MAX_SAMPLES = .8 -MAX_FEATURES = 1. 
COLS = str(['duration','protocol_type','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot', 'num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations', 'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','srv_count', @@ -27,6 +20,15 @@ 'srv_diff_host_rate','dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate', 'dst_host_same_src_port_rate','dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate', 'dst_host_rerror_rate','dst_host_srv_rerror_rate','target']) +MODEL_NAME = 'if' +SAVE = True +SAVE_PATH = './models/' + +# Isolation Forest hyperparameters +CONTAMINATION = .1 +N_ESTIMATORS = 50 +MAX_SAMPLES = .8 +MAX_FEATURES = 1. def train(X,args): """ Fit Isolation Forest. """ @@ -36,7 +38,7 @@ def train(X,args): clf.fit(X) if args.save: # save model - with open(args.save_path + 'model.pickle', 'wb') as f: + with open(args.save_path + args.model_name + '.pickle', 'wb') as f: pickle.dump(clf,f) def run(args): @@ -67,6 +69,7 @@ def run(args): parser.add_argument('--n_estimators',type=int,default=N_ESTIMATORS) parser.add_argument('--max_samples',type=float,default=MAX_SAMPLES) parser.add_argument('--max_features',type=float,default=MAX_FEATURES) + parser.add_argument('--model_name',type=str,default=MODEL_NAME) parser.add_argument('--save', default=SAVE, action='store_false') parser.add_argument('--save_path',type=str,default=SAVE_PATH) args = parser.parse_args() diff --git a/components/outlier-detection/mahalanobis/CoreMahalanobis.py b/components/outlier-detection/mahalanobis/CoreMahalanobis.py new file mode 100644 index 0000000000..ac90c553de --- /dev/null +++ b/components/outlier-detection/mahalanobis/CoreMahalanobis.py @@ -0,0 +1,192 @@ +import logging +import numpy as np +from scipy.linalg import eigh + +logger = logging.getLogger(__name__) + +class CoreMahalanobis(object): + """ Outlier detection using 
the Mahalanobis distance. + + Parameters + ---------- + threshold (float) : Mahalanobis distance threshold used to classify outliers + n_components (int) : number of principal components used + n_stdev (float) : stdev used for feature-wise clipping of observations + start_clip (int) : number of observations before clipping is applied + max_n (int) : algorithm behaves as if it has seen at most max_n points + + Functions + ---------- + predict : detect and return outliers + transform_input : detect outliers and return input features + send_feedback : add target labels as part of the feedback loop + tags : add metadata for input transformer + metrics : return custom metrics + """ + def __init__(self,threshold=25,n_components=3,n_stdev=3,start_clip=50,max_n=-1): + + logger.info("Initializing model") + self.threshold = threshold + self.n_components = n_components + self.max_n = max_n + self.n_stdev = n_stdev + self.start_clip = start_clip + + self.clip = None + self.mean = 0 + self.C = 0 + self.n = 0 + self.nb_outliers = 0 + + + def predict(self, X, feature_names): + """ Return outlier predictions. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as a model") + return self._get_preds(X) + + + def transform_input(self, X, feature_names): + """ Transform the input. + Used when the outlier detector sits on top of another model. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as an outlier-detector transformer") + self.prediction_meta = self._get_preds(X) + return X + + + def _get_preds(self,X): + """ Detect outliers using the Mahalanobis distance threshold. 
+ + Parameters + ---------- + X : array-like + """ + + nb = X.shape[0] # batch size + p = X.shape[1] # number of features + n_components = min(self.n_components,p) + if self.max_n>0: + n = min(self.n,self.max_n) # n can never be above max_n + else: + n = self.n + + # Clip X + if self.n > self.start_clip: + Xclip = np.clip(X,self.clip[0],self.clip[1]) + else: + Xclip = X + + # Tracking the mean and covariance matrix + roll_partial_means = Xclip.cumsum(axis=0)/(np.arange(nb)+1).reshape((nb,1)) + coefs = (np.arange(nb)+1.)/(np.arange(nb)+n+1.) + new_means = self.mean + coefs.reshape((nb,1))*(roll_partial_means-self.mean) + new_means_offset = np.empty_like(new_means) + new_means_offset[0] = self.mean + new_means_offset[1:] = new_means[:-1] + + coefs = ((n+np.arange(nb))/(n+np.arange(nb)+1.)).reshape((nb,1,1)) + B = coefs*np.matmul((Xclip - new_means_offset)[:,:,None],(Xclip - new_means_offset)[:,None,:]) + cov_batch = (n-1.)/(n+max(1,nb-1.))*self.C + 1./(n+max(1,nb-1.))*B.sum(axis=0) + + # PCA + eigvals, eigvects = eigh(cov_batch,eigvals=(p-n_components,p-1)) + + # Projections + proj_x = np.matmul(X,eigvects) + proj_x_clip = np.matmul(Xclip,eigvects) + proj_means = np.matmul(new_means_offset,eigvects) + if type(self.C) == int and self.C == 0: + proj_cov = np.diag(np.zeros(n_components)) + else: + proj_cov = np.matmul(eigvects.transpose(),np.matmul(self.C,eigvects)) + + # Outlier detection in the PC subspace + coefs = (1./(n+np.arange(nb)+1.)).reshape((nb,1,1)) + B = coefs*np.matmul((proj_x_clip - proj_means)[:,:,None],(proj_x_clip - proj_means)[:,None,:]) + + all_C_inv = np.zeros_like(B) + c_inv = None + _EPSILON = 1e-8 + + for i, b in enumerate(B): + if c_inv is None: + if abs(np.linalg.det(proj_cov)) > _EPSILON: + c_inv = np.linalg.inv(proj_cov) + all_C_inv[i] = c_inv + continue + else: + if n + i == 0: + continue + proj_cov = (n + i -1. 
)/(n + i)*proj_cov + b + continue + else: + c_inv = (n + i - 1.)/float(n + i - 2.)*all_C_inv[i-1] + BC1 = np.matmul(B[i-1],c_inv) + all_C_inv[i] = c_inv - 1./(1.+np.trace(BC1))*np.matmul(c_inv,BC1) + + # Updates + self.mean = new_means[-1] + self.C = cov_batch + stdev = np.sqrt(np.diag(cov_batch)) + self.n += nb + if self.n > self.start_clip: + self.clip = [self.mean-self.n_stdev*stdev,self.mean+self.n_stdev*stdev] + + # Outlier scores and predictions + x_diff = proj_x-proj_means + self.score = np.matmul(x_diff[:,None,:],np.matmul(all_C_inv,x_diff[:,:,None])).reshape(nb) + self.prediction = np.array([1 if s > self.threshold else 0 for s in self.score]).astype(int) + + return self.prediction + + + def send_feedback(self,X,feature_names,reward,truth): + """ Return additional data as part of the feedback loop. + + Parameters + ---------- + X : array of the features sent in the original predict request + feature_names : array of feature names. May be None if not available. + reward (float): the reward + truth : array with correct value (optional) + """ + logger.info("Send feedback called") + return [] + + + def tags(self): + """ + Use predictions made within transform to add these as metadata + to the response. Tags will only be collected if the component is + used as an input-transformer. + """ + try: + return {"outlier-predictions": self.prediction_meta.tolist()} + except AttributeError: + logger.info("No metadata about outliers") + + + def metrics(self): + """ Return custom metrics averaged over the prediction batch. 
+ """ + self.nb_outliers += np.sum(self.prediction) + + is_outlier = {"type":"GAUGE","key":"is_outlier","value":np.mean(self.prediction)} + outlier_score = {"type":"GAUGE","key":"outlier_score","value":np.mean(self.score)} + nb_outliers = {"type":"GAUGE","key":"nb_outliers","value":int(self.nb_outliers)} + fraction_outliers = {"type":"GAUGE","key":"fraction_outliers","value":int(self.nb_outliers)/self.n} + obs = {"type":"GAUGE","key":"observation","value":self.n} + threshold = {"type":"GAUGE","key":"threshold","value":self.threshold} + + return [is_outlier,outlier_score,nb_outliers,fraction_outliers,obs,threshold] \ No newline at end of file diff --git a/components/outlier-detection/mahalanobis/OutlierMahalanobis.py b/components/outlier-detection/mahalanobis/OutlierMahalanobis.py index 02da5147a4..916440289e 100644 --- a/components/outlier-detection/mahalanobis/OutlierMahalanobis.py +++ b/components/outlier-detection/mahalanobis/OutlierMahalanobis.py @@ -1,35 +1,28 @@ import numpy as np -from scipy.linalg import eigh +from CoreMahalanobis import CoreMahalanobis from utils import flatten, performance, outlier_stats -class OutlierMahalanobis(object): +class OutlierMahalanobis(CoreMahalanobis): """ Outlier detection using the Mahalanobis distance. 
- Arguments: - - threshold: (float): Mahalanobis distance threshold used to classify outliers - - n_components (int): number of principal components used - - n_stdev (float): stdev used for feature-wise clipping of observations - - start_clip (int): number of observations before clipping is applied - - max_n (int): algorithm behaves as if it has seen at most max_n points + Parameters + ---------- + threshold (float) : Mahalanobis distance threshold used to classify outliers + n_components (int) : number of principal components used + n_stdev (float) : stdev used for feature-wise clipping of observations + start_clip (int) : number of observations before clipping is applied + max_n (int) : algorithm behaves as if it has seen at most max_n points - Functions: - - predict: detect and return outliers - - send_feedback: add target labels as part of the feedback loop - - metrics: return custom metrics + Functions + ---------- + send_feedback : add target labels as part of the feedback loop + metrics : return custom metrics """ def __init__(self,threshold=25,n_components=3,n_stdev=3,start_clip=50,max_n=-1): - self.threshold = threshold - self.n_components = n_components - self.max_n = max_n - self.n_stdev = n_stdev - self.start_clip = start_clip - - self.clip = None - self.mean = 0 - self.C = 0 - self.n = 0 + super().__init__(threshold=threshold,n_components=n_components,n_stdev=n_stdev, + start_clip=start_clip,max_n=max_n) self._predictions = [] self._labels = [] @@ -37,111 +30,31 @@ def __init__(self,threshold=25,n_components=3,n_stdev=3,start_clip=50,max_n=-1): self.roll_window = 100 self.metric = [float('nan') for i in range(18)] - def predict(self,X,feature_names): - """ Detect outliers using the Mahalanobis distance threshold. + + def send_feedback(self,X,feature_names,reward,truth): + """ Return outlier labels as part of the feedback loop. 
- Arguments: - - X: input data - - feature_names + Parameters + ---------- + X : array of the features sent in the original predict request + feature_names : array of feature names. May be None if not available. + reward (float): the reward + truth : array with correct value (optional) """ - - nb = X.shape[0] # batch size - p = X.shape[1] # number of features - n_components = min(self.n_components,p) - if self.max_n>0: - n = min(self.n,self.max_n) # n can never be above max_n - else: - n = self.n + _ = super().send_feedback(X,feature_names,reward,truth) - # Clip X - if self.n > self.start_clip: - Xclip = np.clip(X,self.clip[0],self.clip[1]) - else: - Xclip = X - - # Tracking the mean and covariance matrix - roll_partial_means = Xclip.cumsum(axis=0)/(np.arange(nb)+1).reshape((nb,1)) - coefs = (np.arange(nb)+1.)/(np.arange(nb)+n+1.) - new_means = self.mean + coefs.reshape((nb,1))*(roll_partial_means-self.mean) - new_means_offset = np.empty_like(new_means) - new_means_offset[0] = self.mean - new_means_offset[1:] = new_means[:-1] - - coefs = ((n+np.arange(nb))/(n+np.arange(nb)+1.)).reshape((nb,1,1)) - B = coefs*np.matmul((Xclip - new_means_offset)[:,:,None],(Xclip - new_means_offset)[:,None,:]) - cov_batch = (n-1.)/(n+max(1,nb-1.))*self.C + 1./(n+max(1,nb-1.))*B.sum(axis=0) - - # PCA - eigvals, eigvects = eigh(cov_batch,eigvals=(p-n_components,p-1)) - - # Projections - proj_x = np.matmul(X,eigvects) - proj_x_clip = np.matmul(Xclip,eigvects) - proj_means = np.matmul(new_means_offset,eigvects) - if type(self.C) == int and self.C == 0: - proj_cov = np.diag(np.zeros(n_components)) - else: - proj_cov = np.matmul(eigvects.transpose(),np.matmul(self.C,eigvects)) - - # Outlier detection in the PC subspace - coefs = (1./(n+np.arange(nb)+1.)).reshape((nb,1,1)) - B = coefs*np.matmul((proj_x_clip - proj_means)[:,:,None],(proj_x_clip - proj_means)[:,None,:]) - - all_C_inv = np.zeros_like(B) - c_inv = None - _EPSILON = 1e-8 - - for i, b in enumerate(B): - if c_inv is None: - if 
abs(np.linalg.det(proj_cov)) > _EPSILON: - c_inv = np.linalg.inv(proj_cov) - all_C_inv[i] = c_inv - continue - else: - if n + i == 0: - continue - proj_cov = (n + i -1. )/(n + i)*proj_cov + b - continue - else: - c_inv = (n + i - 1.)/float(n + i - 2.)*all_C_inv[i-1] - BC1 = np.matmul(B[i-1],c_inv) - all_C_inv[i] = c_inv - 1./(1.+np.trace(BC1))*np.matmul(c_inv,BC1) - - # Updates - self.mean = new_means[-1] - self.C = cov_batch - stdev = np.sqrt(np.diag(cov_batch)) - self.n += nb - if self.n > self.start_clip: - self.clip = [self.mean-self.n_stdev*stdev,self.mean+self.n_stdev*stdev] - - # Outlier scores and predictions - x_diff = proj_x-proj_means - self.score = np.matmul(x_diff[:,None,:],np.matmul(all_C_inv,x_diff[:,:,None])).reshape(nb) - self.prediction = np.array([1 if s > self.threshold else 0 for s in self.score]).astype(int) - - # update outlier scores and prediction list + # historical reconstruction errors and predictions self._scores.append(self.score) self._scores = flatten(self._scores) self._predictions.append(self.prediction) self._predictions = flatten(self._predictions) - - return self.prediction - - - def send_feedback(self,X,feature_names,reward,truth): - """ Return outlier labels as part of the feedback loop. 
- Arguments: - - X: input data - - feature_names - - reward - - truth: outlier labels - """ + # target labels self.label = truth self._labels.append(self.label) self._labels = flatten(self._labels) + # performance metrics scores = performance(self._labels,self._predictions,roll_window=self.roll_window) stats = outlier_stats(self._labels,self._predictions,roll_window=self.roll_window) @@ -150,8 +63,8 @@ def send_feedback(self,X,feature_names,reward,truth): for c in convert: # convert from np to native python type to jsonify metric.append(np.asscalar(np.asarray(c))) self.metric = metric - - return + + return [] def metrics(self): @@ -167,8 +80,8 @@ def metrics(self): err = float('nan') y_true = float('nan') else: - pred = int(self._predictions[-2]) - err = self._scores[-2] + pred = int(self._predictions[-1]) + err = self._scores[-1] y_true = int(self.label[0]) is_outlier = {"type":"GAUGE","key":"is_outlier","value":pred} diff --git a/components/outlier-detection/mahalanobis/README.md b/components/outlier-detection/mahalanobis/README.md index 08cf9afa6c..bf95332b2c 100644 --- a/components/outlier-detection/mahalanobis/README.md +++ b/components/outlier-detection/mahalanobis/README.md @@ -8,8 +8,15 @@ The Mahalanobis online outlier detector aims to predict anomalies in tabular dat ## Implementation -The algorithm is implemented in the ```OutlierMahalanobis``` class and a detailed explanation of the implementation and usage of the algorithm to spot anomalies can be found in the [outlier_mahalanobis_doc](./outlier_mahalanobis_doc.ipynb) notebook. +The algorithm is implemented in the ```CoreMahalanobis``` class and a detailed explanation of the implementation and usage of the algorithm to spot anomalies can be found in the [mahalanobis doc](./doc.ipynb). ## Running on Seldon -An end-to-end example running a Mahalanobis outlier detector on GCP or Minikube using Seldon to identify computer network intrusions is available [here](./outlier_mahalanobis.ipynb). 
\ No newline at end of file +An end-to-end example running a Mahalanobis outlier detector on GCP or Minikube using Seldon to identify computer network intrusions is available [here](./outlier_mahalanobis.ipynb). + +Docker images to use the generic Mahalanobis outlier detector as a model or transformer can be found on Docker Hub: +* [seldonio/outlier-mahalanobis-model](https://hub.docker.com/r/seldonio/outlier-mahalanobis-model) +* [seldonio/outlier-mahalanobis-transformer](https://hub.docker.com/r/seldonio/outlier-mahalanobis-transformer) + +A model docker image specific for the demo is also available: +* [seldonio/outlier-mahalanobis-model-demo](https://hub.docker.com/r/seldonio/outlier-mahalanobis-model-demo) \ No newline at end of file diff --git a/components/outlier-detection/mahalanobis/outlier_mahalanobis_doc.ipynb b/components/outlier-detection/mahalanobis/doc.ipynb similarity index 94% rename from components/outlier-detection/mahalanobis/outlier_mahalanobis_doc.ipynb rename to components/outlier-detection/mahalanobis/doc.ipynb index b5813ec135..76c608a5c6 100644 --- a/components/outlier-detection/mahalanobis/outlier_mahalanobis_doc.ipynb +++ b/components/outlier-detection/mahalanobis/doc.ipynb @@ -71,7 +71,7 @@ "metadata": {}, "source": [ "```python\n", - "class OutlierMahalanobis(object):\n", + "class CoreMahalanobis(object):\n", " def __init__(self,threshold=25,n_components=3,n_stdev=3,start_clip=50,max_n=-1):\n", " \n", " self.threshold = threshold\n", @@ -114,12 +114,12 @@ "metadata": {}, "source": [ "```python\n", - "def predict(self,X,feature_names):\n", - " \"\"\" Detect outliers using the Mahalanobis distance threshold. \n", + "def _get_preds(self,X):\n", + " \"\"\" Detect outliers using the Mahalanobis distance threshold. 
\n", "\n", - " Arguments:\n", - " - X: input data\n", - " - feature_names\n", + " Parameters\n", + " ----------\n", + " X : array-like\n", " \"\"\"\n", "\n", " nb = X.shape[0] # batch size\n", @@ -205,24 +205,6 @@ "## Second Step: PCA and projection" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - " # PCA\n", - " eigvals, eigvects = eigh(cov_batch,eigvals=(p-n_components,p-1))\n", - "\n", - " # Projections\n", - " proj_features = np.matmul(features,eigvects)\n", - " proj_means = np.matmul(new_means_offset,eigvects)\n", - " if type(self.C) == int and self.C == 0:\n", - " proj_cov = np.diag(np.zeros(n_components))\n", - " else:\n", - " proj_cov = np.matmul(eigvects.transpose(),np.matmul(self.C,eigvects))\n", - "```" - ] - }, { "cell_type": "markdown", "metadata": {}, diff --git a/components/outlier-detection/mahalanobis/outlier_mahalanobis.ipynb b/components/outlier-detection/mahalanobis/outlier_mahalanobis.ipynb index e3b5ef3cff..ee5b7fdcd2 100644 --- a/components/outlier-detection/mahalanobis/outlier_mahalanobis.ipynb +++ b/components/outlier-detection/mahalanobis/outlier_mahalanobis.ipynb @@ -69,6 +69,42 @@ "## Test using Kubernetes cluster on GCP or Minikube" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the outlier detector as a model or a transformer. 
If you want to run the anomaly detector as a transformer, change the SERVICE_TYPE variable from MODEL to TRANSFORMER [here](./.s2i/environment), set MODEL = False and change ```OutlierMahalanobis.py``` to:\n", + "\n", + "```python\n", + "from CoreMahalanobis import CoreMahalanobis\n", + "\n", + "class OutlierMahalanobis(CoreMahalanobis):\n", + " \"\"\" Outlier detection using the Mahalanobis distance.\n", + " \n", + " Parameters\n", + " ----------\n", + " threshold (float) : Mahalanobis distance threshold used to classify outliers\n", + " n_components (int) : number of principal components used\n", + " n_stdev (float) : stdev used for feature-wise clipping of observations\n", + " start_clip (int) : number of observations before clipping is applied\n", + " max_n (int) : algorithm behaves as if it has seen at most max_n points\n", + " \"\"\"\n", + " def __init__(self,threshold=25,n_components=3,n_stdev=3,start_clip=50,max_n=-1):\n", + " \n", + " super().__init__(threshold=threshold,n_components=n_components,n_stdev=n_stdev,\n", + " start_clip=start_clip,max_n=max_n)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MODEL = True" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -82,7 +118,7 @@ "metadata": {}, "outputs": [], "source": [ - "minikube = True" + "MINIKUBE = True" ] }, { @@ -93,7 +129,7 @@ }, "outputs": [], "source": [ - "if minikube:\n", + "if MINIKUBE:\n", " !minikube start --memory 4096 --feature-gates=CustomResourceValidation=true \\\n", " --extra-config=apiserver.Authorization.Mode=RBAC\n", "else:\n", @@ -224,7 +260,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If Minikube used: create docker image for outlier detector inside Minikube using s2i." + "If Minikube used: create docker image for outlier detector inside Minikube using s2i. 
Besides the transformer image and the demo specific model image, the general model image for the Mahalanobis outlier detector is also available from Docker Hub as ***seldonio/outlier-mahalanobis-model:0.1***." ] }, { @@ -235,15 +271,19 @@ }, "outputs": [], "source": [ - "if minikube:\n", - " !eval $(minikube docker-env) && s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-mahalanobis:0.1" + "if MINIKUBE & MODEL:\n", + " !eval $(minikube docker-env) && \\\n", + " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-mahalanobis-model-demo:0.1\n", + "elif MINIKUBE:\n", + " !eval $(minikube docker-env) && \\\n", + " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-mahalanobis-transformer:0.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Install outlier detector helm charts and set *threshold*, *n_components*, *n_stdev* and *start_clip* hyperparameter values." + "Install outlier detector helm charts either as a model or transformer and set *threshold*, *n_components*, *n_stdev* and *start_clip* hyperparameter values." 
] }, { @@ -252,15 +292,34 @@ "metadata": {}, "outputs": [], "source": [ - "!helm install ../../../helm-charts/seldon-od-md \\\n", - " --set model.image.name=seldonio/outlier-mahalanobis:0.1 \\\n", - " --set model.threshold=25 \\\n", - " --set model.n_components=3 \\\n", - " --set model.n_stdev=3 \\\n", - " --set model.start_clip=50 \\\n", - " --name outlier-detector --set oauth.key=oauth-key \\\n", - " --set oauth.secret=oauth-secret \\\n", - " --namespace=seldon" + "if MODEL:\n", + " !helm install ../../../helm-charts/seldon-od-model \\\n", + " --name outlier-detector \\\n", + " --namespace=seldon \\\n", + " --set model.type=mahalanobis \\\n", + " --set model.mahalanobis.image.name=seldonio/outlier-mahalanobis-model-demo:0.1 \\\n", + " --set model.mahalanobis.threshold=25 \\\n", + " --set model.mahalanobis.n_components=3 \\\n", + " --set model.mahalanobis.n_stdev=3 \\\n", + " --set model.mahalanobis.start_clip=50 \\\n", + " --set oauth.key=oauth-key \\\n", + " --set oauth.secret=oauth-secret \\\n", + " --set replicas=1\n", + "else:\n", + " !helm install ../../../helm-charts/seldon-od-transformer \\\n", + " --name outlier-detector \\\n", + " --namespace=seldon \\\n", + " --set outlierDetection.enabled=true \\\n", + " --set outlierDetection.name=outlier-mahalanobis \\\n", + " --set outlierDetection.type=mahalanobis \\\n", + " --set outlierDetection.mahalanobis.image.name=seldonio/outlier-mahalanobis-transformer:0.1 \\\n", + " --set outlierDetection.mahalanobis.threshold=25 \\\n", + " --set outlierDetection.mahalanobis.n_components=3 \\\n", + " --set outlierDetection.mahalanobis.n_stdev=3 \\\n", + " --set outlierDetection.mahalanobis.start_clip=50 \\\n", + " --set oauth.key=oauth-key \\\n", + " --set oauth.secret=oauth-secret \\\n", + " --set model.image.name=seldonio/mock_classifier:1.0" ] }, { @@ -349,13 +408,21 @@ "response = rest_request_ambassador(\"outlier-detector\",request,endpoint=\"localhost:8003\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "If the outlier detector is used as a transformer, the output of the anomaly detection is added as part of the metadata. If it is used as a model, we send model feedback to retrieve custom performance metrics." + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")" + "if MODEL:\n", + " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")" ] }, { @@ -431,11 +498,11 @@ "- Sample random network intrusion data with a certain outlier probability.\n", "- Get payload for the observation.\n", "- Make a prediction.\n", - "- Send the \"true\" label with the feedback.\n", + "- Send the \"true\" label with the feedback if the detector is run as a model.\n", "\n", "It is important that the prediction-feedback order is maintained. Otherwise there will be a mismatch between the predicted and \"true\" labels.\n", "\n", - "View the progress on the grafana \"Outlier Detection\" dashboard." + "View the progress on the grafana \"Outlier Detection\" dashboard. Most metrics need the outlier detector to be run as a model since they need model feedback." 
] }, { @@ -454,7 +521,8 @@ " X, labels = generate_batch(data,samples,fraction_outlier)\n", " request = get_payload(X)\n", " response = rest_request_ambassador(\"outlier-detector\",request,endpoint=\"localhost:8003\")\n", - " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")\n", + " if MODEL:\n", + " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")\n", " time.sleep(1)" ] }, @@ -464,7 +532,7 @@ "metadata": {}, "outputs": [], "source": [ - "if minikube:\n", + "if MINIKUBE:\n", " !minikube delete" ] }, diff --git a/components/outlier-detection/seq2seq-lstm/CoreSeq2SeqLSTM.py b/components/outlier-detection/seq2seq-lstm/CoreSeq2SeqLSTM.py new file mode 100644 index 0000000000..32036198c5 --- /dev/null +++ b/components/outlier-detection/seq2seq-lstm/CoreSeq2SeqLSTM.py @@ -0,0 +1,215 @@ +import logging +import numpy as np +import pickle +import random + +from model import model + +logger = logging.getLogger(__name__) + +class CoreSeq2SeqLSTM(object): + """ Outlier detection using a sequence-to-sequence (seq2seq) LSTM model. 
+ + Parameters + ---------- + threshold (float): reconstruction error (mse) threshold used to classify outliers + reservoir_size (int) : number of observations kept in memory using reservoir sampling + + Functions + ---------- + reservoir_sampling : applies reservoir sampling to incoming data + predict : detect and return outliers + transform_input : detect outliers and return input features + send_feedback : add target labels as part of the feedback loop + tags : add metadata for input transformer + metrics : return custom metrics + """ + + def __init__(self,threshold=0.003,reservoir_size=50000,model_name='seq2seq',load_path='./models/'): + + logger.info("Initializing model") + self.threshold = threshold + self.reservoir_size = reservoir_size + self.batch = [] + self.N = 0 # total sample count up until now for reservoir sampling + self.nb_outliers = 0 + + # load model architecture parameters + with open(load_path + model_name + '.pickle', 'rb') as f: + self.timesteps, self.n_features, encoder_dim, decoder_dim, output_activation = pickle.load(f) + + # instantiate model + self.s2s, self.enc, self.dec = model(self.n_features,encoder_dim=encoder_dim, + decoder_dim=decoder_dim,output_activation=output_activation) + self.s2s.load_weights(load_path + model_name + '_weights.h5') # load pretrained model weights + self.s2s._make_predict_function() + self.enc._make_predict_function() + self.dec._make_predict_function() + + # load data preprocessing info + with open(load_path + 'preprocess_' + model_name + '.pickle', 'rb') as f: + preprocess = pickle.load(f) + self.preprocess, self.clip, self.axis = preprocess[:3] + if self.preprocess=='minmax': + self.xmin, self.xmax = preprocess[3:5] + self.min, self.max = preprocess[5:] + elif self.preprocess=='standardized': + self.mu, self.sigma = preprocess[3:] + + + def reservoir_sampling(self,X,update_stand=False): + """ Keep batch of data in memory using reservoir sampling. 
""" + for item in X: + self.N+=1 + if len(self.batch) < self.reservoir_size: + self.batch.append(item) + else: + s = int(random.random() * self.N) + if s < self.reservoir_size: + self.batch[s] = item + + if update_stand: + if self.preprocess=='minmax': + self.xmin = np.array(self.batch).min(axis=self.axis) + self.xmax = np.array(self.batch).max(axis=self.axis) + elif self.preprocess=='standardized': + self.mu = np.array(self.batch).mean(axis=self.axis) + self.sigma = np.array(self.batch).std(axis=self.axis) + return + + + def predict(self, X, feature_names): + """ Return outlier predictions. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as a model") + return self._get_preds(X) + + + def transform_input(self, X, feature_names): + """ Transform the input. + Used when the outlier detector sits on top of another model. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as an outlier-detector transformer") + self.prediction_meta = self._get_preds(X) + return X + + + def decode_sequence(self,input_seq): + """ Feed output of encoder to decoder and make sequential predictions. 
""" + + # use encoder the get state vectors + states_value = self.enc.predict(input_seq) + + # generate initial target sequence + target_seq = input_seq[0,0,:].reshape((1,1,self.n_features)) + + # sequential prediction of time series + decoded_seq = np.zeros((1, self.timesteps, self.n_features)) + decoded_seq[0,0,:] = target_seq[0,0,:] + i = 1 + while i < self.timesteps: + + decoder_output = self.dec.predict([target_seq] + states_value) + + # update the target sequence + target_seq = np.zeros((1, 1, self.n_features)) + target_seq[0, 0, :] = decoder_output[0] + + # update output + decoded_seq[0, i, :] = decoder_output[0] + + # update states + states_value = decoder_output[1:] + + i+=1 + + return decoded_seq + + + def _get_preds(self,X): + """ Detect outliers if the reconstruction error is above the threshold. + + Parameters + ---------- + X : array-like + """ + + # clip data per feature + for col,clip in enumerate(self.clip): + X[:,:,col] = np.clip(X[:,:,col],-clip,clip) + + # update reservoir + if self.N < self.reservoir_size: + update_stand = False + else: + update_stand = True + + self.reservoir_sampling(X,update_stand=update_stand) + + # apply scaling + if self.preprocess=='minmax': + X = ((X - self.xmin) / (self.xmax - self.xmin)) * (self.max - self.min) + self.min + elif self.preprocess=='standardized': + X = (X - self.mu) / (self.sigma + 1e-10) + + # make predictions + n_obs = X.shape[0] + self.mse = np.zeros(n_obs) + for obs in range(n_obs): + input_seq = X[obs:obs+1,:,:] + decoded_seq = self.decode_sequence(input_seq) + self.mse[obs] = np.mean(np.power(input_seq[0,:,:] - decoded_seq[0,:,:], 2)) + self.prediction = np.array([1 if e > self.threshold else 0 for e in self.mse]).astype(int) + + return self.prediction + + + def send_feedback(self,X,feature_names,reward,truth): + """ Return additional data as part of the feedback loop. 
+ + Parameters + ---------- + X : array of the features sent in the original predict request + feature_names : array of feature names. May be None if not available. + reward (float): the reward + truth : array with correct value (optional) + """ + logger.info("Send feedback called") + return [] + + + def tags(self): + """ + Use predictions made within transform to add these as metadata + to the response. Tags will only be collected if the component is + used as an input-transformer. + """ + try: + return {"outlier-predictions": self.prediction_meta.tolist()} + except AttributeError: + logger.info("No metadata about outliers") + + + def metrics(self): + """ Return custom metrics averaged over the prediction batch. + """ + self.nb_outliers += np.sum(self.prediction) + + is_outlier = {"type":"GAUGE","key":"is_outlier","value":np.mean(self.prediction)} + mse = {"type":"GAUGE","key":"mse","value":np.mean(self.mse)} + nb_outliers = {"type":"GAUGE","key":"nb_outliers","value":int(self.nb_outliers)} + fraction_outliers = {"type":"GAUGE","key":"fraction_outliers","value":int(self.nb_outliers)/self.N} + obs = {"type":"GAUGE","key":"observation","value":self.N} + threshold = {"type":"GAUGE","key":"threshold","value":self.threshold} + + return [is_outlier,mse,nb_outliers,fraction_outliers,obs,threshold] \ No newline at end of file diff --git a/components/outlier-detection/seq2seq-lstm/OutlierSeq2SeqLSTM.py b/components/outlier-detection/seq2seq-lstm/OutlierSeq2SeqLSTM.py index 106232a0f7..6dd72afe2d 100644 --- a/components/outlier-detection/seq2seq-lstm/OutlierSeq2SeqLSTM.py +++ b/components/outlier-detection/seq2seq-lstm/OutlierSeq2SeqLSTM.py @@ -1,168 +1,57 @@ import numpy as np -import pickle -import random -from model import model + +from CoreSeq2SeqLSTM import CoreSeq2SeqLSTM from utils import flatten, performance, outlier_stats -class OutlierSeq2SeqLSTM(object): +class OutlierSeq2SeqLSTM(CoreSeq2SeqLSTM): """ Outlier detection using a sequence-to-sequence (seq2seq) LSTM 
model. - Arguments: - - threshold: (float): reconstruction error (mse) threshold used to classify outliers - - reservoir_size (int): number of observations kept in memory using reservoir sampling + Parameters + ---------- + threshold (float) : reconstruction error (mse) threshold used to classify outliers + reservoir_size (int) : number of observations kept in memory using reservoir sampling - Functions: - - reservoir_sampling: applies reservoir sampling to incoming data - - predict: detect and return outliers - - send_feedback: add target labels as part of the feedback loop - - metrics: return custom metrics + Functions + ---------- + send_feedback : add target labels as part of the feedback loop + metrics : return custom metrics """ def __init__(self,threshold=0.003,reservoir_size=50000,model_name='seq2seq',load_path='./models/'): - self.threshold = threshold - self.reservoir_size = reservoir_size - self.batch = [] - self.N = 0 # total sample count up until now for reservoir sampling - - # load model architecture parameters - with open(load_path + model_name + '.pickle', 'rb') as f: - self.timesteps, self.n_features, encoder_dim, decoder_dim, output_activation = pickle.load(f) - - # instantiate model - self.s2s, self.enc, self.dec = model(self.n_features,encoder_dim=encoder_dim, - decoder_dim=decoder_dim,output_activation=output_activation) - self.s2s.load_weights(load_path + model_name + '_weights.h5') # load pretrained model weights - self.s2s._make_predict_function() - self.enc._make_predict_function() - self.dec._make_predict_function() - - # load data preprocessing info - with open(load_path + 'preprocess_' + model_name + '.pickle', 'rb') as f: - preprocess = pickle.load(f) - self.preprocess, self.clip, self.axis = preprocess[:3] - if self.preprocess=='minmax': - self.xmin, self.xmax = preprocess[3:5] - self.min, self.max = preprocess[5:] - elif self.preprocess=='standardized': - self.mu, self.sigma = preprocess[3:] + 
super().__init__(threshold=threshold,reservoir_size=reservoir_size, + model_name=model_name,load_path=load_path) self._predictions = [] self._labels = [] self._mse = [] self.roll_window = 100 self.metric = [float('nan') for i in range(18)] + - - def reservoir_sampling(self,X,update_stand=False): - """ Keep batch of data in memory using reservoir sampling. """ - for item in X: - self.N+=1 - if len(self.batch) < self.reservoir_size: - self.batch.append(item) - else: - s = int(random.random() * self.N) - if s < self.reservoir_size: - self.batch[s] = item - - if update_stand: - if self.preprocess=='minmax': - self.xmin = np.array(self.batch).min(axis=self.axis) - self.xmax = np.array(self.batch).max(axis=self.axis) - elif self.preprocess=='standardized': - self.mu = np.array(self.batch).mean(axis=self.axis) - self.sigma = np.array(self.batch).std(axis=self.axis) - return - - - def decode_sequence(self,input_seq): - """ Feed output of encoder to decoder and make sequential predictions. """ - - # use encoder the get state vectors - states_value = self.enc.predict(input_seq) - - # generate initial target sequence - target_seq = input_seq[0,0,:].reshape((1,1,self.n_features)) - - # sequential prediction of time series - decoded_seq = np.zeros((1, self.timesteps, self.n_features)) - decoded_seq[0,0,:] = target_seq[0,0,:] - i = 1 - while i < self.timesteps: - - decoder_output = self.dec.predict([target_seq] + states_value) - - # update the target sequence - target_seq = np.zeros((1, 1, self.n_features)) - target_seq[0, 0, :] = decoder_output[0] - - # update output - decoded_seq[0, i, :] = decoder_output[0] - - # update states - states_value = decoder_output[1:] - - i+=1 - - return decoded_seq - - - def predict(self,X,feature_names): - """ Detect outliers from mse using the threshold. + def send_feedback(self,X,feature_names,reward,truth): + """ Return outlier labels as part of the feedback loop. 
- Arguments: - - X: input data - - feature_names + Parameters + ---------- + X : array of the features sent in the original predict request + feature_names : array of feature names. May be None if not available. + reward (float): the reward + truth : array with correct value (optional) """ + _ = super().send_feedback(X,feature_names,reward,truth) - # clip data per feature - for col,clip in enumerate(self.clip): - X[:,:,col] = np.clip(X[:,:,col],-clip,clip) - - # update reservoir - if self.N < self.reservoir_size: - update_stand = False - else: - update_stand = True - - self.reservoir_sampling(X,update_stand=update_stand) - - # apply scaling - if self.preprocess=='minmax': - X = ((X - self.xmin) / (self.xmax - self.xmin)) * (self.max - self.min) + self.min - elif self.preprocess=='standardized': - X = (X - self.mu) / (self.sigma + 1e-10) - - # make predictions - n_obs = X.shape[0] - self.mse = np.zeros(n_obs) - for obs in range(n_obs): - input_seq = X[obs:obs+1,:,:] - decoded_seq = self.decode_sequence(input_seq) - self.mse[obs] = np.mean(np.power(input_seq[0,:,:] - decoded_seq[0,:,:], 2)) - self.prediction = np.array([1 if e > self.threshold else 0 for e in self.mse]).astype(int) - - # update mse and prediction list + # historical reconstruction errors and predictions self._mse.append(self.mse) self._mse = flatten(self._mse) self._predictions.append(self.prediction) self._predictions = flatten(self._predictions) - return self.prediction - - - def send_feedback(self,X,feature_names,reward,truth): - """ Return outlier labels as part of the feedback loop. 
- - Arguments: - - X: input data - - feature_names - - reward - - truth: outlier labels - """ + # target labels self.label = truth self._labels.append(self.label) self._labels = flatten(self._labels) + # performance metrics scores = performance(self._labels,self._predictions,roll_window=self.roll_window) stats = outlier_stats(self._labels,self._predictions,roll_window=self.roll_window) @@ -171,8 +60,8 @@ def send_feedback(self,X,feature_names,reward,truth): for c in convert: # convert from np to native python type to jsonify metric.append(np.asscalar(np.asarray(c))) self.metric = metric - - return + + return [] def metrics(self): @@ -188,8 +77,8 @@ def metrics(self): err = float('nan') y_true = float('nan') else: - pred = int(self._predictions[-2]) - err = self._mse[-2] + pred = int(self._predictions[-1]) + err = self._mse[-1] y_true = int(self.label[0]) is_outlier = {"type":"GAUGE","key":"is_outlier","value":pred} diff --git a/components/outlier-detection/seq2seq-lstm/README.md b/components/outlier-detection/seq2seq-lstm/README.md index f3e7ccb396..fa18b1d4d1 100644 --- a/components/outlier-detection/seq2seq-lstm/README.md +++ b/components/outlier-detection/seq2seq-lstm/README.md @@ -10,8 +10,15 @@ The implemented seq2seq outlier detector aims to predict anomalies in a sequence The architecture of the seq2seq model is defined in ```model.py``` and it is trained by running the ```train.py``` script. The ```OutlierSeq2SeqLSTM``` class loads a pre-trained model and makes predictions on new data. -A detailed explanation of the implementation and usage of the seq2seq model as an outlier detector can be found in the [seq2seq_lstm_doc](./seq2seq_lstm_doc.ipynb) notebook. +A detailed explanation of the implementation and usage of the seq2seq model as an outlier detector can be found in the [seq2seq documentation](./doc.md). 
## Running on Seldon -An end-to-end example running a seq2seq outlier detector on GCP or Minikube using Seldon to identify anomalies in ECGs is available [here](./seq2seq_lstm.ipynb). \ No newline at end of file +An end-to-end example running a seq2seq outlier detector on GCP or Minikube using Seldon to identify anomalies in ECGs is available [here](./seq2seq_lstm.ipynb). + +Docker images to use the generic seq2seq-LSTM outlier detector as a model or transformer can be found on Docker Hub: +* [seldonio/outlier-s2s-lstm-model](https://hub.docker.com/r/seldonio/outlier-s2s-lstm-model) +* [seldonio/outlier-s2s-lstm-transformer](https://hub.docker.com/r/seldonio/outlier-s2s-lstm-transformer) + +A model docker image specific for the demo is also available: +* [seldonio/outlier-s2s-lstm-model-demo](https://hub.docker.com/r/seldonio/outlier-s2s-lstm-model-demo) \ No newline at end of file diff --git a/components/outlier-detection/seq2seq-lstm/doc.md b/components/outlier-detection/seq2seq-lstm/doc.md new file mode 100644 index 0000000000..d1911d311f --- /dev/null +++ b/components/outlier-detection/seq2seq-lstm/doc.md @@ -0,0 +1,336 @@ +# Sequence-to-Sequence LSTM (seq2seq-LSTM) Outlier Algorithm Documentation + +The aim of this document is to explain the seq2seq-LSTM algorithm in Seldon's outlier detection framework. + +First, we provide a high-level overview of the algorithm and the use case, then we give a detailed explanation of the implementation. + +## Overview + +Outlier detection has many applications, ranging from preventing credit card fraud to detecting computer network intrusions. The available data is typically unlabeled and detection needs to be done in real-time. The outlier detector can be used as a standalone algorithm, or to detect anomalies in the input data of another predictive model. 
+ +The seq2seq-LSTM outlier detection algorithm is suitable for time series data and predicts whether a sequence of input features is an outlier or not, depending on a threshold level set by the user. The algorithm first needs to be pretrained on a batch of, preferably, inliers. + +As observations arrive, the algorithm will: +- clip and scale the input features +- first encode, and then sequentially decode the input time series data in an attempt to reconstruct the initial observations +- compute a reconstruction error between the output of the decoder and the input data +- predict that the observation is an outlier if the error is larger than the threshold level + +## Why Sequence-to-Sequence Models? + +Seq2seq models convert sequences from one domain into sequences in another domain. A typical example would be sentence translation between different languages. A seq2seq model consists of two main building blocks: an encoder and a decoder. The encoder processes the input sequence and initializes the decoder. The decoder then makes sequential predictions for the output sequence. In our case, the decoder aims to reconstruct the input sequence. Both the encoder and decoder are typically implemented with recurrent or 1D convolutional neural networks. Our implementation uses a type of recurrent neural network called an LSTM network. An excellent explanation of how LSTM units work is available [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/). The loss function to be minimized with stochastic gradient descent is the mean squared error between the input and output sequences, and is called the reconstruction error. + +If we train the seq2seq model with inliers, it will be able to replicate new inlier data well with a low reconstruction error. However, if outliers are fed to the seq2seq model, the reconstruction error becomes large and we can classify the sequence as an anomaly. 
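
The reconstruction-error idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `mse`, `detect_outliers` and `mean_reconstruct` are hypothetical names that do not appear in the repository, and `mean_reconstruct` is a toy stand-in for the trained encoder/decoder pair. Clipping and scaling are left out; only the error and threshold steps are shown.

```python
def mse(seq_a, seq_b):
    """Mean squared error over all timesteps and features of two sequences."""
    sq_errs = [(a - b) ** 2
               for step_a, step_b in zip(seq_a, seq_b)
               for a, b in zip(step_a, step_b)]
    return sum(sq_errs) / len(sq_errs)


def detect_outliers(sequences, reconstruct, threshold):
    """Flag each sequence whose reconstruction error exceeds the threshold.

    `reconstruct` is a hypothetical stand-in for the trained seq2seq model.
    """
    errors = [mse(seq, reconstruct(seq)) for seq in sequences]
    flags = [1 if err > threshold else 0 for err in errors]
    return flags, errors


def mean_reconstruct(seq):
    """Toy 'model': rebuild every timestep as the per-feature mean of the sequence."""
    n_steps = len(seq)
    means = [sum(step[j] for step in seq) / n_steps for j in range(len(seq[0]))]
    return [list(means) for _ in seq]


# Each sequence is a list of timesteps, each timestep a list of feature values.
flat = [[0.5], [0.5], [0.5], [0.5]]    # constant series, reconstructed exactly
spiky = [[0.0], [0.0], [1.0], [0.0]]   # series with a spike, reconstructed poorly

flags, errors = detect_outliers([flat, spiky], mean_reconstruct, threshold=0.003)
print(flags, errors)
```

With the toy model, the constant series has zero reconstruction error and is treated as an inlier, while the spiky series exceeds the threshold and is flagged. The threshold value of 0.003 mirrors the default used by the detector class, but a suitable value depends on the data.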
+ +## Implementation + +The implementation is inspired by [this blog post](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html). + +### 1. Building the seq2seq-LSTM Model + +The seq2seq model definition in ```model.py``` takes 4 arguments that define the architecture: +- the number of features in the input +- a list with the number of units per [bidirectional](https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks) LSTM layer in the encoder +- a list with the number of units per LSTM layer in the decoder +- the output activation type for the dense output layer on top of the last LSTM unit in the decoder + +``` python +def model(n_features, encoder_dim = [20], decoder_dim = [20], dropout=0., learning_rate=.001, + loss='mean_squared_error', output_activation='sigmoid'): + """ Build seq2seq model. + + Arguments: + - n_features (int): number of features in the data + - encoder_dim (list): list with number of units per encoder layer + - decoder_dim (list): list with number of units per decoder layer + - dropout (float): dropout for LSTM units + - learning_rate (float): learning rate used during training + - loss (str): loss function used + - output_activation (str): activation type for the dense output layer in the decoder + """ +``` + +First, we define the bidirectional LSTM layers in the encoder and keep the state of the last LSTM unit to initialise the decoder: + +```python +# add encoder hidden layers +encoder_lstm = [] +for i in range(enc_dim-1): + encoder_lstm.append(Bidirectional(LSTM(encoder_dim[i], dropout=dropout, + return_sequences=True,name='encoder_lstm_' + str(i)))) + encoder_hidden = encoder_lstm[i](encoder_hidden) + +encoder_lstm.append(Bidirectional(LSTM(encoder_dim[-1], dropout=dropout, return_state=True, + name='encoder_lstm_' + str(enc_dim-1)))) +encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder_lstm[-1](encoder_hidden) + +# only need to keep encoder states +state_h = 
Concatenate()([forward_h, backward_h]) +state_c = Concatenate()([forward_c, backward_c]) +encoder_states = [state_h, state_c] +``` + +We can then define the LSTM units in the decoder, with the states initialised by the encoder: + +```python +# initialise decoder states with encoder states +decoder_lstm = [] +for i in range(dec_dim): + decoder_lstm.append(LSTM(decoder_dim[i], dropout=dropout, return_sequences=True, + return_state=True, name='decoder_lstm_' + str(i))) + decoder_hidden, _, _ = decoder_lstm[i](decoder_hidden, initial_state=encoder_states) +``` + +We add a dense layer with output activation of choice on top of the last LSTM layer in the decoder and compile the model: + +```python +# add linear layer on top of LSTM +decoder_dense = Dense(n_features, activation=output_activation, name='dense_output') +decoder_outputs = decoder_dense(decoder_hidden) + +# define seq2seq model +model = Model([encoder_inputs, decoder_inputs], decoder_outputs) +optimizer = Adam(lr=learning_rate) +model.compile(optimizer=optimizer, loss=loss) +``` + +The decoder predictions are sequential and we only need the encoder states to initialise the decoder for the first item in the sequence. From then on, the output and state of the decoder at each step in the sequence is used to predict the next item. 
As a result, we define separate encoder and decoder models for the prediction stage: + +```python +# define encoder model returning encoder states +encoder_model = Model(encoder_inputs, encoder_states * dec_dim) + +# define decoder model +# need state inputs for each LSTM layer +decoder_states_inputs = [] +for i in range(dec_dim): + decoder_state_input_h = Input(shape=(decoder_dim[i],), name='decoder_state_input_h_' + str(i)) + decoder_state_input_c = Input(shape=(decoder_dim[i],), name='decoder_state_input_c_' + str(i)) + decoder_states_inputs.append([decoder_state_input_h, decoder_state_input_c]) +decoder_states_inputs = [state for states in decoder_states_inputs for state in states] + +decoder_inference = decoder_inputs +decoder_states = [] +for i in range(dec_dim): + decoder_inference, state_h, state_c = decoder_lstm[i](decoder_inference, + initial_state=decoder_states_inputs[2*i:2*i+2]) + decoder_states.append([state_h,state_c]) +decoder_states = [state for states in decoder_states for state in states] + +decoder_outputs = decoder_dense(decoder_inference) +decoder_model = Model([decoder_inputs] + decoder_states_inputs, + [decoder_outputs] + decoder_states) +``` + +### 2. Training the model + +The seq2seq-LSTM model can be trained on a batch of, ideally, inliers by running the ```train.py``` script with the desired hyperparameters. The example below trains the model on the first 2628 ECGs of the ECG5000 dataset. The input/output sequence has a length of 140, the encoder has 1 bidirectional LSTM layer with 20 units, and the decoder consists of 1 LSTM layer with 40 units. The number of decoder units has to be twice the number of units of the bidirectional encoder layer because the forward and backward encoder states are concatenated to initialise the decoder. Feature-wise minmax scaling between 0 and 1 is applied to the input sequence so we can use a sigmoid activation in the decoder's output layer. 
+ +```python +!python train.py \ +--dataset './data/ECG5000_TEST.arff' \ +--data_range 0 2627 \ +--minmax \ +--timesteps 140 \ +--encoder_dim 20 \ +--decoder_dim 40 \ +--output_activation 'sigmoid' \ +--dropout 0 \ +--learning_rate 0.005 \ +--loss 'mean_squared_error' \ +--epochs 100 \ +--batch_size 32 \ +--validation_split 0.2 \ +--model_name 'seq2seq' \ +--print_progress \ +--save \ +--save_path './models/' +``` + +The model weights and hyperparameters are saved in the folder specified by "save_path". + +### 3. Making predictions + +In order to make predictions, which can then be served by Seldon Core, the pre-trained model weights and hyperparameters are loaded when defining an OutlierSeq2SeqLSTM object. The "threshold" argument defines above which reconstruction error a sample is classified as an outlier. The threshold is a key hyperparameter and needs to be picked carefully for each application. The OutlierSeq2SeqLSTM class inherits from the CoreSeq2SeqLSTM class in ```CoreSeq2SeqLSTM.py```. + +```python +class CoreSeq2SeqLSTM(object): + """ Outlier detection using a sequence-to-sequence (seq2seq) LSTM model. 
+ + Parameters + ---------- + threshold (float): reconstruction error (mse) threshold used to classify outliers + reservoir_size (int) : number of observations kept in memory using reservoir sampling + + Functions + ---------- + reservoir_sampling : applies reservoir sampling to incoming data + predict : detect and return outliers + transform_input : detect outliers and return input features + send_feedback : add target labels as part of the feedback loop + tags : add metadata for input transformer + metrics : return custom metrics + """ + + def __init__(self,threshold=0.003,reservoir_size=50000,model_name='seq2seq',load_path='./models/'): + + logger.info("Initializing model") + self.threshold = threshold + self.reservoir_size = reservoir_size + self.batch = [] + self.N = 0 # total sample count up until now for reservoir sampling + self.nb_outliers = 0 + + # load model architecture parameters + with open(load_path + model_name + '.pickle', 'rb') as f: + self.timesteps, self.n_features, encoder_dim, decoder_dim, output_activation = pickle.load(f) + + # instantiate model + self.s2s, self.enc, self.dec = model(self.n_features,encoder_dim=encoder_dim, + decoder_dim=decoder_dim,output_activation=output_activation) + self.s2s.load_weights(load_path + model_name + '_weights.h5') # load pretrained model weights + self.s2s._make_predict_function() + self.enc._make_predict_function() + self.dec._make_predict_function() + + # load data preprocessing info + with open(load_path + 'preprocess_' + model_name + '.pickle', 'rb') as f: + preprocess = pickle.load(f) + self.preprocess, self.clip, self.axis = preprocess[:3] + if self.preprocess=='minmax': + self.xmin, self.xmax = preprocess[3:5] + self.min, self.max = preprocess[5:] + elif self.preprocess=='standardized': + self.mu, self.sigma = preprocess[3:] +``` + +```python +class OutlierSeq2SeqLSTM(CoreSeq2SeqLSTM): + """ Outlier detection using a sequence-to-sequence (seq2seq) LSTM model. 
+ + Parameters + ---------- + threshold (float) : reconstruction error (mse) threshold used to classify outliers + reservoir_size (int) : number of observations kept in memory using reservoir sampling + + Functions + ---------- + send_feedback : add target labels as part of the feedback loop + metrics : return custom metrics + """ + def __init__(self,threshold=0.003,reservoir_size=50000,model_name='seq2seq',load_path='./models/'): + + super().__init__(threshold=threshold,reservoir_size=reservoir_size, + model_name=model_name,load_path=load_path) +``` + +The actual outlier detection is done by the ```_get_preds``` method, which is invoked by ```predict``` or ```transform_input``` depending on whether the detector is defined as a model or a transformer, respectively. + +```python +def predict(self, X, feature_names): + """ Return outlier predictions. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as a model") + return self._get_preds(X) +``` + +```python +def transform_input(self, X, feature_names): + """ Transform the input. + Used when the outlier detector sits on top of another model. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as an outlier-detector transformer") + self.prediction_meta = self._get_preds(X) + return X +``` + +First, the data is (optionally) clipped. If the number of observations fed to the outlier detector so far is at least equal to the defined reservoir size, the feature-wise scaling parameters are updated using the observations in the reservoir. The reservoir is updated with each new observation using reservoir sampling. We can then scale the input data. 
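As a rough illustration of the reservoir step described above, the sketch below keeps a fixed-size uniform sample of a stream and derives feature-wise min/max values from it. The `Reservoir` class and its method names are hypothetical and not part of the Seldon code; the detector's own `reservoir_sampling` method may differ in detail.

```python
import random


class Reservoir:
    """Hypothetical helper: fixed-size uniform sample of a data stream."""

    def __init__(self, size, seed=None):
        self.size = size                   # maximum number of observations kept
        self.items = []                    # current sample
        self.n_seen = 0                    # total observations seen so far
        self._rng = random.Random(seed)

    def add(self, batch):
        """Feed a batch of observations through the reservoir."""
        for item in batch:
            self.n_seen += 1
            if len(self.items) < self.size:
                # reservoir not full yet: keep every observation
                self.items.append(item)
            else:
                # pick a slot uniformly from [0, n_seen); replace only if
                # it falls inside the reservoir
                slot = self._rng.randrange(self.n_seen)
                if slot < self.size:
                    self.items[slot] = item

    def is_full(self):
        """True once at least `size` observations have been seen."""
        return self.n_seen >= self.size

    def feature_minmax(self):
        """Feature-wise minimum and maximum over the current sample."""
        columns = list(zip(*self.items))
        return [min(c) for c in columns], [max(c) for c in columns]


# toy stream of 1000 observations with 2 features each
stream = [(float(i), float(-i)) for i in range(1000)]
reservoir = Reservoir(size=100, seed=0)
reservoir.add(stream)

# scaling parameters are only refreshed once the reservoir has filled up
if reservoir.is_full():
    xmin, xmax = reservoir.feature_minmax()
```

Replacing a slot with probability `size / n_seen` keeps every observation seen so far equally likely to be in the sample, so the scaling statistics can follow the stream without storing all of it. The detector's own preprocessing code follows.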
+ +```python +# clip data per feature +for col,clip in enumerate(self.clip): + X[:,:,col] = np.clip(X[:,:,col],-clip,clip) + +# update reservoir +if self.N < self.reservoir_size: + update_stand = False +else: + update_stand = True + +self.reservoir_sampling(X,update_stand=update_stand) + +# apply scaling +if self.preprocess=='minmax': + X = ((X - self.xmin) / (self.xmax - self.xmin)) * (self.max - self.min) + self.min +elif self.preprocess=='standardized': + X = (X - self.mu) / (self.sigma + 1e-10) +``` + +We then make predictions using the ```decode_sequence``` function and calculate the mean squared error between the input and output sequences. If this value is above the threshold, an outlier is predicted. + +```python +# make predictions +n_obs = X.shape[0] +self.mse = np.zeros(n_obs) +for obs in range(n_obs): + input_seq = X[obs:obs+1,:,:] + decoded_seq = self.decode_sequence(input_seq) + self.mse[obs] = np.mean(np.power(input_seq[0,:,:] - decoded_seq[0,:,:], 2)) +self.prediction = np.array([1 if e > self.threshold else 0 for e in self.mse]).astype(int) +``` + +The ```decode_sequence``` function takes an input sequence and uses the encoder model to retrieve the state vectors of the last LSTM layer in the encoder so they can be used to initialise the LSTM layers in the decoder. The feature values of the first step in the input sequence are used to initialise the output sequence. We can then use the decoder model to make sequential predictions for the output sequence. At each step, we use the previous step's output value and state as decoder inputs. + +```python +def decode_sequence(self,input_seq): + """ Feed output of encoder to decoder and make sequential predictions. 
""" + + # use encoder to get state vectors + states_value = self.enc.predict(input_seq) + + # generate initial target sequence + target_seq = input_seq[0,0,:].reshape((1,1,self.n_features)) + + # sequential prediction of time series + decoded_seq = np.zeros((1, self.timesteps, self.n_features)) + decoded_seq[0,0,:] = target_seq[0,0,:] + i = 1 + while i < self.timesteps: + + decoder_output = self.dec.predict([target_seq] + states_value) + + # update the target sequence + target_seq = np.zeros((1, 1, self.n_features)) + target_seq[0, 0, :] = decoder_output[0] + + # update output + decoded_seq[0, i, :] = decoder_output[0] + + # update states + states_value = decoder_output[1:] + + i+=1 + + return decoded_seq +``` + +## References + +Francois Chollet. A ten-minute introduction to sequence-to-sequence learning in Keras +- https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html + +Christopher Olah. Understanding LSTM Networks +- http://colah.github.io/posts/2015-08-Understanding-LSTMs/ + +Ilya Sutskever, Oriol Vinyals and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. 2014 +- https://arxiv.org/abs/1409.3215 \ No newline at end of file diff --git a/components/outlier-detection/seq2seq-lstm/seq2seq_lstm.ipynb b/components/outlier-detection/seq2seq-lstm/seq2seq_lstm.ipynb index aed82e9850..2bbcb4e9e6 100644 --- a/components/outlier-detection/seq2seq-lstm/seq2seq_lstm.ipynb +++ b/components/outlier-detection/seq2seq-lstm/seq2seq_lstm.ipynb @@ -130,6 +130,39 @@ "## Test using Kubernetes cluster on GCP or Minikube" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the outlier detector as a model or a transformer. 
If you want to run the anomaly detector as a transformer, change the SERVICE_TYPE variable from MODEL to TRANSFORMER [here](./.s2i/environment), set MODEL = False and change ```OutlierSeq2SeqLSTM.py``` to:\n", + "\n", + "```python\n", + "from CoreSeq2SeqLSTM import CoreSeq2SeqLSTM\n", + "\n", + "class OutlierSeq2SeqLSTM(CoreSeq2SeqLSTM):\n", + " \"\"\" Outlier detection using a sequence-to-sequence (seq2seq) LSTM model.\n", + " \n", + " Parameters\n", + " ----------\n", + " threshold (float) : reconstruction error (mse) threshold used to classify outliers\n", + " reservoir_size (int) : number of observations kept in memory using reservoir sampling\n", + " \"\"\"\n", + " def __init__(self,threshold=0.003,reservoir_size=50000,model_name='seq2seq',load_path='./models/'):\n", + " \n", + " super().__init__(threshold=threshold,reservoir_size=reservoir_size,\n", + " model_name=model_name,load_path=load_path)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MODEL = True" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -143,7 +176,7 @@ "metadata": {}, "outputs": [], "source": [ - "minikube = True" + "MINIKUBE = True" ] }, { @@ -152,7 +185,7 @@ "metadata": {}, "outputs": [], "source": [ - "if minikube:\n", + "if MINIKUBE:\n", " !minikube start --memory 4096 --feature-gates=CustomResourceValidation=true \\\n", " --extra-config=apiserver.Authorization.Mode=RBAC\n", "else:\n", @@ -281,7 +314,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If Minikube used: create docker image for outlier detector inside Minikube using s2i." + "If Minikube used: create docker image for outlier detector inside Minikube using s2i. Besides the transformer image and the demo specific model image, the general model image for the Seq2Seq LSTM outlier detector is also available from Docker Hub as ***seldonio/outlier-s2s-lstm-model:0.1***." 
] }, { @@ -290,16 +323,19 @@ "metadata": {}, "outputs": [], "source": [ - "if minikube:\n", + "if MINIKUBE & MODEL:\n", " !eval $(minikube docker-env) && \\\n", - " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-s2s-lstm:0.1" + " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-s2s-lstm-model-demo:0.1\n", + "elif MINIKUBE:\n", + " !eval $(minikube docker-env) && \\\n", + " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-s2s-lstm-transformer:0.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Install outlier detector helm charts and set \"threshold\" and reservoir_size\" hyperparameter values." + "Install outlier detector helm charts and set *threshold* and *reservoir_size* hyperparameter values." ] }, { @@ -308,13 +344,30 @@ "metadata": {}, "outputs": [], "source": [ - "!helm install ../../../helm-charts/seldon-od-s2s-lstm \\\n", - " --set model.image.name=seldonio/outlier-s2s-lstm:0.1 \\\n", - " --set model.threshold=0.002 \\\n", - " --set model.reservoir_size=50000 \\\n", - " --name outlier-detector --set oauth.key=oauth-key \\\n", - " --set oauth.secret=oauth-secret \\\n", - " --namespace=seldon" + "if MODEL:\n", + " !helm install ../../../helm-charts/seldon-od-model \\\n", + " --name outlier-detector \\\n", + " --namespace=seldon \\\n", + " --set model.type=seq2seq \\\n", + " --set model.seq2seq.image.name=seldonio/outlier-s2s-lstm-model-demo:0.1 \\\n", + " --set model.seq2seq.threshold=0.002 \\\n", + " --set model.seq2seq.reservoir_size=50000 \\\n", + " --set oauth.key=oauth-key \\\n", + " --set oauth.secret=oauth-secret \\\n", + " --set replicas=1\n", + "else:\n", + " !helm install ../../../helm-charts/seldon-od-transformer \\\n", + " --name outlier-detector \\\n", + " --namespace=seldon \\\n", + " --set outlierDetection.enabled=true \\\n", + " --set outlierDetection.name=outlier-s2s-lstm \\\n", + " --set outlierDetection.type=seq2seq \\\n", + " --set 
outlierDetection.seq2seq.image.name=seldonio/outlier-s2s-lstm-transformer:0.1 \\\n", + " --set outlierDetection.seq2seq.threshold=0.002 \\\n", + " --set outlierDetection.seq2seq.reservoir_size=50000 \\\n", + " --set oauth.key=oauth-key \\\n", + " --set oauth.secret=oauth-secret \\\n", + " --set model.image.name=seldonio/outlier-s2s-lstm-model:0.1" ] }, { @@ -387,13 +440,21 @@ "response = rest_request_ambassador(\"outlier-detector\",request,endpoint=\"localhost:8003\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the outlier detector is used as a transformer, the output of the anomaly detection is added as part of the metadata. If it is used as a model, we send model feedback to retrieve custom performance metrics." + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "send_feedback_rest(\"outlier-detector\",request,response,0,label,endpoint=\"localhost:8003\")" + "if MODEL:\n", + " send_feedback_rest(\"outlier-detector\",request,response,0,label,endpoint=\"localhost:8003\")" ] }, { @@ -469,17 +530,19 @@ "- Sample random ECG from dataset.\n", "- Get payload for the observation.\n", "- Make a prediction.\n", - "- Send the \"true\" label with the feedback.\n", + "- Send the \"true\" label with the feedback if the detector is run as a model.\n", "\n", "It is important that the prediction-feedback order is maintained. Otherwise there will be a mismatch between the predicted and \"true\" labels.\n", "\n", - "View the progress on the grafana \"Outlier Detection\" dashboard." + "View the progress on the grafana \"Outlier Detection\" dashboard. Most metrics need the outlier detector to be run as a model since they need model feedback." 
] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "scrolled": true + }, "outputs": [], "source": [ "import time\n", @@ -491,7 +554,8 @@ " label = ecg_labels[idx].reshape(1)\n", " request = get_payload(X)\n", " response = rest_request_ambassador(\"outlier-detector\",request,endpoint=\"localhost:8003\")\n", - " send_feedback_rest(\"outlier-detector\",request,response,0,label,endpoint=\"localhost:8003\")\n", + " if MODEL:\n", + " send_feedback_rest(\"outlier-detector\",request,response,0,label,endpoint=\"localhost:8003\")\n", " time.sleep(1)" ] }, @@ -501,7 +565,7 @@ "metadata": {}, "outputs": [], "source": [ - "if minikube:\n", + "if MINIKUBE:\n", " !minikube delete" ] }, diff --git a/components/outlier-detection/seq2seq-lstm/seq2seq_lstm_doc.ipynb b/components/outlier-detection/seq2seq-lstm/seq2seq_lstm_doc.ipynb deleted file mode 100644 index 68bcb9a08f..0000000000 --- a/components/outlier-detection/seq2seq-lstm/seq2seq_lstm_doc.ipynb +++ /dev/null @@ -1,504 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Sequence-to-Sequence LSTM (seq2seq-LSTM) Outlier Algorithm Documentation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The aim of this document is to explain the seq2seq-LSTM algorithm in Seldon's outlier detection framework.\n", - "\n", - "First, we provide a high level overview of the algorithm and the use case, then we will give a detailed explanation of the implementation." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Outlier detection has many applications, ranging from preventing credit card fraud to detecting computer network intrusions. The available data is typically unlabeled and detection needs to be done in real-time. 
The outlier detector can be used as a standalone algorithm, or to detect anomalies in the input data of another predictive model.\n", - "\n", - "The seq2seq-LSTM outlier detection algorithm is suitable for time series data and predicts whether a sequence of input features is an outlier or not, dependent on a threshold level set by the user. The algorithm needs to be pretrained first on a batch of -preferably- inliers.\n", - "\n", - "As observations arrive, the algorithm will:\n", - "- clip and scale the input features\n", - "- first encode, and then sequentially decode the input time series data in an attempt to reconstruct the initial observations\n", - "- compute a reconstruction error between the output of the decoder and the input data\n", - "- predict that the observation is an outlier if the error is larger than the threshold level" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Why Sequence-to-Sequence Models?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Seq2seq models convert sequences from one domain into sequences in another domain. A typical example would be sentence translation between different languages. A seq2seq model consists of 2 main building blocks: an encoder and a decoder. The encoder processes the input sequence and initializes the decoder. The decoder then makes sequential predictions for the output sequence. In our case, the decoder aims to reconstruct the input sequence. Both the encoder and decoder are typically implemented with recurrent or 1D convolutional neural networks. Our implementation uses a type of recurrent neural network called LSTM networks. An excellent explanation of how LSTM units work is available [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/). 
The loss function to be minimized with stochastic gradient descent is the mean squared error between the input and output sequence, and is called the reconstruction error.\n", - "\n", - "If we train the seq2seq model with inliers, it will be able to replicate new inlier data well with a low reconstruction error. However, if outliers are fed to the seq2seq model, the reconstruction error becomes large and we can classify the sequence as an anomaly." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Implementation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The implementation is inspired by [this blog post](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Building the seq2seq-LSTM Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The seq2seq model definition in ```model.py``` takes 4 arguments that define the architecture:\n", - "- the number of features in the input\n", - "- a list with the number of units per [bidirectional](https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks) LSTM layer in the encoder\n", - "- a list with the number of units per LSTM layer in the decoder\n", - "- the output activation type for the dense output layer on top of the last LSTM unit in the decoder" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - "def model(n_features, encoder_dim = [20], decoder_dim = [20], dropout=0., learning_rate=.001, \n", - " loss='mean_squared_error', output_activation='sigmoid'):\n", - " \"\"\" Build seq2seq model.\n", - " \n", - " Arguments:\n", - " - n_features (int): number of features in the data\n", - " - encoder_dim (list): list with number of units per encoder layer\n", - " - decoder_dim (list): list with number of units per decoder layer\n", - " - dropout (float): dropout 
for LSTM units\n", - " - learning_rate (float): learning rate used during training\n", - " - loss (str): loss function used\n", - " - output_activation (str): activation type for the dense output layer in the decoder\n", - " \"\"\"\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, we define the bidirectional LSTM layers in the encoder and keep the state of the last LSTM unit to initialise the decoder:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# add encoder hidden layers\n", - "encoder_lstm = []\n", - "for i in range(enc_dim-1):\n", - " encoder_lstm.append(Bidirectional(LSTM(encoder_dim[i], dropout=dropout, \n", - " return_sequences=True,name='encoder_lstm_' + str(i))))\n", - " encoder_hidden = encoder_lstm[i](encoder_hidden)\n", - "\n", - "encoder_lstm.append(Bidirectional(LSTM(encoder_dim[-1], dropout=dropout, return_state=True, \n", - " name='encoder_lstm_' + str(enc_dim-1))))\n", - "encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder_lstm[-1](encoder_hidden)\n", - "\n", - "# only need to keep encoder states\n", - "state_h = Concatenate()([forward_h, backward_h])\n", - "state_c = Concatenate()([forward_c, backward_c])\n", - "encoder_states = [state_h, state_c]\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can then define the LSTM units in the decoder, with the states initialised by the encoder:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# initialise decoder states with encoder states\n", - "decoder_lstm = []\n", - "for i in range(dec_dim):\n", - " decoder_lstm.append(LSTM(decoder_dim[i], dropout=dropout, return_sequences=True,\n", - " return_state=True, name='decoder_lstm_' + str(i)))\n", - " decoder_hidden, _, _ = decoder_lstm[i](decoder_hidden, initial_state=encoder_states)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ 
- "We add a dense layer with output activation of choice on top of the last LSTM layer in the decoder and compile the model:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# add linear layer on top of LSTM\n", - "decoder_dense = Dense(n_features, activation=output_activation, name='dense_output')\n", - "decoder_outputs = decoder_dense(decoder_hidden)\n", - "\n", - "# define seq2seq model\n", - "model = Model([encoder_inputs, decoder_inputs], decoder_outputs)\n", - "optimizer = Adam(lr=learning_rate)\n", - "model.compile(optimizer=optimizer, loss=loss)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The decoder predictions are sequential and we only need the encoder states to initialise the decoder for the first item in the sequence. From then on, the output and state of the decoder at each step in the sequence is used to predict the next item. As a result, we define separate encoder and decoder models for the prediction stage:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# define encoder model returning encoder states\n", - "encoder_model = Model(encoder_inputs, encoder_states * dec_dim)\n", - "\n", - "# define decoder model\n", - "# need state inputs for each LSTM layer\n", - "decoder_states_inputs = []\n", - "for i in range(dec_dim):\n", - " decoder_state_input_h = Input(shape=(decoder_dim[i],), name='decoder_state_input_h_' + str(i))\n", - " decoder_state_input_c = Input(shape=(decoder_dim[i],), name='decoder_state_input_c_' + str(i))\n", - " decoder_states_inputs.append([decoder_state_input_h, decoder_state_input_c])\n", - "decoder_states_inputs = [state for states in decoder_states_inputs for state in states]\n", - "\n", - "decoder_inference = decoder_inputs\n", - "decoder_states = []\n", - "for i in range(dec_dim):\n", - " decoder_inference, state_h, state_c = decoder_lstm[i](decoder_inference, \n", - " 
initial_state=decoder_states_inputs[2*i:2*i+2])\n", - " decoder_states.append([state_h,state_c])\n", - "decoder_states = [state for states in decoder_states for state in states]\n", - "\n", - "decoder_outputs = decoder_dense(decoder_inference)\n", - "decoder_model = Model([decoder_inputs] + decoder_states_inputs,\n", - " [decoder_outputs] + decoder_states)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Training the model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The seq2seq-LSTM model can be trained on a batch of -ideally- inliers by running the ```train.py``` script with the desired hyperparameters. The example below trains the model on the first 2628 ECG's of the ECG5000 dataset. The input/output sequence has a length of 140, the encoder has 1 bidirectional LSTM layer with 20 units, and the decoder consists of 1 LSTM layer with 40 units. This has to be 2x the number of units of the bidirectional encoder because both the forward and backward encoder states are used to initialise the decoder. Feature-wise minmax scaling between 0 and 1 is applied to the input sequence so we can use a sigmoid activation in the decoder's output layer." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "!python train.py \\\n", - "--dataset './data/ECG5000_TEST.arff' \\\n", - "--data_range 0 2627 \\\n", - "--minmax \\\n", - "--timesteps 140 \\\n", - "--encoder_dim 20 \\\n", - "--decoder_dim 40 \\\n", - "--output_activation 'sigmoid' \\\n", - "--dropout 0 \\\n", - "--learning_rate 0.005 \\\n", - "--loss 'mean_squared_error' \\\n", - "--epochs 100 \\\n", - "--batch_size 32 \\\n", - "--validation_split 0.2 \\\n", - "--model_name 'seq2seq' \\\n", - "--print_progress \\\n", - "--save \\\n", - "--save_path './models/'\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The model weights and hyperparameters are saved in the folder specified by \"save_path\"." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. Making predictions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In order to make predictions, which can then be served by Seldon Core, the pre-trained model weights and hyperparameters are loaded when defining an OutlierSeq2SeqLSTM object. The \"threshold\" argument defines above which reconstruction error a sample is classified as an outlier. The threshold is a key hyperparameter and needs to be picked carefully for each application." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "class OutlierSeq2SeqLSTM(object):\n", - " \"\"\" Outlier detection using a sequence-to-sequence (seq2seq) LSTM model.\n", - " \n", - " Arguments:\n", - " - threshold: (float): reconstruction error (mse) threshold used to classify outliers\n", - " - reservoir_size (int): number of observations kept in memory using reservoir sampling\n", - " \n", - " Functions:\n", - " - reservoir_sampling: applies reservoir sampling to incoming data\n", - " - predict: detect and return outliers\n", - " - send_feedback: add target labels as part of the feedback loop\n", - " - metrics: return custom metrics\n", - " \"\"\"\n", - " def __init__(self,threshold=0.003,reservoir_size=50000,model_name='model',load_path='./models/'):\n", - " \n", - " self.threshold = threshold\n", - " self.reservoir_size = reservoir_size\n", - " self.batch = []\n", - " self.N = 0 # total sample count up until now for reservoir sampling\n", - " \n", - " # load model architecture parameters\n", - " with open(load_path + model_name + '.pickle', 'rb') as f:\n", - " self.timesteps, self.n_features, encoder_dim, decoder_dim, output_activation = pickle.load(f)\n", - " \n", - " # instantiate model\n", - " self.s2s, self.enc, self.dec = model(self.n_features,encoder_dim=encoder_dim,\n", - " decoder_dim=decoder_dim,output_activation=output_activation)\n", - " self.s2s.load_weights(load_path + model_name + '_weights.h5') # load pretrained model weights\n", - " self.s2s._make_predict_function()\n", - " self.enc._make_predict_function()\n", - " self.dec._make_predict_function()\n", - " \n", - " # load data preprocessing info\n", - " with open(load_path + 'preprocess_' + model_name + '.pickle', 'rb') as f:\n", - " preprocess = pickle.load(f)\n", - " self.preprocess, self.clip, self.axis = preprocess[:3]\n", - " if self.preprocess=='minmax':\n", - " self.xmin, self.xmax = preprocess[3:5]\n", - " self.min, self.max = 
preprocess[5:]\n", - " elif self.preprocess=='standardized':\n", - " self.mu, self.sigma = preprocess[3:]\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The predict method does the actual outlier detection." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "def predict(self,X,feature_names):\n", - " \"\"\" Detect outliers from mse using the threshold. \n", - "\n", - " Arguments:\n", - " - X: input data\n", - " - feature_names\n", - " \"\"\"\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First the data is (optionally) clipped. If the number of observations fed to the outlier detector up until now is at least equal to the defined reservoir size, the feature-wise scaling parameters are updated using the observations in the reservoir. The reservoir is updated each observation using reservoir sampling. We can then scale the input data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# clip data per feature\n", - "for col,clip in enumerate(self.clip):\n", - " X[:,:,col] = np.clip(X[:,:,col],-clip,clip)\n", - "\n", - "# update reservoir\n", - "if self.N < self.reservoir_size:\n", - " update_stand = False\n", - "else:\n", - " update_stand = True\n", - "\n", - "self.reservoir_sampling(X,update_stand=update_stand)\n", - "\n", - "# apply scaling\n", - "if self.preprocess=='minmax':\n", - " X = ((X - self.xmin) / (self.xmax - self.xmin)) * (self.max - self.min) + self.min\n", - "elif self.preprocess=='standardized':\n", - " X = (X - self.mu) / (self.sigma + 1e-10)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We then make predictions using the ```decode_sequence``` function and calculate the mean squared error between the input and output sequences. If this value is above the threshold, an outlier is predicted." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "# make predictions\n", - "n_obs = X.shape[0]\n", - "self.mse = np.zeros(n_obs)\n", - "for obs in range(n_obs):\n", - " input_seq = X[obs:obs+1,:,:]\n", - " decoded_seq = self.decode_sequence(input_seq)\n", - " self.mse[obs] = np.mean(np.power(input_seq[0,:,:] - decoded_seq[0,:,:], 2))\n", - "self.prediction = np.array([1 if e > self.threshold else 0 for e in self.mse]).astype(int)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The ```decode_sequence``` function takes an input sequence and uses the encoder model to retrieve the state vectors of the last LSTM layer in the encoder so they can be used to initialise the LSTM layers in the decoder. The feature values of the first step in the input sequence are used to initialise the output sequence. We can then use the decoder model to make sequential predictions for the output sequence. At each step, we use the previous step's output value and state as decoder inputs." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```python\n", - "def decode_sequence(self,input_seq):\n", - " \"\"\" Feed output of encoder to decoder and make sequential predictions. 
\"\"\"\n", - "\n", - " # use encoder the get state vectors\n", - " states_value = self.enc.predict(input_seq)\n", - "\n", - " # generate initial target sequence\n", - " target_seq = input_seq[0,0,:].reshape((1,1,self.n_features))\n", - "\n", - " # sequential prediction of time series\n", - " decoded_seq = np.zeros((1, self.timesteps, self.n_features))\n", - " decoded_seq[0,0,:] = target_seq[0,0,:]\n", - " i = 1\n", - " while i < self.timesteps:\n", - "\n", - " decoder_output = self.dec.predict([target_seq] + states_value)\n", - "\n", - " # update the target sequence\n", - " target_seq = np.zeros((1, 1, self.n_features))\n", - " target_seq[0, 0, :] = decoder_output[0]\n", - "\n", - " # update output\n", - " decoded_seq[0, i, :] = decoder_output[0]\n", - "\n", - " # update states\n", - " states_value = decoder_output[1:]\n", - "\n", - " i+=1\n", - "\n", - " return decoded_seq\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## References" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Francois Chollet. A ten-minute introduction to sequence-to-sequence learning in Keras\n", - "- https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html\n", - "\n", - "Christopher Olah. Understanding LSTM Networks\n", - "- http://colah.github.io/posts/2015-08-Understanding-LSTMs/\n", - "\n", - "Ilya Sutskever, Oriol Vinyals and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. 
2014\n", - "- https://arxiv.org/abs/1409.3215" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/components/outlier-detection/vae/CoreVAE.py b/components/outlier-detection/vae/CoreVAE.py new file mode 100644 index 0000000000..79d736435c --- /dev/null +++ b/components/outlier-detection/vae/CoreVAE.py @@ -0,0 +1,182 @@ +import logging +import numpy as np +import pickle +import random + +from model import model + +logger = logging.getLogger(__name__) + + +class CoreVAE(object): + """ Outlier detection using variational autoencoders (VAE). 
+ + Parameters + ---------- + threshold (float) : reconstruction error (mse) threshold used to classify outliers + reservoir_size (int) : number of observations kept in memory using reservoir sampling + + Functions + ---------- + reservoir_sampling : applies reservoir sampling to incoming data + predict : detect and return outliers + transform_input : detect outliers and return input features + send_feedback : add target labels as part of the feedback loop + tags : add metadata for input transformer + metrics : return custom metrics + """ + + def __init__(self,threshold=10,reservoir_size=50000,model_name='vae',load_path='./models/'): + + logger.info("Initializing model") + self.threshold = threshold + self.reservoir_size = reservoir_size + self.batch = [] + self.N = 0 # total sample count up until now for reservoir sampling + self.nb_outliers = 0 + + # load model architecture parameters + with open(load_path + model_name + '.pickle', 'rb') as f: + n_features, hidden_layers, latent_dim, hidden_dim, output_activation = pickle.load(f) + + # instantiate model + self.vae = model(n_features,hidden_layers=hidden_layers,latent_dim=latent_dim, + hidden_dim=hidden_dim,output_activation=output_activation) + self.vae.load_weights(load_path + model_name + '_weights.h5') # load pretrained model weights + self.vae._make_predict_function() + + # load data preprocessing info + with open(load_path + 'preprocess_' + model_name + '.pickle', 'rb') as f: + preprocess = pickle.load(f) + self.preprocess, self.clip, self.axis = preprocess[:3] + if self.preprocess=='minmax': + self.xmin, self.xmax = preprocess[3:5] + self.min, self.max = preprocess[5:] + elif self.preprocess=='standardized': + self.mu, self.sigma = preprocess[3:] + + + def reservoir_sampling(self,X,update_stand=False): + """ Keep batch of data in memory using reservoir sampling. 
""" + for item in X: + self.N+=1 + if len(self.batch) < self.reservoir_size: + self.batch.append(item) + else: + s = int(random.random() * self.N) + if s < self.reservoir_size: + self.batch[s] = item + + if update_stand: + if self.preprocess=='minmax': + self.xmin = np.array(self.batch).min(axis=self.axis) + self.xmax = np.array(self.batch).max(axis=self.axis) + elif self.preprocess=='standardized': + self.mu = np.array(self.batch).mean(axis=self.axis) + self.sigma = np.array(self.batch).std(axis=self.axis) + return + + + def predict(self, X, feature_names): + """ Return outlier predictions. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as a model") + return self._get_preds(X) + + + def transform_input(self, X, feature_names): + """ Transform the input. + Used when the outlier detector sits on top of another model. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as an outlier-detector transformer") + self.prediction_meta = self._get_preds(X) + return X + + + def _get_preds(self, X): + """ Detect outliers if the reconstruction error is above the threshold. 
+ + Parameters + ---------- + X : array-like + """ + + # clip data per feature + X = np.clip(X,[-c for c in self.clip],self.clip) + + if self.N < self.reservoir_size: + update_stand = False + else: + update_stand = True + + self.reservoir_sampling(X,update_stand=update_stand) + + # apply scaling + if self.preprocess=='minmax': + X_scaled = ((X - self.xmin) / (self.xmax - self.xmin)) * (self.max - self.min) + self.min + elif self.preprocess=='standardized': + X_scaled = (X - self.mu) / (self.sigma + 1e-10) + + # sample latent variables and calculate reconstruction errors + N = 10 + mse = np.zeros([X.shape[0],N]) + for i in range(N): + preds = self.vae.predict(X_scaled) + mse[:,i] = np.mean(np.power(X_scaled - preds, 2), axis=1) + self.mse = np.mean(mse, axis=1) + + # make prediction + self.prediction = np.array([1 if e > self.threshold else 0 for e in self.mse]).astype(int) + + return self.prediction + + + def send_feedback(self,X,feature_names,reward,truth): + """ Return additional data as part of the feedback loop. + + Parameters + ---------- + X : array of the features sent in the original predict request + feature_names : array of feature names. May be None if not available. + reward (float): the reward + truth : array with correct value (optional) + """ + logger.info("Send feedback called") + return [] + + + def tags(self): + """ + Use predictions made within transform to add these as metadata + to the response. Tags will only be collected if the component is + used as an input-transformer. + """ + try: + return {"outlier-predictions": self.prediction_meta.tolist()} + except AttributeError: + logger.info("No metadata about outliers") + + + def metrics(self): + """ Return custom metrics averaged over the prediction batch. 
+ """ + self.nb_outliers += np.sum(self.prediction) + + is_outlier = {"type":"GAUGE","key":"is_outlier","value":np.mean(self.prediction)} + mse = {"type":"GAUGE","key":"mse","value":np.mean(self.mse)} + nb_outliers = {"type":"GAUGE","key":"nb_outliers","value":int(self.nb_outliers)} + fraction_outliers = {"type":"GAUGE","key":"fraction_outliers","value":int(self.nb_outliers)/self.N} + obs = {"type":"GAUGE","key":"observation","value":self.N} + threshold = {"type":"GAUGE","key":"threshold","value":self.threshold} + + return [is_outlier,mse,nb_outliers,fraction_outliers,obs,threshold] \ No newline at end of file diff --git a/components/outlier-detection/vae/OutlierVAE.py b/components/outlier-detection/vae/OutlierVAE.py index aa9b5a2fd9..7f92dcf866 100644 --- a/components/outlier-detection/vae/OutlierVAE.py +++ b/components/outlier-detection/vae/OutlierVAE.py @@ -1,50 +1,27 @@ import numpy as np -import pickle -import random -from model import model +from CoreVAE import CoreVAE from utils import flatten, performance, outlier_stats -class OutlierVAE(object): +class OutlierVAE(CoreVAE): """ Outlier detection using variational autoencoders (VAE). 
- Arguments: - - threshold: (float): reconstruction error (mse) threshold used to classify outliers - - reservoir_size (int): number of observations kept in memory using reservoir sampling used for mean and stdev + Parameters + ---------- + threshold (float) : reconstruction error (mse) threshold used to classify outliers + reservoir_size (int) : number of observations kept in memory using reservoir sampling - Functions: - - reservoir_sampling: applies reservoir sampling to incoming data - - predict: detect and return outliers - - send_feedback: add target labels as part of the feedback loop - - metrics: return custom metrics + Functions + ---------- + send_feedback : add target labels as part of the feedback loop + metrics : return custom metrics """ + def __init__(self,threshold=10,reservoir_size=50000,model_name='vae',load_path='./models/'): - self.threshold = threshold - self.reservoir_size = reservoir_size - self.batch = [] - self.N = 0 # total sample count up until now for reservoir sampling - - # load model architecture parameters - with open(load_path + model_name + '.pickle', 'rb') as f: - n_features, hidden_layers, latent_dim, hidden_dim, output_activation = pickle.load(f) - - # instantiate model - self.vae = model(n_features,hidden_layers=hidden_layers,latent_dim=latent_dim, - hidden_dim=hidden_dim,output_activation=output_activation) - self.vae.load_weights(load_path + model_name + '_weights.h5') # load pretrained model weights - self.vae._make_predict_function() - - # load data preprocessing info - with open(load_path + 'preprocess_' + model_name + '.pickle', 'rb') as f: - preprocess = pickle.load(f) - self.preprocess, self.clip, self.axis = preprocess[:3] - if self.preprocess=='minmax': - self.xmin, self.xmax = preprocess[3:5] - self.min, self.max = preprocess[5:] - elif self.preprocess=='standardized': - self.mu, self.sigma = preprocess[3:] + super().__init__(threshold=threshold,reservoir_size=reservoir_size, + 
model_name=model_name,load_path=load_path) self._predictions = [] self._labels = [] @@ -52,83 +29,31 @@ def __init__(self,threshold=10,reservoir_size=50000,model_name='vae',load_path=' self.roll_window = 100 self.metric = [float('nan') for i in range(18)] - - def reservoir_sampling(self,X,update_stand=False): - """ Keep batch of data in memory using reservoir sampling. """ - for item in X: - self.N+=1 - if len(self.batch) < self.reservoir_size: - self.batch.append(item) - else: - s = int(random.random() * self.N) - if s < self.reservoir_size: - self.batch[s] = item - - if update_stand: - if self.preprocess=='minmax': - self.xmin = np.array(self.batch).min(axis=self.axis) - self.xmax = np.array(self.batch).max(axis=self.axis) - elif self.preprocess=='standardized': - self.mu = np.array(self.batch).mean(axis=self.axis) - self.sigma = np.array(self.batch).std(axis=self.axis) - return - - - def predict(self,X,feature_names): - """ Detect outliers from mse using the threshold. + + def send_feedback(self,X,feature_names,reward,truth): + """ Return outlier labels as part of the feedback loop. - Arguments: - - X: input data - - feature_names + Parameters + ---------- + X : array of the features sent in the original predict request + feature_names : array of feature names. May be None if not available. 
+ reward (float): the reward + truth : array with correct value (optional) """ + _ = super().send_feedback(X,feature_names,reward,truth) - # clip data per feature - X = np.clip(X,[-c for c in self.clip],self.clip) - - if self.N < self.reservoir_size: - update_stand = False - else: - update_stand = True - - self.reservoir_sampling(X,update_stand=update_stand) - - # apply scaling - if self.preprocess=='minmax': - X_scaled = ((X - self.xmin) / (self.xmax - self.xmin)) * (self.max - self.min) + self.min - elif self.preprocess=='standardized': - X_scaled = (X - self.mu) / (self.sigma + 1e-10) - - # sample latent variables and calculate reconstruction errors - N = 10 - mse = np.zeros([X.shape[0],N]) - for i in range(N): - preds = self.vae.predict(X_scaled) - mse[:,i] = np.mean(np.power(X_scaled - preds, 2), axis=1) - self.mse = np.mean(mse, axis=1) + # historical reconstruction errors and predictions self._mse.append(self.mse) self._mse = flatten(self._mse) - - # make prediction - self.prediction = np.array([1 if e > self.threshold else 0 for e in self.mse]).astype(int) self._predictions.append(self.prediction) self._predictions = flatten(self._predictions) - return self.prediction - - - def send_feedback(self,X,feature_names,reward,truth): - """ Return outlier labels as part of the feedback loop. - - Arguments: - - X: input data - - feature_names - - reward - - truth: outlier labels - """ + # target labels self.label = truth self._labels.append(self.label) self._labels = flatten(self._labels) + # performance metrics scores = performance(self._labels,self._predictions,roll_window=self.roll_window) stats = outlier_stats(self._labels,self._predictions,roll_window=self.roll_window) @@ -137,9 +62,9 @@ def send_feedback(self,X,feature_names,reward,truth): for c in convert: # convert from np to native python type to jsonify metric.append(np.asscalar(np.asarray(c))) self.metric = metric - - return - + + return [] + def metrics(self): """ Return custom metrics. 
@@ -154,8 +79,8 @@ def metrics(self): err = float('nan') y_true = float('nan') else: - pred = int(self._predictions[-2]) - err = self._mse[-2] + pred = int(self._predictions[-1]) + err = self._mse[-1] y_true = int(self.label[0]) is_outlier = {"type":"GAUGE","key":"is_outlier","value":pred} diff --git a/components/outlier-detection/vae/README.md b/components/outlier-detection/vae/README.md index 2d3a0ca791..07bc4ecc70 100644 --- a/components/outlier-detection/vae/README.md +++ b/components/outlier-detection/vae/README.md @@ -8,8 +8,15 @@ The architecture of the VAE is defined in ```model.py``` and the model is trained by running the ```train.py``` script. The ```OutlierVAE``` class loads a pre-trained model and makes predictions on new data. -A detailed explanation of the implementation and usage of the Variational Auto-Encoder as an outlier detector can be found in the [outlier_vae_doc](./outlier_vae_doc.ipynb) notebook. +A detailed explanation of the implementation and usage of the Variational Auto-Encoder as an outlier detector can be found in the [VAE documentation](./doc.md). ## Running on Seldon -An end-to-end example running a VAE outlier detector on GCP or Minikube using Seldon to identify computer network intrusions is available [here](./outlier_vae.ipynb). \ No newline at end of file +An end-to-end example running a VAE outlier detector on GCP or Minikube using Seldon to identify computer network intrusions is available [here](./outlier_vae.ipynb). 
+ +Docker images to use the generic VAE outlier detector as a model or transformer can be found on Docker Hub: +* [seldonio/outlier-vae-model](https://hub.docker.com/r/seldonio/outlier-vae-model) +* [seldonio/outlier-vae-transformer](https://hub.docker.com/r/seldonio/outlier-vae-transformer) + +A model docker image specific for the demo is also available: +* [seldonio/outlier-vae-model-demo](https://hub.docker.com/r/seldonio/outlier-vae-model-demo) \ No newline at end of file diff --git a/components/outlier-detection/vae/doc.md b/components/outlier-detection/vae/doc.md new file mode 100644 index 0000000000..d26290affc --- /dev/null +++ b/components/outlier-detection/vae/doc.md @@ -0,0 +1,292 @@ +# Variational Auto-Encoder Outlier (VAE) Algorithm Documentation + +The aim of this document is to explain the Variational Auto-Encoder algorithm in Seldon's outlier detection framework. + +First, we provide a high level overview of the algorithm and the use case, then we will give a detailed explanation of the implementation. + +## Overview + +Outlier detection has many applications, ranging from preventing credit card fraud to detecting computer network intrusions. The available data is typically unlabeled and detection needs to be done in real-time. The outlier detector can be used as a standalone algorithm, or to detect anomalies in the input data of another predictive model. + +The VAE outlier detection algorithm predicts whether the input features are an outlier or not, dependent on a threshold level set by the user. The algorithm needs to be pretrained first on a batch of -preferably- inliers. 
+ +As observations arrive, the algorithm will: +- scale (standardize or minmax) the input features +- first encode, and then decode the input data in an attempt to reconstruct the initial observations +- compute a reconstruction error between the output of the decoder and the input data +- predict that the observation is an outlier if the error is larger than the threshold level + +## Why Variational Auto-Encoders? + +An Auto-Encoder is an algorithm that consists of 2 main building blocks: an encoder and a decoder. The encoder tries to find a compressed representation of the input data. The compressed data is then fed into the decoder, which aims to replicate the input data. Both the encoder and decoder are typically implemented with neural networks. The loss function to be minimized with stochastic gradient descent is a distance function between the input data and output of the decoder, and is called the reconstruction error. + +If we train the Auto-Encoder with inliers, it will be able to replicate new inlier data well with a low reconstruction error. However, if outliers are fed to the Auto-Encoder, the reconstruction error becomes large and we can classify the observation as an anomaly. + +A Variational Auto-Encoder adds constraints to the encoded representations of the input. The encodings are parameters of a probability distribution modeling the data. The decoder can then generate new data by sampling from the learned distribution. + +## Implementation + +### 1. Building the VAE model + +The VAE model definition in ```model.py``` takes 4 arguments that define the architecture: +- the number of features in the input +- the number of hidden layers used in the encoder and decoder +- the dimension of the latent variable +- the dimensions of each hidden layer + +``` python +def model(n_features, hidden_layers=1, latent_dim=2, hidden_dim=[], + output_activation='sigmoid', learning_rate=0.001): + """ Build VAE model. 
+ + Arguments: + - n_features (int): number of features in the data + - hidden_layers (int): number of hidden layers used in encoder/decoder + - latent_dim (int): dimension of latent variable + - hidden_dim (list): list with dimension of each hidden layer + - output_activation (str): activation type for last dense layer in the decoder + - learning_rate (float): learning rate used during training + """ +``` + +First, the input data feeds in the encoder and is compressed by mapping it on the latent space which defines the probability distribution of the encodings: + +``` python + # encoder + inputs = Input(shape=(n_features,), name='encoder_input') + # define hidden layers + enc_hidden = Dense(hidden_dim[0], activation='relu', name='encoder_hidden_0')(inputs) + i = 1 + while i < hidden_layers: + enc_hidden = Dense(hidden_dim[i],activation='relu',name='encoder_hidden_'+str(i))(enc_hidden) + i+=1 + + z_mean = Dense(latent_dim, name='z_mean')(enc_hidden) + z_log_var = Dense(latent_dim, name='z_log_var')(enc_hidden) +``` + +We can then sample data from the latent space. + +``` python +def sampling(args): + """ Reparameterization trick by sampling from an isotropic unit Gaussian. + + Arguments: + - args (tensor): mean and log of variance of Q(z|X) + + Returns: + - z (tensor): sampled latent vector + """ + z_mean, z_log_var = args + batch = K.shape(z_mean)[0] + dim = K.int_shape(z_mean)[1] + epsilon = K.random_normal(shape=(batch, dim)) # by default, random_normal has mean=0 and std=1.0 + return z_mean + K.exp(0.5 * z_log_var) * epsilon # mean + stdev * eps +``` + +``` python + # reparametrization trick to sample z + z = Lambda(sampling, output_shape=(latent_dim,), name='z')([z_mean, z_log_var]) +``` + +The sampled data passes through the decoder which aims to reconstruct the input. 
+ +``` python + # decoder + latent_inputs = Input(shape=(latent_dim,), name='z_sampling') + # define hidden layers + dec_hidden = Dense(hidden_dim[-1], activation='relu', name='decoder_hidden_0')(latent_inputs) + + i = 2 + while i < hidden_layers+1: + dec_hidden = Dense(hidden_dim[-i],activation='relu',name='decoder_hidden_'+str(i-1))(dec_hidden) + i+=1 + + outputs = Dense(n_features, activation=output_activation, name='decoder_output')(dec_hidden) +``` + +The loss function is the sum of the reconstruction error and the KL-divergence. While the reconstruction error quantifies how well we can recreate the input data, the KL-divergence measures how close the latent representation is to the unit Gaussian distribution. This trade-off is important because we want our encodings to parameterize a probability distribution from which we can sample data. + +``` python + # define VAE loss, optimizer and compile model + reconstruction_loss = mse(inputs, outputs) + reconstruction_loss *= n_features + kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var) + kl_loss = K.sum(kl_loss, axis=-1) + kl_loss *= -0.5 + vae_loss = K.mean(reconstruction_loss + kl_loss) + vae.add_loss(vae_loss) +``` + +### 2. Training the model + +The VAE model can be trained on a batch of inliers by running the ```train.py``` script with the desired hyperparameters: + +``` python +!python train.py \ +--dataset 'kddcup99' \ +--samples 50000 \ +--keep_cols "$cols_str" \ +--hidden_layers 1 \ +--latent_dim 2 \ +--hidden_dim 9 \ +--output_activation 'sigmoid' \ +--clip 999999 \ +--standardized \ +--epochs 10 \ +--batch_size 32 \ +--learning_rate 0.001 \ +--print_progress \ +--model_name 'vae' \ +--save \ +--save_path './models/' +``` + +The model weights and hyperparameters are saved in the folder specified by "save_path". + +### 3. 
Making predictions + +In order to make predictions, which can then be served by Seldon Core, the pre-trained model weights and hyperparameters are loaded when defining an OutlierVAE object. The "threshold" argument defines above which reconstruction error a sample is classified as an outlier. The threshold is a key hyperparameter and needs to be picked carefully for each application. The OutlierVAE class inherits from the CoreVAE class in ```CoreVAE.py```. + +```python +class CoreVAE(object): + """ Outlier detection using variational autoencoders (VAE). + + Parameters + ---------- + threshold (float) : reconstruction error (mse) threshold used to classify outliers + reservoir_size (int) : number of observations kept in memory using reservoir sampling + + Functions + ---------- + reservoir_sampling : applies reservoir sampling to incoming data + predict : detect and return outliers + transform_input : detect outliers and return input features + send_feedback : add target labels as part of the feedback loop + tags : add metadata for input transformer + metrics : return custom metrics + """ + + def __init__(self,threshold=10,reservoir_size=50000,model_name='vae',load_path='./models/'): + + logger.info("Initializing model") + self.threshold = threshold + self.reservoir_size = reservoir_size + self.batch = [] + self.N = 0 # total sample count up until now for reservoir sampling + self.nb_outliers = 0 + + # load model architecture parameters + with open(load_path + model_name + '.pickle', 'rb') as f: + n_features, hidden_layers, latent_dim, hidden_dim, output_activation = pickle.load(f) + + # instantiate model + self.vae = model(n_features,hidden_layers=hidden_layers,latent_dim=latent_dim, + hidden_dim=hidden_dim,output_activation=output_activation) + self.vae.load_weights(load_path + model_name + '_weights.h5') # load pretrained model weights + self.vae._make_predict_function() + + # load data preprocessing info + with open(load_path + 'preprocess_' + model_name + 
'.pickle', 'rb') as f: + preprocess = pickle.load(f) + self.preprocess, self.clip, self.axis = preprocess[:3] + if self.preprocess=='minmax': + self.xmin, self.xmax = preprocess[3:5] + self.min, self.max = preprocess[5:] + elif self.preprocess=='standardized': + self.mu, self.sigma = preprocess[3:] +``` + +``` python +class OutlierVAE(CoreVAE): + """ Outlier detection using variational autoencoders (VAE). + + Parameters + ---------- + threshold (float) : reconstruction error (mse) threshold used to classify outliers + reservoir_size (int) : number of observations kept in memory using reservoir sampling + + Functions + ---------- + send_feedback : add target labels as part of the feedback loop + metrics : return custom metrics + """ + + def __init__(self,threshold=10,reservoir_size=50000,model_name='vae',load_path='./models/'): + + super().__init__(threshold=threshold,reservoir_size=reservoir_size, + model_name=model_name,load_path=load_path) +``` + +The actual outlier detection is done by the ```_get_preds``` method which is invoked by ```predict``` or ```transform_input``` dependent on whether the detector is defined as respectively a model or a transformer. + +```python +def predict(self, X, feature_names): + """ Return outlier predictions. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as a model") + return self._get_preds(X) +``` + +```python +def transform_input(self, X, feature_names): + """ Transform the input. + Used when the outlier detector sits on top of another model. + + Parameters + ---------- + X : array-like + feature_names : array of feature names (optional) + """ + logger.info("Using component as an outlier-detector transformer") + self.prediction_meta = self._get_preds(X) + return X +``` + +In ```_get_preds```, the observations are first clipped. 
If the number of observations fed to the outlier detector up until now is at least equal to the defined reservoir size, the feature-wise scaling parameters are updated using the observations in the reservoir. The reservoir is updated each observation using reservoir sampling. The input data is then scaled using either standardization or minmax scaling. + +``` python + # clip data per feature + X = np.clip(X,[-c for c in self.clip],self.clip) + + if self.N < self.reservoir_size: + update_stand = False + else: + update_stand = True + + self.reservoir_sampling(X,update_stand=update_stand) + + # apply scaling + if self.preprocess=='minmax': + X_scaled = ((X - self.xmin) / (self.xmax - self.xmin)) * (self.max - self.min) + self.min + elif self.preprocess=='standardized': + X_scaled = (X - self.mu) / (self.sigma + 1e-10) +``` + +We then make multiple predictions for an observation by sampling N times from the latent space. The mean squared error between the input data and output of the decoder is averaged across the N samples. If this value is above the threshold, an outlier is predicted. + +``` python + # sample latent variables and calculate reconstruction errors + N = 10 + mse = np.zeros([X.shape[0],N]) + for i in range(N): + preds = self.vae.predict(X_scaled) + mse[:,i] = np.mean(np.power(X_scaled - preds, 2), axis=1) + self.mse = np.mean(mse, axis=1) + + # make prediction + self.prediction = np.array([1 if e > self.threshold else 0 for e in self.mse]).astype(int) +``` + +## References + +Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. ICLR 2014. +- https://arxiv.org/pdf/1312.6114.pdf + +Francois Chollet. Building Autoencoders in Keras. 
+- https://blog.keras.io/building-autoencoders-in-keras.html \ No newline at end of file diff --git a/components/outlier-detection/vae/outlier_vae.ipynb b/components/outlier-detection/vae/outlier_vae.ipynb index 45cf742066..6aefef367f 100644 --- a/components/outlier-detection/vae/outlier_vae.ipynb +++ b/components/outlier-detection/vae/outlier_vae.ipynb @@ -122,6 +122,40 @@ "## Test using Kubernetes cluster on GCP or Minikube" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Run the outlier detector as a model or a transformer. If you want to run the anomaly detector as a transformer, change the SERVICE_TYPE variable from MODEL to TRANSFORMER [here](./.s2i/environment), set MODEL = False and change ```OutlierVAE.py``` to:\n", + "\n", + "```python\n", + "from CoreVAE import CoreVAE\n", + "\n", + "class OutlierVAE(CoreVAE):\n", + " \"\"\" Outlier detection using variational autoencoders (VAE).\n", + " \n", + " Parameters\n", + " ----------\n", + " threshold (float) : reconstruction error (mse) threshold used to classify outliers\n", + " reservoir_size (int) : number of observations kept in memory using reservoir sampling\n", + " \"\"\"\n", + " \n", + " def __init__(self,threshold=10,reservoir_size=50000,model_name='vae',load_path='./models/'):\n", + " \n", + " super().__init__(threshold=threshold,reservoir_size=reservoir_size,\n", + " model_name=model_name,load_path=load_path)\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MODEL = True" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -135,7 +169,7 @@ "metadata": {}, "outputs": [], "source": [ - "minikube = True" + "MINIKUBE = True" ] }, { @@ -146,7 +180,7 @@ }, "outputs": [], "source": [ - "if minikube:\n", + "if MINIKUBE:\n", " !minikube start --memory 4096 --feature-gates=CustomResourceValidation=true \\\n", " --extra-config=apiserver.Authorization.Mode=RBAC\n", "else:\n", @@ -275,7 +309,7 @@ "cell_type": 
"markdown", "metadata": {}, "source": [ - "If Minikube used: create docker image for outlier detector inside Minikube using s2i." + "If Minikube used: create docker image for outlier detector inside Minikube using s2i. Besides the transformer image and the demo specific model image, the general model image for the VAE outlier detector is also available from Docker Hub as ***seldonio/outlier-vae-model:0.1***." ] }, { @@ -286,15 +320,19 @@ }, "outputs": [], "source": [ - "if minikube:\n", - " !eval $(minikube docker-env) && s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-vae:0.1" + "if MINIKUBE & MODEL:\n", + " !eval $(minikube docker-env) && \\\n", + " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-vae-model-demo:0.1\n", + "elif MINIKUBE:\n", + " !eval $(minikube docker-env) && \\\n", + " s2i build . seldonio/seldon-core-s2i-python3:0.4 seldonio/outlier-vae-transformer:0.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Install outlier detector helm charts and set \"threshold\" and reservoir_size\" hyperparameter values." + "Install outlier detector helm charts either as a model or transformer and set *threshold* and *reservoir_size* hyperparameter values." 
] }, { @@ -303,13 +341,30 @@ "metadata": {}, "outputs": [], "source": [ - "!helm install ../../../helm-charts/seldon-od-vae \\\n", - " --set model.image.name=seldonio/outlier-vae:0.1 \\\n", - " --set model.threshold=10 \\\n", - " --set model.reservoir_size=50000 \\\n", - " --name outlier-detector --set oauth.key=oauth-key \\\n", - " --set oauth.secret=oauth-secret \\\n", - " --namespace=seldon" + "if MODEL:\n", + " !helm install ../../../helm-charts/seldon-od-model \\\n", + " --name outlier-detector \\\n", + " --namespace=seldon \\\n", + " --set model.type=vae \\\n", + " --set model.vae.image.name=seldonio/outlier-vae-model-demo:0.1 \\\n", + " --set model.vae.threshold=10 \\\n", + " --set model.vae.reservoir_size=50000 \\\n", + " --set oauth.key=oauth-key \\\n", + " --set oauth.secret=oauth-secret \\\n", + " --set replicas=1\n", + "else:\n", + " !helm install ../../../helm-charts/seldon-od-transformer \\\n", + " --name outlier-detector \\\n", + " --namespace=seldon \\\n", + " --set outlierDetection.enabled=true \\\n", + " --set outlierDetection.name=outlier-vae \\\n", + " --set outlierDetection.type=vae \\\n", + " --set outlierDetection.vae.image.name=seldonio/outlier-vae-transformer:0.1 \\\n", + " --set outlierDetection.vae.threshold=10 \\\n", + " --set outlierDetection.vae.reservoir_size=50000 \\\n", + " --set oauth.key=oauth-key \\\n", + " --set oauth.secret=oauth-secret \\\n", + " --set model.image.name=seldonio/mock_classifier:1.0" ] }, { @@ -398,13 +453,21 @@ "response = rest_request_ambassador(\"outlier-detector\",request,endpoint=\"localhost:8003\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the outlier detector is used as a transformer, the output of the anomaly detection is added as part of the metadata. If it is used as a model, we send model feedback to retrieve custom performance metrics." 
+ ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")" + "if MODEL:\n", + " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")" ] }, { @@ -480,11 +543,11 @@ "- Sample random network intrusion data with a certain outlier probability.\n", "- Get payload for the observation.\n", "- Make a prediction.\n", - "- Send the \"true\" label with the feedback.\n", + "- Send the \"true\" label with the feedback if the detector is run as a model.\n", "\n", "It is important that the prediction-feedback order is maintained. Otherwise there will be a mismatch between the predicted and \"true\" labels.\n", "\n", - "View the progress on the grafana \"Outlier Detection\" dashboard." + "View the progress on the grafana \"Outlier Detection\" dashboard. Most metrics need the outlier detector to be run as a model since they need model feedback." 
] }, { @@ -503,7 +566,8 @@ " X, labels = generate_batch(data,samples,fraction_outlier)\n", " request = get_payload(X)\n", " response = rest_request_ambassador(\"outlier-detector\",request,endpoint=\"localhost:8003\")\n", - " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")\n", + " if MODEL:\n", + " send_feedback_rest(\"outlier-detector\",request,response,0,labels,endpoint=\"localhost:8003\")\n", " time.sleep(1)" ] }, @@ -513,7 +577,7 @@ "metadata": {}, "outputs": [], "source": [ - "if minikube:\n", + "if MINIKUBE:\n", " !minikube delete" ] }, diff --git a/components/outlier-detection/vae/outlier_vae_doc.ipynb b/components/outlier-detection/vae/outlier_vae_doc.ipynb deleted file mode 100644 index a8b0eee4e5..0000000000 --- a/components/outlier-detection/vae/outlier_vae_doc.ipynb +++ /dev/null @@ -1,442 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Variational Auto-Encoder Outlier (VAE) Algorithm Documentation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The aim of this document is to explain the Variational Auto-Encoder algorithm in Seldon's outlier detection framework.\n", - "\n", - "First, we provide a high level overview of the algorithm and the use case, then we will give a detailed explanation of the implementation." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Outlier detection has many applications, ranging from preventing credit card fraud to detecting computer network intrusions. The available data is typically unlabeled and detection needs to be done in real-time. 
The outlier detector can be used as a standalone algorithm, or to detect anomalies in the input data of another predictive model.\n", - "\n", - "The VAE outlier detection algorithm predicts whether the input features are an outlier or not, dependent on a threshold level set by the user. The algorithm needs to be pretrained first on a batch of -preferably- inliers.\n", - "\n", - "As observations arrive, the algorithm will:\n", - "- scale (standardize or minmax) the input features\n", - "- first encode, and then decode the input data in an attempt to reconstruct the initial observations\n", - "- compute a reconstruction error between the output of the decoder and the input data\n", - "- predict that the observation is an outlier if the error is larger than the threshold level" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Why Variational Auto-Encoders?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "An Auto-Encoder is an algorithm that consists of 2 main building blocks: an encoder and a decoder. The encoder tries to find a compressed representation of the input data. The compressed data is then fed into the decoder, which aims to replicate the input data. Both the encoder and decoder are typically implemented with neural networks. The loss function to be minimized with stochastic gradient descent is a distance function between the input data and output of the decoder, and is called the reconstruction error.\n", - "\n", - "If we train the Auto-Encoder with inliers, it will be able to replicate new inlier data well with a low reconstruction error. However, if outliers are fed to the Auto-Encoder, the reconstruction error becomes large and we can classify the observation as an anomaly.\n", - "\n", - "A Variational Auto-Encoder adds constraints to the encoded representations of the input. The encodings are parameters of a probability distribution modeling the data. 
The decoder can then generate new data by sampling from the learned distribution." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Implementation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Building the VAE model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The VAE model definition in model.py takes 4 arguments that define the architecture:\n", - "- the number of features in the input\n", - "- the number of hidden layers used in the encoder and decoder\n", - "- the dimension of the latent variable\n", - "- the dimensions of each hidden layer" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - "def model(n_features, hidden_layers=1, latent_dim=2, hidden_dim=[], \n", - " output_activation='sigmoid', learning_rate=0.001):\n", - " \"\"\" Build VAE model. \n", - " \n", - " Arguments:\n", - " - n_features (int): number of features in the data\n", - " - hidden_layers (int): number of hidden layers used in encoder/decoder\n", - " - latent_dim (int): dimension of latent variable\n", - " - hidden_dim (list): list with dimension of each hidden layer\n", - " - output_activation (str): activation type for last dense layer in the decoder\n", - " - learning_rate (float): learning rate used during training\n", - " \"\"\"\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, the input data feeds in the encoder and is compressed by mapping it on the latent space which defines the probability distribution of the encodings:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - " # encoder\n", - " inputs = Input(shape=(n_features,), name='encoder_input')\n", - " # define hidden layers\n", - " enc_hidden = Dense(hidden_dim[0], activation='relu', name='encoder_hidden_0')(inputs)\n", - " i = 1\n", - " while i < hidden_layers:\n", - " enc_hidden = 
Dense(hidden_dim[i],activation='relu',name='encoder_hidden_'+str(i))(enc_hidden)\n", - " i+=1\n", - " \n", - " z_mean = Dense(latent_dim, name='z_mean')(enc_hidden)\n", - " z_log_var = Dense(latent_dim, name='z_log_var')(enc_hidden)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can then sample data from the latent space." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - "def sampling(args):\n", - " \"\"\" Reparameterization trick by sampling from an isotropic unit Gaussian.\n", - " \n", - " Arguments:\n", - " - args (tensor): mean and log of variance of Q(z|X)\n", - " \n", - " Returns:\n", - " - z (tensor): sampled latent vector\n", - " \"\"\"\n", - " z_mean, z_log_var = args\n", - " batch = K.shape(z_mean)[0]\n", - " dim = K.int_shape(z_mean)[1]\n", - " epsilon = K.random_normal(shape=(batch, dim)) # by default, random_normal has mean=0 and std=1.0\n", - " return z_mean + K.exp(0.5 * z_log_var) * epsilon # mean + stdev * eps\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - " # reparametrization trick to sample z\n", - " z = Lambda(sampling, output_shape=(latent_dim,), name='z')([z_mean, z_log_var])\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The sampled data passes through the decoder which aims to reconstruct the input." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - " # decoder\n", - " latent_inputs = Input(shape=(latent_dim,), name='z_sampling')\n", - " # define hidden layers\n", - " dec_hidden = Dense(hidden_dim[-1], activation='relu', name='decoder_hidden_0')(latent_inputs)\n", - "\n", - " i = 2\n", - " while i < hidden_layers+1:\n", - " dec_hidden = Dense(hidden_dim[-i],activation='relu',name='decoder_hidden_'+str(i-1))(dec_hidden)\n", - " i+=1\n", - "\n", - " outputs = Dense(n_features, activation=output_activation, name='decoder_output')(dec_hidden)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The loss function is the sum of the reconstruction error and the KL-divergence. While the reconstruction error quantifies how well we can recreate the input data, the KL-divergence measures how close the latent representation is to the unit Gaussian distribution. This trade-off is important because we want our encodings to parameterize a probability distribution from which we can sample data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - " # define VAE loss, optimizer and compile model\n", - " reconstruction_loss = mse(inputs, outputs)\n", - " reconstruction_loss *= n_features\n", - " kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)\n", - " kl_loss = K.sum(kl_loss, axis=-1)\n", - " kl_loss *= -0.5\n", - " vae_loss = K.mean(reconstruction_loss + kl_loss)\n", - " vae.add_loss(vae_loss)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. 
Training the model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The VAE model can be trained on a batch of inliers by running the train.py script with the desired hyperparameters:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - "!python train.py \\\n", - "--dataset 'kddcup99' \\\n", - "--samples 50000 \\\n", - "--keep_cols \"$cols_str\" \\\n", - "--hidden_layers 1 \\\n", - "--latent_dim 2 \\\n", - "--hidden_dim 9 \\\n", - "--output_activation 'sigmoid' \\\n", - "--clip 999999 \\\n", - "--standardized \\\n", - "--epochs 10 \\\n", - "--batch_size 32 \\\n", - "--learning_rate 0.001 \\\n", - "--print_progress \\\n", - "--model_name 'vae' \\\n", - "--save \\\n", - "--save_path './models/'\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The model weights and hyperparameters are saved in the folder specified by \"save_path\"." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. Making predictions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In order to make predictions, which can then be served by Seldon Core, the pre-trained model weights and hyperparameters are loaded when defining an OutlierVAE object. The \"threshold\" argument defines above which reconstruction error a sample is classified as an outlier. The threshold is a key hyperparameter and needs to be picked carefully for each application." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - "class OutlierVAE(object):\n", - " \"\"\" Outlier detection using variational autoencoders (VAE).\n", - " \n", - " Arguments:\n", - " - threshold: (float): reconstruction error (mse) threshold used to classify outliers\n", - " - reservoir_size (int): number of observations kept in memory using reservoir sampling used for mean and stdev\n", - " \n", - " Functions:\n", - " - reservoir_sampling: applies reservoir sampling to incoming data\n", - " - predict: detect and return outliers\n", - " - send_feedback: add target labels as part of the feedback loop\n", - " - metrics: return custom metrics\n", - " \"\"\"\n", - " def __init__(self,threshold=10,reservoir_size=50000,model_name='vae',load_path='./models/'):\n", - " \n", - " self.threshold = threshold\n", - " self.reservoir_size = reservoir_size\n", - " self.batch = []\n", - " self.N = 0 # total sample count up until now for reservoir sampling\n", - " \n", - " # load model architecture parameters\n", - " with open(load_path + model_name + '.pickle', 'rb') as f:\n", - " n_features, hidden_layers, latent_dim, hidden_dim, output_activation = pickle.load(f)\n", - " \n", - " # instantiate model\n", - " self.vae = model(n_features,hidden_layers=hidden_layers,latent_dim=latent_dim,\n", - " hidden_dim=hidden_dim,output_activation=output_activation)\n", - " self.vae.load_weights(load_path + model_name + '_weights.h5') # load pretrained model weights\n", - " self.vae._make_predict_function()\n", - " \n", - " # load data preprocessing info\n", - " with open(load_path + 'preprocess_' + model_name + '.pickle', 'rb') as f:\n", - " preprocess = pickle.load(f)\n", - " self.preprocess, self.clip, self.axis = preprocess[:3]\n", - " if self.preprocess=='minmax':\n", - " self.xmin, self.xmax = preprocess[3:5]\n", - " self.min, self.max = preprocess[5:]\n", - " elif self.preprocess=='standardized':\n", - " self.mu, self.sigma = preprocess[3:]\n", - 
"```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The predict method does the actual outlier detection." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - " def predict(self,X,feature_names):\n", - " \"\"\" Detect outliers from mse using the threshold. \n", - " \n", - " Arguments:\n", - " - X: input data\n", - " - feature_names\n", - " \"\"\"\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, the observations are clipped. If the number of observations fed to the outlier detector up until now is at least equal to the defined reservoir size, the feature-wise scaling parameters are updated using the observations in the reservoir. The reservoir is updated each observation using reservoir sampling. The input data is then scaled using either standardization or minmax scaling." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - " # clip data per feature\n", - " X = np.clip(X,[-c for c in self.clip],self.clip)\n", - " \n", - " if self.N < self.reservoir_size:\n", - " update_stand = False\n", - " else:\n", - " update_stand = True\n", - " \n", - " self.reservoir_sampling(X,update_stand=update_stand)\n", - " \n", - " # apply scaling\n", - " if self.preprocess=='minmax':\n", - " X_scaled = ((X - self.xmin) / (self.xmax - self.xmin)) * (self.max - self.min) + self.min\n", - " elif self.preprocess=='standardized':\n", - " X_scaled = (X - self.mu) / (self.sigma + 1e-10)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We then make multiple predictions for an observation by sampling N times from the latent space. The mean squared error between the input data and output of the decoder is averaged across the N samples. If this value is above the threshold, an outlier is predicted." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "``` python\n", - " # sample latent variables and calculate reconstruction errors\n", - " N = 10\n", - " mse = np.zeros([X.shape[0],N])\n", - " for i in range(N):\n", - " preds = self.vae.predict(X_scaled)\n", - " mse[:,i] = np.mean(np.power(X_scaled - preds, 2), axis=1)\n", - " self.mse = np.mean(mse, axis=1)\n", - " \n", - " # make prediction\n", - " self.prediction = np.array([1 if e > self.threshold else 0 for e in self.mse]).astype(int)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## References" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. ICLR 2014.\n", - "- https://arxiv.org/pdf/1312.6114.pdf\n", - "\n", - "Francois Chollet. Building Autoencoders in Keras.\n", - "- https://blog.keras.io/building-autoencoders-in-keras.html" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.6" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/helm-charts/Makefile b/helm-charts/Makefile index ca538771aa..96abcf1cc3 100644 --- a/helm-charts/Makefile +++ b/helm-charts/Makefile @@ -1,4 +1,4 @@ -CHARTS=seldon-core-crd seldon-core seldon-core-analytics seldon-core-kafka seldon-core-loadtesting seldon-single-model seldon-abtest seldon-mab seldon-od-vae seldon-od-if seldon-od-s2s-lstm seldon-od-md +CHARTS=seldon-core-crd seldon-core seldon-core-analytics seldon-core-kafka seldon-core-loadtesting seldon-single-model seldon-abtest seldon-mab seldon-od-model seldon-od-transformer build_all: diff --git a/helm-charts/README.md 
b/helm-charts/README.md index fb06e579d0..920e8d08cc 100644 --- a/helm-charts/README.md +++ b/helm-charts/README.md @@ -26,8 +26,8 @@ A set of charts to provide example templates for creating particular inference g * Serve a multi-armed bandit between two models. * seldon-openvino * Deploy a single model with Intel OpenVINO model server. - * seldon-od-if, seldon-od-vae, seldon-od-s2s-lstm, seldon-od-md - * Serve one of the following Outlier Detector components: + * seldon-od-model and seldon-od-transformer + * Serve one of the following Outlier Detector components as either models or transformers: * [Isolation Forest](../components/outlier-detection/isolation-forest) * [Variational Auto-Encoder](../components/outlier-detection/vae) * [Sequence-to-Sequence-LSTM](../components/outlier-detection/seq2seq-lstm) diff --git a/helm-charts/seldon-od-if/Chart.yaml b/helm-charts/seldon-od-if/Chart.yaml deleted file mode 100644 index cf10db4c12..0000000000 --- a/helm-charts/seldon-od-if/Chart.yaml +++ /dev/null @@ -1,9 +0,0 @@ -apiVersion: v1 -description: Seldon Core isolation forest outlier detection model template -keywords: -- kubernetes -- machine-learning -name: seldon-od-if -sources: -- https://github.com/SeldonIO/seldon-core -version: 0.1 diff --git a/helm-charts/seldon-od-if/README.md b/helm-charts/seldon-od-if/README.md deleted file mode 100644 index 581f500ffa..0000000000 --- a/helm-charts/seldon-od-if/README.md +++ /dev/null @@ -1,4 +0,0 @@ -# Single Model with Outlier Detector - -This chart provides a Seldon Deployment with an outlier detector using Isolation Forests. 
- diff --git a/helm-charts/seldon-od-if/templates/model.json b/helm-charts/seldon-od-if/templates/model.json deleted file mode 100644 index 64f91d93e1..0000000000 --- a/helm-charts/seldon-od-if/templates/model.json +++ /dev/null @@ -1,59 +0,0 @@ -{ - "apiVersion": "machinelearning.seldon.io/v1alpha2", - "kind": "SeldonDeployment", - "metadata": { - "labels": { - "app": "seldon" - }, - "name": "{{ .Release.Name }}" - }, - "spec": { - "name": "{{ .Release.Name }}", -{{- if .Values.oauth.key }} - "oauth_key": "{{ .Values.oauth.key }}", - "oauth_secret": "{{ .Values.oauth.secret }}", -{{- end }} - "predictors": [ - { - "componentSpecs": [{ - "spec": { - "containers": [ - { - "image": "{{ .Values.model.image.name }}", - "imagePullPolicy": "IfNotPresent", - "name": "{{ .Values.model.name }}", - "resources": { - "requests": { - "memory": "1Mi" - } - } - } - ], - "terminationGracePeriodSeconds": 1 - }} - ], - "graph": - { - "children": [], - "name": "{{ .Values.model.name }}", - "endpoint": { - "type" : "REST" - }, - "type": "MODEL", - "parameters": [ - { - "name": "threshold", - "value": "{{ .Values.model.threshold }}", - "type": "FLOAT" - } - ], - }, - "name": "{{ .Release.Name }}", - "replicas": {{ .Values.replicas }}, - "labels": { - "version" : "v1" - } - } - ] - } -} diff --git a/helm-charts/seldon-od-if/values.yaml b/helm-charts/seldon-od-if/values.yaml deleted file mode 100644 index 148b43de57..0000000000 --- a/helm-charts/seldon-od-if/values.yaml +++ /dev/null @@ -1,11 +0,0 @@ -name: outlier-detector-if -model: - image: - name: seldonio/outlier-if:0.1 - name: outlier-if - threshold: 0 -replicas: 1 -# Add oauth key and secret if using the default API Oauth Gateway for ingress -oauth: - key: - secret: diff --git a/helm-charts/seldon-od-md/README.md b/helm-charts/seldon-od-md/README.md deleted file mode 100644 index 26bde0b2b8..0000000000 --- a/helm-charts/seldon-od-md/README.md +++ /dev/null @@ -1,4 +0,0 @@ -# Single Model with Outlier Detector - -This chart 
provides a Seldon Deployment with an outlier detector using the Mahalanobis distance. - diff --git a/helm-charts/seldon-od-md/values.yaml b/helm-charts/seldon-od-md/values.yaml deleted file mode 100644 index 8db89372b3..0000000000 --- a/helm-charts/seldon-od-md/values.yaml +++ /dev/null @@ -1,15 +0,0 @@ -name: outlier-detector-md -model: - image: - name: seldonio/outlier-mahalanobis:0.1 - name: outlier-mahalanobis - threshold: 25 - n_components: 3 - n_stdev: 3 - start_clip: 50 - max_n: -1 -replicas: 1 -# Add oauth key and secret if using the default API Oauth Gateway for ingress -oauth: - key: - secret: diff --git a/helm-charts/seldon-od-md/Chart.yaml b/helm-charts/seldon-od-model/Chart.yaml similarity index 89% rename from helm-charts/seldon-od-md/Chart.yaml rename to helm-charts/seldon-od-model/Chart.yaml index f481977fa8..3b73d1495f 100644 --- a/helm-charts/seldon-od-md/Chart.yaml +++ b/helm-charts/seldon-od-model/Chart.yaml @@ -3,7 +3,7 @@ description: Seldon Core outlier detection model template keywords: - kubernetes - machine-learning -name: seldon-od-md +name: seldon-od-model sources: - https://github.com/SeldonIO/seldon-core version: 0.1 diff --git a/helm-charts/seldon-od-model/README.md b/helm-charts/seldon-od-model/README.md new file mode 100644 index 0000000000..6fe5e0961f --- /dev/null +++ b/helm-charts/seldon-od-model/README.md @@ -0,0 +1,10 @@ +# Outlier Detector as Single Model + +This chart provides a Seldon Deployment with an outlier detector used as a single model. 
+ +Available outlier detectors are: +- [Sequence-to-Sequence LSTM](../../components/outlier-detection/seq2seq-lstm) +- [Variational Auto-Encoder](../../components/outlier-detection/vae) +- [Isolation Forest](../../components/outlier-detection/isolation-forest) +- [Mahalanobis Distance](../../components/outlier-detection/mahalanobis) + diff --git a/helm-charts/seldon-od-md/templates/model.json b/helm-charts/seldon-od-model/templates/model.json similarity index 62% rename from helm-charts/seldon-od-md/templates/model.json rename to helm-charts/seldon-od-model/templates/model.json index 272b789c00..6ae2e58b30 100644 --- a/helm-charts/seldon-od-md/templates/model.json +++ b/helm-charts/seldon-od-model/templates/model.json @@ -1,3 +1,13 @@ +{{- if eq .Values.model.type "vae"}} +{{- $dummy := set . "detector" .Values.model.vae -}} +{{- else if eq .Values.model.type "mahalanobis"}} +{{- $dummy := set . "detector" .Values.model.mahalanobis -}} +{{- else if eq .Values.model.type "seq2seq"}} +{{- $dummy := set . "detector" .Values.model.seq2seq -}} +{{- else if eq .Values.model.type "isolationforest"}} +{{- $dummy := set . 
"detector" .Values.model.isolationforest -}} +{{- end }} +{{- $type := .Values.model.parameterTypes -}} { "apiVersion": "machinelearning.seldon.io/v1alpha2", "kind": "SeldonDeployment", @@ -19,7 +29,7 @@ "spec": { "containers": [ { - "image": "{{ .Values.model.image.name }}", + "image": {{ .detector.image.name | quote }}, "imagePullPolicy": "IfNotPresent", "name": "{{ .Values.model.name }}", "resources": { @@ -32,7 +42,7 @@ "terminationGracePeriodSeconds": 1 }} ], - "graph": + "graph": { "children": [], "name": "{{ .Values.model.name }}", @@ -40,35 +50,17 @@ "type" : "REST" }, "type": "MODEL", - "parameters": [ + "parameters": [ +{{- $lastKey := last (keys (unset .detector "image") | sortAlpha) -}} +{{- range $key, $val := .detector }} { - "name": "threshold", - "value": "{{ .Values.model.threshold }}", - "type": "FLOAT" - }, - { - "name": "n_components", - "value": "{{ .Values.model.n_components }}", - "type": "INT" - }, - { - "name": "n_stdev", - "value": "{{ .Values.model.n_stdev }}", - "type": "FLOAT" - }, - { - "name": "start_clip", - "value": "{{ .Values.model.start_clip }}", - "type": "INT" - }, - { - "name": "max_n", - "value": "{{ .Values.model.max_n }}", - "type": "INT" - } - - ], - }, + "name": {{ $key | quote }}, + "value": {{ $val | quote }}, + "type": {{ index $type $key | quote }} + }{{- if ne $key $lastKey -}}, {{ end }} +{{- end }} + ] + }, "name": "{{ .Release.Name }}", "replicas": {{ .Values.replicas }}, "labels": { diff --git a/helm-charts/seldon-od-model/values.yaml b/helm-charts/seldon-od-model/values.yaml new file mode 100644 index 0000000000..1f59307928 --- /dev/null +++ b/helm-charts/seldon-od-model/values.yaml @@ -0,0 +1,46 @@ +name: seldon-od-model +model: + name: outlier-detector + type: vae + vae: + threshold: 10 + reservoir_size: 50000 + model_name: vae + load_path: ./models/ + image: + name: seldonio/outlier-vae-model:0.1 + mahalanobis: + threshold: 25 + n_components: 3 + n_stdev: 3 + start_clip: 50 + max_n: -1 + image: + name: 
seldonio/outlier-mahalanobis-model:0.1 + seq2seq: + threshold: 0.003 + reservoir_size: 50000 + model_name: seq2seq + load_path: ./models/ + image: + name: seldonio/outlier-s2s-lstm-model:0.1 + isolationforest: + threshold: 0 + model_name: if + load_path: ./models/ + image: + name: seldonio/outlier-if-model:0.1 + parameterTypes: + threshold: FLOAT + reservoir_size: INT + model_name: STRING + load_path: STRING + n_components: INT + n_stdev: FLOAT + start_clip: INT + max_n: INT +replicas: 1 +# Add oauth key and secret if using the default API Oauth Gateway for ingress +oauth: + key: + secret: diff --git a/helm-charts/seldon-od-s2s-lstm/README.md b/helm-charts/seldon-od-s2s-lstm/README.md deleted file mode 100644 index 3275118e44..0000000000 --- a/helm-charts/seldon-od-s2s-lstm/README.md +++ /dev/null @@ -1,4 +0,0 @@ -# Single Model with Outlier Detector - -This chart provides a Seldon Deployment with an outlier detector using a Sequence-to-Sequence LSTM model. - diff --git a/helm-charts/seldon-od-s2s-lstm/templates/model.json b/helm-charts/seldon-od-s2s-lstm/templates/model.json deleted file mode 100644 index ac2dc8b743..0000000000 --- a/helm-charts/seldon-od-s2s-lstm/templates/model.json +++ /dev/null @@ -1,64 +0,0 @@ -{ - "apiVersion": "machinelearning.seldon.io/v1alpha2", - "kind": "SeldonDeployment", - "metadata": { - "labels": { - "app": "seldon" - }, - "name": "{{ .Release.Name }}" - }, - "spec": { - "name": "{{ .Release.Name }}", -{{- if .Values.oauth.key }} - "oauth_key": "{{ .Values.oauth.key }}", - "oauth_secret": "{{ .Values.oauth.secret }}", -{{- end }} - "predictors": [ - { - "componentSpecs": [{ - "spec": { - "containers": [ - { - "image": "{{ .Values.model.image.name }}", - "imagePullPolicy": "IfNotPresent", - "name": "{{ .Values.model.name }}", - "resources": { - "requests": { - "memory": "1Mi" - } - } - } - ], - "terminationGracePeriodSeconds": 1 - }} - ], - "graph": - { - "children": [], - "name": "{{ .Values.model.name }}", - "endpoint": { - "type" 
: "REST" - }, - "type": "MODEL", - "parameters": [ - { - "name": "threshold", - "value": "{{ .Values.model.threshold }}", - "type": "FLOAT" - }, - { - "name": "reservoir_size", - "value": "{{ .Values.model.reservoir_size }}", - "type": "INT" - } - ], - }, - "name": "{{ .Release.Name }}", - "replicas": {{ .Values.replicas }}, - "labels": { - "version" : "v1" - } - } - ] - } -} diff --git a/helm-charts/seldon-od-s2s-lstm/values.yaml b/helm-charts/seldon-od-s2s-lstm/values.yaml deleted file mode 100644 index a70adb0e3c..0000000000 --- a/helm-charts/seldon-od-s2s-lstm/values.yaml +++ /dev/null @@ -1,12 +0,0 @@ -name: outlier-detector-s2s-lstm -model: - image: - name: seldonio/outlier-s2s-lstm:0.1 - name: outlier-s2s-lstm - threshold: 0.003 - reservoir_size: 50000 -replicas: 1 -# Add oauth key and secret if using the default API Oauth Gateway for ingress -oauth: - key: - secret: diff --git a/helm-charts/seldon-od-s2s-lstm/Chart.yaml b/helm-charts/seldon-od-transformer/Chart.yaml similarity index 56% rename from helm-charts/seldon-od-s2s-lstm/Chart.yaml rename to helm-charts/seldon-od-transformer/Chart.yaml index 2e2d8bd071..14aa14f34a 100644 --- a/helm-charts/seldon-od-s2s-lstm/Chart.yaml +++ b/helm-charts/seldon-od-transformer/Chart.yaml @@ -1,9 +1,9 @@ apiVersion: v1 -description: Seldon Core outlier detection model template +description: Seldon Core outlier detection transformer template keywords: - kubernetes - machine-learning -name: seldon-od-s2s-lstm +name: seldon-od-transformer sources: - https://github.com/SeldonIO/seldon-core version: 0.1 diff --git a/helm-charts/seldon-od-transformer/README.md b/helm-charts/seldon-od-transformer/README.md new file mode 100644 index 0000000000..f10ed1a7cb --- /dev/null +++ b/helm-charts/seldon-od-transformer/README.md @@ -0,0 +1,10 @@ +# Outlier Detector as Transformer + +This chart provides a Seldon Deployment with an outlier detector used as a transformer with a single model. 
+ +Available outlier detectors are: +- [Sequence-to-Sequence LSTM](../../components/outlier-detection/seq2seq-lstm) +- [Variational Auto-Encoder](../../components/outlier-detection/vae) +- [Isolation Forest](../../components/outlier-detection/isolation-forest) +- [Mahalanobis Distance](../../components/outlier-detection/mahalanobis) + diff --git a/helm-charts/seldon-od-transformer/templates/model.json b/helm-charts/seldon-od-transformer/templates/model.json new file mode 100644 index 0000000000..70fca59268 --- /dev/null +++ b/helm-charts/seldon-od-transformer/templates/model.json @@ -0,0 +1,104 @@ +{{- if eq .Values.outlierDetection.type "vae"}} +{{- $dummy := set . "detector" .Values.outlierDetection.vae -}} +{{- else if eq .Values.outlierDetection.type "mahalanobis"}} +{{- $dummy := set . "detector" .Values.outlierDetection.mahalanobis -}} +{{- else if eq .Values.outlierDetection.type "seq2seq"}} +{{- $dummy := set . "detector" .Values.outlierDetection.seq2seq -}} +{{- else if eq .Values.outlierDetection.type "isolationforest"}} +{{- $dummy := set . 
"detector" .Values.outlierDetection.isolationforest -}} +{{- end }} +{{- $type := .Values.outlierDetection.parameterTypes -}} +{ + "apiVersion": "machinelearning.seldon.io/v1alpha2", + "kind": "SeldonDeployment", + "metadata": { + "labels": { + "app": "seldon" + }, + "name": "{{ .Release.Name }}" + }, + "spec": { + "name": "{{ .Release.Name }}", +{{- if .Values.oauth.key }} + "oauth_key": "{{ .Values.oauth.key }}", + "oauth_secret": "{{ .Values.oauth.secret }}", +{{- end }} + "predictors": [ + { + "componentSpecs": [{ + "spec": { + "containers": [ + { + "image": "{{ .Values.model.image.name }}", + "imagePullPolicy": "IfNotPresent", + "name": "{{ .Values.model.name }}", + "resources": { + "requests": { + "memory": "1Mi" + } + } + } + ], + "terminationGracePeriodSeconds": 1 + }} +{{- if .Values.outlierDetection.enabled }} + , + { + "spec": { + "containers": [ + { + "image": {{ .detector.image.name | quote }}, + "imagePullPolicy": "IfNotPresent", + "name": "{{ .Values.outlierDetection.name }}", + "resources": { + "requests": { + "memory": "1Mi" + } + } + } + ], + "terminationGracePeriodSeconds": 20 + } + } +{{- end }} + ], + "graph": +{{- if .Values.outlierDetection.enabled }} + { + "name": "{{ .Values.outlierDetection.name }}", + "type": "TRANSFORMER", + "parameters": [ +{{- $lastKey := last (keys (unset .detector "image") | sortAlpha) -}} +{{- range $key, $val := .detector }} + { + "name": {{ $key | quote }}, + "value": {{ $val | quote }}, + "type": {{ index $type $key | quote }} + }{{- if ne $key $lastKey -}}, {{ end }} +{{- end }} + ], + "endpoint": { + "type": "REST" + }, + "children": [ +{{- end }} + { + "children": [], + "name": "{{ .Values.model.name }}", + "endpoint": { + "type" : "REST" + }, + "type": "MODEL" + } +{{- if .Values.outlierDetection.enabled }} + ]} +{{- end }}, + "name": "{{ .Release.Name }}", + "replicas": {{ .Values.replicas }}, + "labels": { + "version" : "v1" + } + } + ] + } +} diff --git a/helm-charts/seldon-od-transformer/values.yaml 
b/helm-charts/seldon-od-transformer/values.yaml new file mode 100644 index 0000000000..f998a4925c --- /dev/null +++ b/helm-charts/seldon-od-transformer/values.yaml @@ -0,0 +1,51 @@ +name: seldon-od-transformer +model: + image: + name: seldonio/mock_classifier:1.0 + name: classifier +outlierDetection: + enabled: true + name: outlier-detector + type: vae + vae: + threshold: 10 + reservoir_size: 50000 + model_name: vae + load_path: ./models/ + image: + name: seldonio/outlier-vae-tranformer:0.1 + mahalanobis: + threshold: 25 + n_components: 3 + n_stdev: 3 + start_clip: 50 + max_n: -1 + image: + name: seldonio/outlier-mahalanobis-tranformer:0.1 + seq2seq: + threshold: 0.003 + reservoir_size: 50000 + model_name: seq2seq + load_path: ./models/ + image: + name: seldonio/outlier-s2s-lstm-tranformer:0.1 + isolationforest: + threshold: 0 + model_name: if + load_path: ./models/ + image: + name: seldonio/outlier-if-tranformer:0.1 + parameterTypes: + threshold: FLOAT + reservoir_size: INT + model_name: STRING + load_path: STRING + n_components: INT + n_stdev: FLOAT + start_clip: INT + max_n: INT +replicas: 1 +# Add oauth key and secret if using the default API Oauth Gateway for ingress +oauth: + key: + secret: diff --git a/helm-charts/seldon-od-vae/Chart.yaml b/helm-charts/seldon-od-vae/Chart.yaml deleted file mode 100644 index 96cda7b15a..0000000000 --- a/helm-charts/seldon-od-vae/Chart.yaml +++ /dev/null @@ -1,9 +0,0 @@ -apiVersion: v1 -description: Seldon Core outlier detection model template -keywords: -- kubernetes -- machine-learning -name: seldon-od-vae -sources: -- https://github.com/SeldonIO/seldon-core -version: 0.1 diff --git a/helm-charts/seldon-od-vae/README.md b/helm-charts/seldon-od-vae/README.md deleted file mode 100644 index a1c093822e..0000000000 --- a/helm-charts/seldon-od-vae/README.md +++ /dev/null @@ -1,4 +0,0 @@ -# Single Model with Outlier Detector - -This chart provides a Seldon Deployment with an outlier detector using a Variational Auto-Encoder (VAE). 
- diff --git a/helm-charts/seldon-od-vae/templates/model.json b/helm-charts/seldon-od-vae/templates/model.json deleted file mode 100644 index ac2dc8b743..0000000000 --- a/helm-charts/seldon-od-vae/templates/model.json +++ /dev/null @@ -1,64 +0,0 @@ -{ - "apiVersion": "machinelearning.seldon.io/v1alpha2", - "kind": "SeldonDeployment", - "metadata": { - "labels": { - "app": "seldon" - }, - "name": "{{ .Release.Name }}" - }, - "spec": { - "name": "{{ .Release.Name }}", -{{- if .Values.oauth.key }} - "oauth_key": "{{ .Values.oauth.key }}", - "oauth_secret": "{{ .Values.oauth.secret }}", -{{- end }} - "predictors": [ - { - "componentSpecs": [{ - "spec": { - "containers": [ - { - "image": "{{ .Values.model.image.name }}", - "imagePullPolicy": "IfNotPresent", - "name": "{{ .Values.model.name }}", - "resources": { - "requests": { - "memory": "1Mi" - } - } - } - ], - "terminationGracePeriodSeconds": 1 - }} - ], - "graph": - { - "children": [], - "name": "{{ .Values.model.name }}", - "endpoint": { - "type" : "REST" - }, - "type": "MODEL", - "parameters": [ - { - "name": "threshold", - "value": "{{ .Values.model.threshold }}", - "type": "FLOAT" - }, - { - "name": "reservoir_size", - "value": "{{ .Values.model.reservoir_size }}", - "type": "INT" - } - ], - }, - "name": "{{ .Release.Name }}", - "replicas": {{ .Values.replicas }}, - "labels": { - "version" : "v1" - } - } - ] - } -} diff --git a/helm-charts/seldon-od-vae/values.yaml b/helm-charts/seldon-od-vae/values.yaml deleted file mode 100644 index df6cde779b..0000000000 --- a/helm-charts/seldon-od-vae/values.yaml +++ /dev/null @@ -1,12 +0,0 @@ -name: outlier-detector-vae -model: - image: - name: seldonio/outlier-vae:0.1 - name: outlier-vae - threshold: 10 - reservoir_size: 50000 -replicas: 1 -# Add oauth key and secret if using the default API Oauth Gateway for ingress -oauth: - key: - secret: diff --git a/readme.md b/readme.md index b6021afb6e..38936e478a 100644 --- a/readme.md +++ b/readme.md @@ -86,7 +86,7 @@ Seldon 
allows you to build up runtime inference graphs that provide powerful opt * **Multi-Armed Bandits** * [Epsilon-greedy multi-armed bandits for real time optimization of models](components/routers/epsilon-greedy) ([GCP example](https://github.com/SeldonIO/seldon-core/blob/master/notebooks/epsilon_greedy_gcp.ipynb), [Kubeflow example](https://github.com/kubeflow/example-seldon)) * [Thompson sampling multi-armed bandit](components/routers/thompson-sampling) ([Credit card default case study](components/routers/case_study/credit_card_default.ipynb)) - * **Outlier Detection** + * [**Outlier Detection**](components/outlier-detection/README.md) * [Variational Auto-Encoder (VAE) Outlier Detector](https://github.com/SeldonIO/seldon-core/tree/master/components/outlier-detection/vae) * [Sequence-to-Sequence LSTM (seq2seq-LSTM) Outlier Detector](https://github.com/SeldonIO/seldon-core/tree/master/components/outlier-detection/seq2seq-lstm) * [Isolation Forest Outlier Detector](https://github.com/SeldonIO/seldon-core/tree/master/components/outlier-detection/isolation-forest)