
add root cause localization transformer #4925

Merged: 51 commits merged on May 11, 2020

Commits
d5ee205
add root cause localization transformer
suxi-ms Mar 10, 2020
f727a79
add test cases
suxi-ms Mar 16, 2020
92de1dc
revert sln changes
suxi-ms Mar 18, 2020
798289c
add evaluation
suxi-ms Mar 18, 2020
f2e128d
temp save for internal review
suxi-ms Mar 20, 2020
51569e3
rename function
suxi-ms Mar 20, 2020
59c6e89
temp save bottom up points for switch desktop
suxi-ms Mar 22, 2020
29216e0
update from laptop
suxi-ms Mar 22, 2020
69da330
save for add test
suxi-ms Mar 23, 2020
e1c5432
add root cause localization algorithm
suxi-ms Mar 23, 2020
3a1d1c5
add root cause localization algorithm
suxi-ms Mar 23, 2020
8f97602
print score, path and directions in sample
suxi-ms Mar 23, 2020
48123f4
merge with master
suxi-ms Mar 23, 2020
c47302f
extract root cause analyzer
suxi-ms Mar 23, 2020
b07ad28
refine code
suxi-ms Mar 23, 2020
c729877
merge with master
suxi-ms Mar 24, 2020
ebbdb0d
update for algorithm
suxi-ms Mar 26, 2020
0d43b0d
add evaluatin
suxi-ms Mar 26, 2020
5778eed
some refine for code
suxi-ms Mar 26, 2020
c9ed044
fix some typo
suxi-ms Mar 27, 2020
e440f25
remove unused code
suxi-ms Mar 27, 2020
feba6f4
reformat code
suxi-ms Mar 27, 2020
686831c
updates
suxi-ms Mar 27, 2020
ddc8a36
update from review
suxi-ms Mar 29, 2020
475ee8a
read double for beta
suxi-ms Apr 1, 2020
8d874ca
remove SignatureDataTransform constructor
suxi-ms Apr 1, 2020
0674ab3
update
suxi-ms Apr 1, 2020
4c5b8fb
update
suxi-ms Apr 1, 2020
08d607c
remove white space
suxi-ms Apr 2, 2020
c688233
refine internal logic
suxi-ms Apr 7, 2020
98637db
update
suxi-ms Apr 8, 2020
4ff2ed1
update
suxi-ms Apr 8, 2020
c22ad50
updated test
suxi-ms Apr 13, 2020
ea7ddbe
update score
suxi-ms Apr 15, 2020
547aef2
update variable name
suxi-ms Apr 17, 2020
8d17c3c
add some comments
suxi-ms Apr 21, 2020
66b614a
refine internal function
suxi-ms Apr 23, 2020
12e7e18
handle for infinity and nan
suxi-ms Apr 24, 2020
e213615
rename the algorithm by removing DT
suxi-ms Apr 26, 2020
30915cd
Update src/Microsoft.ML.TimeSeries/RootCauseAnalyzer.cs
suxi-ms Apr 27, 2020
fda4ec7
fix type
suxi-ms Apr 27, 2020
620ef58
add an else branch when delta is negative
suxi-ms Apr 27, 2020
ae5722f
Merge branch 'master' of https://github.com/suxi-ms/machinelearning
suxi-ms Apr 27, 2020
7f89fea
update model signature
suxi-ms Apr 28, 2020
42dcbc2
update rca interface by removing transformer
suxi-ms May 7, 2020
9893fad
add more documents
suxi-ms May 7, 2020
c831e43
update
suxi-ms May 8, 2020
16f5b33
update
suxi-ms May 9, 2020
9cd8739
update the constructor
suxi-ms May 9, 2020
f80c200
update comments
suxi-ms May 9, 2020
7c1c348
fix typo
suxi-ms May 11, 2020
49 changes: 49 additions & 0 deletions docs/api-reference/time-series-root-cause-localization.md
@@ -0,0 +1,49 @@
At Microsoft, we have developed a decision tree based root cause localization method that helps find the root causes of an anomaly incident at a specific timestamp incrementally.
> Reviewer comment (Contributor): Typo, "Microsoft." Also, it's a bit nonstandard to use present tense for "we develop." I would expect "we have developed" if the work is completed or "we maintain" if the work is ongoing.

## Multi-Dimensional Root Cause Localization
It is common for a measure to be collected with many dimensions (*e.g.*, Province, ISP) whose values are categorical (*e.g.*, Beijing or Shanghai for the dimension Province). When a measure's value deviates from its expected value, the measure is anomalous. In such cases, users want to localize the root cause dimension combinations rapidly and accurately. Multi-dimensional root cause localization is critical for troubleshooting and mitigating such cases.
> Reviewer comment (Contributor): Let's use "users" instead of "operators."

## Algorithm

The decision tree based root cause localization method is unsupervised, which means no training step is needed. It consists of the following major steps:

(1) Find the best dimension, which divides the anomalous and regular data, using a decision tree according to entropy gain and entropy gain ratio.

(2) Find the top anomaly points, which contribute most to the anomaly incident, given the selected best dimension.

### Decision Tree

The [decision tree](https://en.wikipedia.org/wiki/Decision_tree) algorithm chooses the split with the highest information gain when constructing the tree. We use it to choose the dimension that contributes most to the anomaly. Below are some concepts used in decision trees.
> Reviewer comment (Contributor): It's non-standard to omit articles here. Try something like "The Decision Tree algorithm chooses..." and "Below are some concepts used in decision trees"

#### Information Entropy

Information [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) is a measure of disorder or uncertainty. You can think of it as a measure of purity as well: the lower the value, the purer the dataset $D$.

$$Ent(D) = - \sum_{k=1}^{|y|} p_k\log_2(p_k) $$

where $p_k$ represents the probability of class $k$ in the dataset. In our case, there are only two classes: the anomalous points and the regular points. $|y|$ is the number of classes.
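As a quick illustration of the entropy formula above, here is a minimal Python sketch (not the ML.NET C# implementation; the `entropy` helper name is ours) computing $Ent(D)$ from class probabilities:

```python
import math

def entropy(probs):
    """Shannon entropy Ent(D) = -sum_k p_k * log2(p_k), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A pure dataset (all points in one class) has entropy 0; a 50/50 split of
# anomalous vs. regular points has the maximum two-class entropy of 1 bit.
print(entropy([1.0, 0.0]))  # 0.0
print(entropy([0.5, 0.5]))  # 1.0
```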

#### Information Gain
[Information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees) is a metric to measure the reduction of this disorder in our target class given additional information about it. Mathematically it can be written as:

$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v) $$

where $Ent(D^v)$ is the entropy of the set of points in $D$ for which dimension $a$ is equal to $v$, $|D|$ is the total number of points in dataset $D$, and $|D^v|$ is the number of points in $D$ for which dimension $a$ is equal to $v$.

For all aggregated dimensions, we calculate the information gain for each dimension. The greater the reduction in this uncertainty, the more information is gained about $D$ from dimension $a$.
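The information gain computation can be sketched as follows. This is an illustrative Python sketch with hypothetical data, not the library's C# code; points are modeled as (dimension dict, is-anomaly) pairs:

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) over a list of class labels (here: booleans for anomalous/regular)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(points, dim):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v), where dim plays the role of a."""
    labels = [is_anomaly for _, is_anomaly in points]
    n = len(points)
    # Partition the points by their value v of dimension `dim`.
    groups = {}
    for dims, is_anomaly in points:
        groups.setdefault(dims[dim], []).append(is_anomaly)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

# Hypothetical points: DataCenter separates anomalies perfectly, so it yields
# the full gain Ent(D); Country is constant and carries no information.
points = [
    ({"Country": "UK", "DataCenter": "DC1"}, True),
    ({"Country": "UK", "DataCenter": "DC1"}, True),
    ({"Country": "UK", "DataCenter": "DC2"}, False),
    ({"Country": "UK", "DataCenter": "DC2"}, False),
]
print(information_gain(points, "DataCenter"))  # 1.0
print(information_gain(points, "Country"))     # 0.0
```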
> Reviewer comment (Contributor): D should be in dollar signs?

#### Entropy Gain Ratio

Information gain is biased toward variables with a large number of distinct values. A modification, the [information gain ratio](https://en.wikipedia.org/wiki/Information_gain_ratio), reduces this bias.

$$Ratio(D, a) = \frac{Gain(D,a)} {IV(a)} $$

where the intrinsic value ($IV$) is the entropy of the split with respect to the dimension $a$ in focus.

$$IV(a) = -\sum_{v=1}^V\frac{|D^v|} {|D|} \log_2 \frac{|D^v|} {|D|} $$

In our strategy, we first loop over all the aggregated dimensions to find those whose entropy gain is above the mean entropy gain; then, from the filtered dimensions, we select the one with the highest entropy gain ratio as the best dimension. Dimensions for which the anomaly value count is only one are also included in the calculation.
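The selection strategy above (filter by above-mean gain, then rank by gain ratio) can be sketched as follows. This is a hedged Python illustration with made-up per-dimension statistics; the helper names and the input shape are our own, not the library's API:

```python
import math

def gain_ratio(gain, group_sizes):
    """Ratio(D, a) = Gain(D, a) / IV(a), where IV(a) is the entropy of the split sizes."""
    n = sum(group_sizes)
    iv = -sum(s / n * math.log2(s / n) for s in group_sizes if s > 0)
    return gain / iv if iv > 0 else 0.0

def best_dimension(stats):
    """stats maps dimension -> (gain, group_sizes). Keep dimensions whose gain is
    at or above the mean gain, then pick the one with the highest gain ratio."""
    mean_gain = sum(g for g, _ in stats.values()) / len(stats)
    candidates = {d: v for d, v in stats.items() if v[0] >= mean_gain}
    return max(candidates, key=lambda d: gain_ratio(*candidates[d]))

# Hypothetical statistics: DeviceType has slightly lower gain and a finer split
# (higher IV), so DataCenter wins on gain ratio.
stats = {
    "DataCenter": (1.0, [2, 2]),
    "DeviceType": (0.9, [1, 1, 2]),
    "Country": (0.0, [4]),
}
print(best_dimension(stats))  # DataCenter
```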

> [!Note]
> 1. Because the algorithm depends on the data you input, incorrect or incomplete input points will lead to unexpected results.
> 2. Currently, the algorithm localizes the root cause incrementally, which means at most one dimension and its values are detected per call. If you want to find all the dimensions that contribute to the anomaly, you can call this API recursively, updating the anomaly incident with the fixed dimension value each time.
6 changes: 6 additions & 0 deletions docs/api-reference/time-series-root-cause-surprise-score.md
@@ -0,0 +1,6 @@
Surprise score is used to capture the relative change for the root cause item.
$$S_i(m) = 0.5( p_i\log_2(\frac{2p_i} {p_i+q_i}) + q_i \log_2(\frac{2q_i}{p_i+q_i}) )$$
$$p_i(m)= \frac{F_i(m)} {F(m)} $$
$$q_i(m)= \frac{A_i(m)} {A(m)} $$
where $F_i$ is the forecasted value for root cause item $i$, $A_i$ is the actual value for root cause item $i$, $F$ is the forecasted value for the anomaly point, and $A$ is the actual value for the anomaly point.
For details of the surprise score, refer to [this document](https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-bhagwan.pdf).
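The surprise score formula above can be illustrated with a small Python sketch (our own illustration with made-up numbers, not the library's C# implementation):

```python
import math

def surprise_score(forecast_i, actual_i, forecast_total, actual_total):
    """S_i = 0.5 * (p * log2(2p/(p+q)) + q * log2(2q/(p+q))),
    with p = F_i / F (forecasted share) and q = A_i / A (actual share).
    Zero-share terms are skipped since p * log2(...) -> 0 as p -> 0."""
    p = forecast_i / forecast_total
    q = actual_i / actual_total
    score = 0.0
    if p > 0:
        score += 0.5 * p * math.log2(2 * p / (p + q))
    if q > 0:
        score += 0.5 * q * math.log2(2 * q / (p + q))
    return score

# An item whose share of the total is unchanged between forecast and actual
# gets a surprise score of 0; a shifted share gets a positive score.
print(surprise_score(100, 200, 500, 1000))  # 0.0 (share is 0.2 in both)
print(surprise_score(100, 500, 500, 1000) > 0)  # True (share moved 0.2 -> 0.5)
```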
@@ -0,0 +1,113 @@
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.TimeSeries;

namespace Samples.Dynamic
{
public static class LocalizeRootCause
{
// AGG_SYMBOL marks an aggregated value: a point whose dimension value is
// AGG_SYMBOL represents the aggregation (here, the sum, matching the
// AggregateType.Sum passed to the input) over all values of that dimension.
private static string AGG_SYMBOL = "##SUM##";
> Reviewer comment (Contributor): What is AGG_SYMBOL for here? I notice that on line 19, you have both AGG_SYMBOL and AggregateType.Sum, and that some of the points have AGG_SYMBOL passed in instead of strings like "DC1." Can you add a few comments explaining what AGG_SYMBOL is and why it is used?

public static void Example()
{
// Create a new ML context, for ML.NET operations. It can be used for
// exception tracking and logging, as well as the source of randomness.
var mlContext = new MLContext();

// Create a root cause localization input instance.
DateTime timestamp = GetTimestamp();
var data = new RootCauseLocalizationInput(timestamp, GetAnomalyDimension(), new List<MetricSlice>() { new MetricSlice(timestamp, GetPoints()) }, AggregateType.Sum, AGG_SYMBOL);

// Get the root cause localization result.
RootCause prediction = mlContext.AnomalyDetection.LocalizeRootCause(data);

// Print the localization result.
int count = 0;
foreach (RootCauseItem item in prediction.Items)
{
count++;
Console.WriteLine($"Root cause item #{count} ...");
Console.WriteLine($"Score: {item.Score}, Path: {String.Join(" ",item.Path)}, Direction: {item.Direction}, Dimension:{String.Join(" ", item.Dimension)}");
}

//Root cause item #1 ...
//Score: 0.26670448876705927, Path: DataCenter, Direction: Up, Dimension:[Country, UK] [DeviceType, ##SUM##] [DataCenter, DC1]
}

private static List<Point> GetPoints()
{
List<Point> points = new List<Point>();

Dictionary<string, Object> dic1 = new Dictionary<string, Object>();
dic1.Add("Country", "UK");
dic1.Add("DeviceType", "Laptop");
dic1.Add("DataCenter", "DC1");
points.Add(new Point(200, 100, true, dic1));

Dictionary<string, Object> dic2 = new Dictionary<string, Object>();
dic2.Add("Country", "UK");
dic2.Add("DeviceType", "Mobile");
dic2.Add("DataCenter", "DC1");
points.Add(new Point(1000, 100, true, dic2));

Dictionary<string, Object> dic3 = new Dictionary<string, Object>();
dic3.Add("Country", "UK");
dic3.Add("DeviceType", AGG_SYMBOL);
dic3.Add("DataCenter", "DC1");
points.Add(new Point(1200, 200, true, dic3));

Dictionary<string, Object> dic4 = new Dictionary<string, Object>();
dic4.Add("Country", "UK");
dic4.Add("DeviceType", "Laptop");
dic4.Add("DataCenter", "DC2");
points.Add(new Point(100, 100, false, dic4));

Dictionary<string, Object> dic5 = new Dictionary<string, Object>();
dic5.Add("Country", "UK");
dic5.Add("DeviceType", "Mobile");
dic5.Add("DataCenter", "DC2");
points.Add(new Point(200, 200, false, dic5));

Dictionary<string, Object> dic6 = new Dictionary<string, Object>();
dic6.Add("Country", "UK");
dic6.Add("DeviceType", AGG_SYMBOL);
dic6.Add("DataCenter", "DC2");
points.Add(new Point(300, 300, false, dic6));

Dictionary<string, Object> dic7 = new Dictionary<string, Object>();
dic7.Add("Country", "UK");
dic7.Add("DeviceType", AGG_SYMBOL);
dic7.Add("DataCenter", AGG_SYMBOL);
points.Add(new Point(1500, 500, true, dic7));

Dictionary<string, Object> dic8 = new Dictionary<string, Object>();
dic8.Add("Country", "UK");
dic8.Add("DeviceType", "Laptop");
dic8.Add("DataCenter", AGG_SYMBOL);
points.Add(new Point(300, 200, true, dic8));

Dictionary<string, Object> dic9 = new Dictionary<string, Object>();
dic9.Add("Country", "UK");
dic9.Add("DeviceType", "Mobile");
dic9.Add("DataCenter", AGG_SYMBOL);
points.Add(new Point(1200, 300, true, dic9));

return points;
}

private static Dictionary<string, Object> GetAnomalyDimension()
{
Dictionary<string, Object> dim = new Dictionary<string, Object>();
dim.Add("Country", "UK");
dim.Add("DeviceType", AGG_SYMBOL);
dim.Add("DataCenter", AGG_SYMBOL);

return dim;
}

private static DateTime GetTimestamp()
{
return new DateTime(2020, 3, 23, 0, 0, 0);
}
}
}
50 changes: 49 additions & 1 deletion src/Microsoft.ML.TimeSeries/ExtensionsCatalog.cs
@@ -2,7 +2,11 @@
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System;
using System.Reflection;
using Microsoft.ML.Data;
using Microsoft.ML.Runtime;
using Microsoft.ML.TimeSeries;
using Microsoft.ML.Transforms.TimeSeries;

namespace Microsoft.ML
@@ -143,9 +147,53 @@ public static SsaSpikeEstimator DetectSpikeBySsa(this TransformsCatalog catalog,
/// </format>
/// </example>
public static SrCnnAnomalyEstimator DetectAnomalyBySrCnn(this TransformsCatalog catalog, string outputColumnName, string inputColumnName,
int windowSize=64, int backAddWindowSize=5, int lookaheadWindowSize=5, int averageingWindowSize=3, int judgementWindowSize=21, double threshold=0.3)
int windowSize = 64, int backAddWindowSize = 5, int lookaheadWindowSize = 5, int averageingWindowSize = 3, int judgementWindowSize = 21, double threshold = 0.3)
> Reviewer comment (Contributor): I believe it's spelled "averaging" (without an "e"). If this is an easy change to make, would be a good way to keep our code base looking high quality.

=> new SrCnnAnomalyEstimator(CatalogUtils.GetEnvironment(catalog), outputColumnName, windowSize, backAddWindowSize, lookaheadWindowSize, averageingWindowSize, judgementWindowSize, threshold, inputColumnName);

/// <summary>
/// Create <see cref="RootCause"/>, which localizes root causes using a decision tree algorithm.
/// </summary>
/// <param name="catalog">The anomaly detection catalog.</param>
/// <param name="src">Root cause's input. The data is an instance of <see cref="Microsoft.ML.TimeSeries.RootCauseLocalizationInput"/>.</param>
/// <param name="beta">Beta is a weight parameter chosen by the user, in the range [0,1]. It is used when the score is calculated for each root cause item. With a larger beta, root cause items with a large absolute difference between the actual value and the expected value get a high score; with a smaller beta, root cause items with a large relative change get a high score.</param>
> Reviewer comment (@gvashishtha, May 22, 2020): You say "on the contrary," but the two scenarios you describe don't seem to be opposites. One is about "relative change" and one is about difference between expected value and actual value. Can you make this explanation clearer? Additionally, you mention "score," but it's not clear what score is, exactly.

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[LocalizeRootCause](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/TimeSeries/LocalizeRootCause.cs)]
/// ]]>
/// </format>
/// </example>
public static RootCause LocalizeRootCause(this AnomalyDetectionCatalog catalog, RootCauseLocalizationInput src, double beta = 0.5)
{
IHostEnvironment host = CatalogUtils.GetEnvironment(catalog);

// Check the root cause input.
CheckRootCauseInput(host, src);

// Check that beta is in [0,1].
host.CheckUserArg(beta >= 0 && beta <= 1, nameof(beta), "Must be in [0,1]");

// Find the root cause.
RootCauseAnalyzer analyzer = new RootCauseAnalyzer(src, beta);
RootCause dst = analyzer.Analyze();
return dst;
}

private static void CheckRootCauseInput(IHostEnvironment host, RootCauseLocalizationInput src)
{
host.CheckUserArg(src.Slices.Count >= 1, nameof(src.Slices), "Must have at least one item");

bool containsAnomalyTimestamp = false;
foreach (MetricSlice slice in src.Slices)
{
if (slice.TimeStamp.Equals(src.AnomalyTimestamp))
{
containsAnomalyTimestamp = true;
}
}
host.CheckUserArg(containsAnomalyTimestamp, nameof(src.Slices), "Must contain points at the given anomaly timestamp");
}

/// <summary>
/// Singular Spectrum Analysis (SSA) model for univariate time-series forecasting.
/// For the details of the model, refer to http://arxiv.org/pdf/1206.6910.pdf.