add root cause localization transformer (#4925)
* add root cause localization transformer
* add test cases
* revert sln changes
* add evaluation
* temp save for internal review
* rename function
* temp save bottom up points for switch desktop
* update from laptop
* save for add test
* add root cause localization algorithm
* add root cause localization algorithm
* print score, path and directions in sample
* merge with master
* extract root cause analyzer
* refine code
* merge with master
* add evaluatin
* some refine for code
* fix some typo
* remove unused code
* reformat code
* updates
* update from review
* read double for beta
* remove SignatureDataTransform constructor
* update
* update
* remove white space
* refine internal logic
* update
* update
* updated test
* update score
* update variable name
* add some comments
* refine internal function
* handle for infinity and nan
* rename the algorithm by removing DT
* Update src/Microsoft.ML.TimeSeries/RootCauseAnalyzer.cs (Co-Authored-By: Justin Ormont <[email protected]>)
* fix type
* add an else branch when delta is negative
* update model signature
* update rca interface by removing transformer
* add more documents
* update
* update
* update the constructor
* update comments
* fix typo

Co-authored-by: Justin Ormont <[email protected]>
1 parent 6c30763 · commit bc1fd86 · Showing 7 changed files with 1,306 additions and 3 deletions.
At Microsoft, we developed a decision-tree-based root cause localization method that incrementally finds the root causes of an anomaly incident at a specific timestamp.

## Multi-Dimensional Root Cause Localization

It is common for one measure to be collected with many dimensions (*e.g.*, Province, ISP) whose values are categorical (*e.g.*, Beijing or Shanghai for the dimension Province). When a measure's value deviates from its expected value, that measure is anomalous. In such cases, operators want to localize the root cause dimension combinations rapidly and accurately. Multi-dimensional root cause localization is critical for troubleshooting and mitigating such incidents.
## Algorithm

The decision-tree-based root cause localization method is unsupervised, which means no training step is needed. It consists of the following major steps:

(1) Find the best dimension, which divides the anomalous and regular data, using a decision tree according to entropy gain and entropy gain ratio.

(2) Find the top anomaly points that contribute most to the anomaly incident within the selected best dimension.
### Decision Tree

The [decision tree](https://en.wikipedia.org/wiki/Decision_tree) algorithm splits on the attribute with the highest information gain when constructing the tree. We use it to choose the dimension that contributes most to the anomaly. The following concepts are used in decision trees.
#### Information Entropy

Information [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) is a measure of disorder or uncertainty. You can also think of it as a measure of purity: the lower the value, the purer the dataset $D$.

$$Ent(D) = - \sum_{k=1}^{|y|} p_k\log_2(p_k)$$

where $p_k$ represents the proportion of elements in the dataset that belong to class $k$, and $|y|$ is the number of classes. In our case there are only two classes: the anomalous points and the regular points.
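As an illustration only (not the library's implementation, and with a hypothetical helper name), the entropy of a set of anomaly/regular labels can be computed directly from the definition:

```python
import math

def entropy(labels):
    """Shannon entropy Ent(D) of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A half-anomalous, half-regular set is maximally impure (1 bit of entropy);
# a set containing only regular points is pure (entropy 0).
mixed = ["anomaly", "regular", "anomaly", "regular"]
pure = ["regular", "regular", "regular"]
print(entropy(mixed))  # 1.0
print(entropy(pure))
```

In the two-class case here, $p_k$ is simply the fraction of anomalous (or regular) points in $D$.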
#### Information Gain

[Information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees) is a metric that measures the reduction of this disorder in our target class given additional information about it. Mathematically it can be written as:

$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$

where $Ent(D^v)$ is the entropy of the set of points in $D$ for which dimension $a$ equals $v$, $|D|$ is the total number of points in dataset $D$, and $|D^v|$ is the number of points in $D$ for which dimension $a$ equals $v$.

For all aggregated dimensions, we calculate the information gain of each dimension. The greater the reduction in uncertainty, the more information is gained about $D$ from dimension $a$.
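A minimal sketch of this computation (hypothetical helper names; the real analyzer operates on aggregated time-series points, not plain tuples):

```python
import math

def ent(labels):
    """Ent(D): Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(points, dim):
    """Gain(D, a): entropy of D minus the weighted entropy of the
    partitions of D induced by the values v of dimension `dim`."""
    labels = [is_anomaly for _, is_anomaly in points]
    partitions = {}
    for dims, is_anomaly in points:
        partitions.setdefault(dims[dim], []).append(is_anomaly)
    weighted = sum(len(sub) / len(points) * ent(sub)
                   for sub in partitions.values())
    return ent(labels) - weighted

# Anomalies concentrate in DC1, so splitting on "DataCenter" separates
# the two classes completely and recovers the full 1 bit of entropy.
data = [
    ({"DataCenter": "DC1"}, True),
    ({"DataCenter": "DC1"}, True),
    ({"DataCenter": "DC2"}, False),
    ({"DataCenter": "DC2"}, False),
]
print(information_gain(data, "DataCenter"))  # 1.0
```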
#### Entropy Gain Ratio

Information gain is biased toward variables with a large number of distinct values. A modification is the [information gain ratio](https://en.wikipedia.org/wiki/Information_gain_ratio), which reduces this bias.

$$Ratio(D, a) = \frac{Gain(D,a)}{IV(a)}$$

where the intrinsic value ($IV$) is the entropy of the split with respect to the dimension $a$ in focus:

$$IV(a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$
In our strategy, we first loop over all aggregated dimensions and keep those whose entropy gain is above the mean entropy gain; from these filtered dimensions, we then select the one with the highest entropy gain ratio as the best dimension. Dimensions for which the anomaly value count is only one are also included in the calculation.
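The selection strategy above can be sketched as follows, under stated assumptions: the helper names are hypothetical, and the actual analyzer's filtering details may differ.

```python
import math

def ent_from_counts(counts, n):
    """Entropy in bits from a collection of class counts summing to n."""
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def gain_and_ratio(points, dim):
    """Return (Gain(D, a), Ratio(D, a)) for one candidate dimension."""
    n = len(points)
    class_counts = {}
    partitions = {}
    for dims, is_anomaly in points:
        class_counts[is_anomaly] = class_counts.get(is_anomaly, 0) + 1
        partitions.setdefault(dims[dim], []).append(is_anomaly)
    ent_d = ent_from_counts(class_counts.values(), n)
    weighted, iv = 0.0, 0.0
    for sub in partitions.values():
        sub_counts = {}
        for label in sub:
            sub_counts[label] = sub_counts.get(label, 0) + 1
        weighted += len(sub) / n * ent_from_counts(sub_counts.values(), len(sub))
        iv -= len(sub) / n * math.log2(len(sub) / n)  # intrinsic value IV(a)
    gain = ent_d - weighted
    return gain, (gain / iv if iv else 0.0)

def best_dimension(points, dims):
    """Keep dimensions with above-mean gain, then pick the highest gain ratio."""
    scores = {d: gain_and_ratio(points, d) for d in dims}
    mean_gain = sum(g for g, _ in scores.values()) / len(scores)
    candidates = [d for d, (g, _) in scores.items() if g >= mean_gain]
    return max(candidates, key=lambda d: scores[d][1])

# "DataCenter" separates anomalies perfectly while "DeviceType" carries
# no signal, so it survives the mean-gain filter and wins on ratio.
data = [
    ({"DataCenter": "DC1", "DeviceType": "Laptop"}, True),
    ({"DataCenter": "DC1", "DeviceType": "Mobile"}, True),
    ({"DataCenter": "DC2", "DeviceType": "Laptop"}, False),
    ({"DataCenter": "DC2", "DeviceType": "Mobile"}, False),
]
print(best_dimension(data, ["DataCenter", "DeviceType"]))  # DataCenter
```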
> [!Note]
> 1. Our algorithm depends on the data you input, so if the input points are incorrect or incomplete, the calculated result will be unexpected.
> 2. Currently, the algorithm localizes the root cause incrementally, which means at most one dimension and its values are detected. If you want to find all the dimensions that contribute to the anomaly, you can call this API recursively, updating the anomaly incident with the fixed dimension value each time.
The surprise score is used to capture the relative change for a root cause item:

$$S_i(m) = 0.5\left( p_i\log_2\left(\frac{2p_i}{p_i+q_i}\right) + q_i \log_2\left(\frac{2q_i}{p_i+q_i}\right) \right)$$

$$p_i(m)= \frac{F_i(m)}{F(m)}$$

$$q_i(m)= \frac{A_i(m)}{A(m)}$$

where $F_i$ is the forecasted value for root cause item $i$, $A_i$ is the actual value for root cause item $i$, $F$ is the forecasted value for the anomaly point, and $A$ is the actual value for the anomaly point.

For details of the surprise score, refer to [this document](https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-bhagwan.pdf).
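A sketch of the score under these definitions (hypothetical function name; zero-valued shares are guarded so `log2(0)` is never taken):

```python
import math

def surprise_score(f_i, a_i, f_total, a_total):
    """S_i = 0.5 * (p_i * log2(2p_i/(p_i+q_i)) + q_i * log2(2q_i/(p_i+q_i))).

    f_i, a_i: forecasted and actual values of root cause item i;
    f_total, a_total: forecasted and actual values of the anomaly point.
    """
    p = f_i / f_total  # forecasted share of item i
    q = a_i / a_total  # actual share of item i
    score = 0.0
    if p > 0:
        score += 0.5 * p * math.log2(2 * p / (p + q))
    if q > 0:
        score += 0.5 * q * math.log2(2 * q / (p + q))
    return score

# An item whose share of the total is unchanged (p == q) is not surprising.
print(surprise_score(100, 200, 1000, 2000))  # 0.0
# An item whose actual share jumps relative to its forecast scores higher.
print(surprise_score(100, 800, 1000, 2000))
```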
docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/TimeSeries/LocalizeRootCause.cs (113 additions, 0 deletions)
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.TimeSeries;

namespace Samples.Dynamic
{
    public static class LocalizeRootCause
    {
        private static string AGG_SYMBOL = "##SUM##";

        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a root cause localization input instance.
            DateTime timestamp = GetTimestamp();
            var data = new RootCauseLocalizationInput(timestamp, GetAnomalyDimension(), new List<MetricSlice>() { new MetricSlice(timestamp, GetPoints()) }, AggregateType.Sum, AGG_SYMBOL);

            // Get the root cause localization result.
            RootCause prediction = mlContext.AnomalyDetection.LocalizeRootCause(data);

            // Print the localization result.
            int count = 0;
            foreach (RootCauseItem item in prediction.Items)
            {
                count++;
                Console.WriteLine($"Root cause item #{count} ...");
                Console.WriteLine($"Score: {item.Score}, Path: {String.Join(" ", item.Path)}, Direction: {item.Direction}, Dimension: {String.Join(" ", item.Dimension)}");
            }

            // Expected console output:
            // Root cause item #1 ...
            // Score: 0.26670448876705927, Path: DataCenter, Direction: Up, Dimension: [Country, UK] [DeviceType, ##SUM##] [DataCenter, DC1]
        }

        private static List<Point> GetPoints()
        {
            List<Point> points = new List<Point>();

            Dictionary<string, Object> dic1 = new Dictionary<string, Object>();
            dic1.Add("Country", "UK");
            dic1.Add("DeviceType", "Laptop");
            dic1.Add("DataCenter", "DC1");
            points.Add(new Point(200, 100, true, dic1));

            Dictionary<string, Object> dic2 = new Dictionary<string, Object>();
            dic2.Add("Country", "UK");
            dic2.Add("DeviceType", "Mobile");
            dic2.Add("DataCenter", "DC1");
            points.Add(new Point(1000, 100, true, dic2));

            Dictionary<string, Object> dic3 = new Dictionary<string, Object>();
            dic3.Add("Country", "UK");
            dic3.Add("DeviceType", AGG_SYMBOL);
            dic3.Add("DataCenter", "DC1");
            points.Add(new Point(1200, 200, true, dic3));

            Dictionary<string, Object> dic4 = new Dictionary<string, Object>();
            dic4.Add("Country", "UK");
            dic4.Add("DeviceType", "Laptop");
            dic4.Add("DataCenter", "DC2");
            points.Add(new Point(100, 100, false, dic4));

            Dictionary<string, Object> dic5 = new Dictionary<string, Object>();
            dic5.Add("Country", "UK");
            dic5.Add("DeviceType", "Mobile");
            dic5.Add("DataCenter", "DC2");
            points.Add(new Point(200, 200, false, dic5));

            Dictionary<string, Object> dic6 = new Dictionary<string, Object>();
            dic6.Add("Country", "UK");
            dic6.Add("DeviceType", AGG_SYMBOL);
            dic6.Add("DataCenter", "DC2");
            points.Add(new Point(300, 300, false, dic6));

            Dictionary<string, Object> dic7 = new Dictionary<string, Object>();
            dic7.Add("Country", "UK");
            dic7.Add("DeviceType", AGG_SYMBOL);
            dic7.Add("DataCenter", AGG_SYMBOL);
            points.Add(new Point(1500, 500, true, dic7));

            Dictionary<string, Object> dic8 = new Dictionary<string, Object>();
            dic8.Add("Country", "UK");
            dic8.Add("DeviceType", "Laptop");
            dic8.Add("DataCenter", AGG_SYMBOL);
            points.Add(new Point(300, 200, true, dic8));

            Dictionary<string, Object> dic9 = new Dictionary<string, Object>();
            dic9.Add("Country", "UK");
            dic9.Add("DeviceType", "Mobile");
            dic9.Add("DataCenter", AGG_SYMBOL);
            points.Add(new Point(1200, 300, true, dic9));

            return points;
        }

        private static Dictionary<string, Object> GetAnomalyDimension()
        {
            Dictionary<string, Object> dim = new Dictionary<string, Object>();
            dim.Add("Country", "UK");
            dim.Add("DeviceType", AGG_SYMBOL);
            dim.Add("DataCenter", AGG_SYMBOL);

            return dim;
        }

        private static DateTime GetTimestamp()
        {
            return new DateTime(2020, 3, 23, 0, 0, 0);
        }
    }
}