Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add root cause localization transformer #4925

Merged
merged 51 commits into from
May 11, 2020
Merged

add root cause localization transformer #4925

merged 51 commits into from
May 11, 2020

Conversation

suxi-ms
Copy link
Member

@suxi-ms suxi-ms commented Mar 10, 2020

The goal of this pull request is to provide a decision tree based algorithm to localize the root cause of an incident on multi-dimensional time series on a specified timestamp.

@suxi-ms suxi-ms requested a review from a team as a code owner March 10, 2020 01:30
@dnfclas
Copy link

dnfclas commented Mar 10, 2020

CLA assistant check
All CLA requirements met. #Resolved

Copy link
Contributor

@mstfbl mstfbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @suxi-ms , thank you for your PR! Can you please briefly describe the changes you've made according to our contribution guide?

@frank-dong-ms-zz
Copy link
Contributor

frank-dong-ms-zz commented Mar 10, 2020

@suxi-ms, please do following thing then we can start code review:

  1. brief describe what you are going to do at the PR comments
  2. how many stages/PRs you plan to submit
  3. what is purpose of each PRs
  4. add reviewers that you need sign off from who have best knowledge for your change (you can find that from code history), you can always add harishsk to review as Harish is the manager of ML.NET
  5. always add necessary tests #Resolved

private static IRowMapper Create(IHostEnvironment env, ModelLoadContext ctx, DataViewSchema inputSchema)
=> Create(env, ctx).MakeRowMapper(inputSchema);

private protected override void SaveModel(ModelSaveContext ctx)
Copy link

@yaeldekel yaeldekel Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SaveModel [](start = 40, length = 9)

Don't you need to save (and load) _beta? #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SaveModel [](start = 40, length = 9)

Don't you need to save (and load) _beta?

Thank you for the suggestion, I have added changeds to save _beta #Resolved

}

[Fact]
public void RootCauseLocalizationWithDT()
Copy link

@yaeldekel yaeldekel Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RootCauseLocalizationWithDT [](start = 20, length = 27)

Please also test serialization/deserialization. #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RootCauseLocalizationWithDT [](start = 20, length = 27)

Please also test serialization/deserialization.

Could you help explain more, I did't exactly understand the suggestion. Which part needs serialization/deserialization? #Resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the DTRootCauseLocalizationTransformer class is methods that save the model to a file, and that load the model back from a file. This test should exercise these methods, by using the ml.Model.Save and ml.Model.Load APIs, and checking that the values returned by the deserialized model are the same as the ones returned by the original model.


In reply to: 399045379 [](ancestors = 399045379)

Copy link
Member Author

@suxi-ms suxi-ms Mar 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the DTRootCauseLocalizationTransformer class is methods that save the model to a file, and that load the model back from a file. This test should exercise these methods, by using the ml.Model.Save and ml.Model.Load APIs, and checking that the values returned by the deserialized model are the same as the ones returned by the original model.

In reply to: 399045379 [](ancestors = 399045379)

Do I need to save the model, I find in the document, the save menthod is used for training related data? #Resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was unclear. You should save the model, not the data. Using this API:
https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs#L112
You then need to load the model using this API:
https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs#L211
and then run the loaded model on the data, and verify that it gives the same outputs as the model before serialization.

Here is a test that contains an example of how to save and load the model:
https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Tests/Transformers/NormalizerTests.cs#L961

(the saving and loading is done in lines 969 and 970).


In reply to: 399795392 [](ancestors = 399795392)

Copy link
Member Author

@suxi-ms suxi-ms Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the DTRootCauseLocalizationTransformer class is methods that save the model to a file, and that load the model back from a file. This test should exercise these methods, by using the ml.Model.Save and ml.Model.Load APIs, and checking that the values returned by the deserialized model are the same as the ones returned by the original model.

In reply to: 399045379 [](ancestors = 399045379)

I have tried to use this logic to save and load model,

        var modelPath = "temp.zip";
        //Save model to a file
        ml.Model.Save(model, data.Schema, modelPath);

        //Load model from a file
        ITransformer serializedModel;
        using (var file = File.OpenRead(modelPath))
        {
            serializedModel = ml.Model.Load(file, out var serializedSchema);
            TestCommon.CheckSameSchemas(data.Schema, serializedSchema);
        }

However, the Save() method will hit an exception because Save.IsColumnSavable(type) returns false. The root cause is we have a customized dataviewtype, which is neither VectorDataViewType nor PrimitiveDataViewType, the check failed so no columns is added for save. Do you have any guidance on this? #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Apr 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the DTRootCauseLocalizationTransformer class is methods that save the model to a file, and that load the model back from a file. This test should exercise these methods, by using the ml.Model.Save and ml.Model.Load APIs, and checking that the values returned by the deserialized model are the same as the ones returned by the original model.
In reply to: 399045379 [](ancestors = 399045379)

I have tried to use this logic to save and load model,

        var modelPath = "temp.zip";
        //Save model to a file
        ml.Model.Save(model, data.Schema, modelPath);

        //Load model from a file
        ITransformer serializedModel;
        using (var file = File.OpenRead(modelPath))
        {
            serializedModel = ml.Model.Load(file, out var serializedSchema);
            TestCommon.CheckSameSchemas(data.Schema, serializedSchema);
        }

However, the Save() method will hit an exception because Save.IsColumnSavable(type) returns false. The root cause is we have a customized dataviewtype, which is neither VectorDataViewType nor PrimitiveDataViewType, the check failed so no columns is added for save. Do you have any guidance on this?

Has updated according to your suggestion in email discussion #Resolved

var src = default(RootCauseLocalizationInput);
var getSrc = input.GetGetter<RootCauseLocalizationInput>(input.Schema[ColMapNewToOld[iinfo]]);

disposer =
Copy link

@yaeldekel yaeldekel Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disposer [](start = 16, length = 8)

I think this is not necessary. This is used by components that use resources that are IDisposable, for example, ImageResizer that uses a Bitmap for its computations. #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disposer [](start = 16, length = 8)

I think this is not necessary. This is used by components that use resources that are IDisposable, for example, ImageResizer that uses a Bitmap for its computations.

Have changed, please help review whether the change is right #Resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also assign null instead.


In reply to: 399045482 [](ancestors = 399045482)

Copy link
Member Author

@suxi-ms suxi-ms Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also assign null instead.

In reply to: 399045482 [](ancestors = 399045482)

Have updated to null #Resolved

/// <param name="columns">The name of the columns (first item of the tuple), and the name of the resulting output column (second item of the tuple).</param>
/// <param name="beta">The weight for generating score in output result.</param>
[BestFriend]
internal DTRootCauseLocalizationEstimator(IHostEnvironment env, double beta = Defaults.Beta, params (string outputColumnName, string inputColumnName)[] columns)
Copy link

@yaeldekel yaeldekel Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[] [](start = 157, length = 2)

Since this transformer is non-trainable (meaning, the estimator doesn't need to look at the data in order to create a transformer), there is no advantage to supporting multiple input/output columns here - if there is ever a scenario where there are multiple input columns, the user can create an estimator chain instead. #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[] [](start = 157, length = 2)

Since this transformer is non-trainable (meaning, the estimator doesn't need to look at the data in order to create a transformer), there is no advantage to supporting multiple input/output columns here - if there is ever a scenario where there are multiple input columns, the user can create an estimator chain instead.

I find that ImageLoading and TextNormalizing are also non-trainable, however they defines this columns parameter, could you help explain when to use it and when not? #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[] [](start = 157, length = 2)

Since this transformer is non-trainable (meaning, the estimator doesn't need to look at the data in order to create a transformer), there is no advantage to supporting multiple input/output columns here - if there is ever a scenario where there are multiple input columns, the user can create an estimator chain instead.

I find that ImageLoading and TextNormalizing are also non-trainable, however they defines this columns parameter, could you help explain when to use it and when not?

I have updated the columns to outputColumnand inputColumn, could you help check whether they are right? #Resolved

_beta = beta;
}

// Factory method for SignatureDataTransform.
Copy link

@yaeldekel yaeldekel Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SignatureDataTransform [](start = 30, length = 22)

This (and the matching attribute) is only needed if you want to use this component from the command line. If this is the intention, please add a test for running this using maml. #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SignatureDataTransform [](start = 30, length = 22)

This (and the matching attribute) is only needed if you want to use this component from the command line. If this is the intention, please add a test for running this using maml.

I have no idea about the maml, could you help provide more information? #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SignatureDataTransform [](start = 30, length = 22)

This (and the matching attribute) is only needed if you want to use this component from the command line. If this is the intention, please add a test for running this using maml.

I have no idea about the maml, could you help provide more information?

maml is not needed and I have removed this method #Resolved

if (!(col.ItemType is RootCauseLocalizationInputDataViewType) || col.Kind != SchemaShape.Column.VectorKind.Scalar)
throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.inputColumnName, new RootCauseLocalizationInputDataViewType().ToString(), col.GetTypeString());

result[colInfo.outputColumnName] = new SchemaShape.Column(colInfo.outputColumnName, col.Kind, col.ItemType, col.IsKey, col.Annotations);
Copy link

@yaeldekel yaeldekel Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col.Kind, col.ItemType, col.IsKey, col.Annotations [](start = 100, length = 50)

These should be the values for the output column. The Kind should be SchemaShape.Column.VectorKind.Scalar (this should match col.Kind since you are checking above that it is a scalar), but what should the ItemType be? Also, seems like col.IsKey is always false, and are there any annotations that need to be passed from the input to the output? #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col.Kind, col.ItemType, col.IsKey, col.Annotations [](start = 100, length = 50)

These should be the values for the output column. The Kind should be SchemaShape.Column.VectorKind.Scalar (this should match col.Kind since you are checking above that it is a scalar), but what should the ItemType be? Also, seems like col.IsKey is always false, and are there any annotations that need to be passed from the input to the output?

made some udpates, could you help review whether they are right? #Resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this check removed - col.Kind != SchemaShape.Column.VectorKind.Scalar? Isn't it correct?


In reply to: 399045854 [](ancestors = 399045854)

Copy link
Member Author

@suxi-ms suxi-ms Mar 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha

Why was this check removed - col.Kind != SchemaShape.Column.VectorKind.Scalar? Isn't it correct?

In reply to: 399045854 [](ancestors = 399045854)

Have updated. #Resolved

@@ -146,6 +146,23 @@ public static class TimeSeriesCatalog
int windowSize=64, int backAddWindowSize=5, int lookaheadWindowSize=5, int averageingWindowSize=3, int judgementWindowSize=21, double threshold=0.3)
=> new SrCnnAnomalyEstimator(CatalogUtils.GetEnvironment(catalog), outputColumnName, windowSize, backAddWindowSize, lookaheadWindowSize, averageingWindowSize, judgementWindowSize, threshold, inputColumnName);

/// <summary>
/// Create <see cref="DTRootCauseLocalizationEstimator"/>, which localizes root causess using decision tree algorithm.
Copy link
Member

@ganik ganik Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

causess [](start = 88, length = 7)

typo #Resolved

Copy link
Member Author

@suxi-ms suxi-ms Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

causess [](start = 88, length = 7)

typo

Thank you for pointing it out, has fixed it. #Resolved

return subDim;
}

protected List<RootCauseItem> LocalizeRootCauseByDimension(PointTree anomalyTree, PointTree pointTree, Dictionary<string, Object> anomalyDimension, List<string> aggDims)
Copy link
Contributor

@harishsk harishsk May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Please consider something along the lines of the following for the GetSubDim function.

return new Dictionary<string, object>(keyList.Select(dim => new KeyValuePair<string, object>(dim, dimension[dim])));

#Resolved

Copy link
Member Author

@suxi-ms suxi-ms May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Please consider something along the lines of the following for the GetSubDim function.

return new Dictionary<string, object>(keyList.Select(dim => new KeyValuePair<string, object>(dim, dimension[dim])));

updated #Resolved

}

private double Log2(double val)
{
Copy link
Contributor

@harishsk harishsk May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider using the following attribute for the function:

        [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
``` #Resolved

Copy link
Member Author

@suxi-ms suxi-ms May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider using the following attribute for the function:

        [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]

updated #Resolved

if (dimension.Key.AnomalyDis.Count > 1)
{
if (best == null || (!Double.IsNaN(valueRatioMap[best]) && (best.AnomalyDis.Count != 1 && (isLeavesLevel ? valueRatioMap[best].CompareTo(dimension.Value) <= 0 : valueRatioMap[best].CompareTo(dimension.Value) >= 0))))
{
Copy link
Contributor

@harishsk harishsk May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: For readability, can you please avoid the long line and break up the conditions on separate lines? Same comment for line 487. #Resolved

Copy link
Member Author

@suxi-ms suxi-ms May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: For readability, can you please avoid the long line and break up the conditions on separate lines? Same comment for line 487.

made some updates #Resolved


//check beta
if (beta < 0 || beta > 1) {
host.CheckUserArg(beta >= 0 && beta <= 1, nameof(beta), "Must be in [0,1]");
Copy link
Contributor

@harishsk harishsk May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this if check. The CheckUserArg is performing the check. #Resolved

Copy link
Member Author

@suxi-ms suxi-ms May 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this if check. The CheckUserArg is performing the check.

will update #Resolved

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@harishsk harishsk merged commit bc1fd86 into dotnet:master May 11, 2020
@@ -0,0 +1,49 @@
At Mircosoft, we develop a decision tree based root cause localization method which helps to find out the root causes for an anomaly incident at a specific timestamp incrementally.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, "Microsoft." Also, it's a bit nonstandard to use present tense for "we develop." I would expect "we have developed" if the work is completed or "we maintain" if the work is ongoing.

At Mircosoft, we develop a decision tree based root cause localization method which helps to find out the root causes for an anomaly incident at a specific timestamp incrementally.

## Multi-Dimensional Root Cause Localization
It's a common case that one measure is collected with many dimensions (*e.g.*, Province, ISP) whose values are categorical(*e.g.*, Beijing or Shanghai for dimension Province). When a measure's value deviates from its expected value, this measure encounters anomalies. In such case, operators would like to localize the root cause dimension combinations rapidly and accurately. Multi-dimensional root cause localization is critical to troubleshoot and mitigate such case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use "users" instead of "operators."


### Decision Tree

[Decision tree](https://en.wikipedia.org/wiki/Decision_tree) algorithm chooses the highest information gain to split or construct a decision tree.  We use it to choose the dimension which contributes the most to the anomaly. Following are some concepts used in decision tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's non-standard to omit articles here. Try something like "The Decision Tree algorithm chooses..." and "Below are some concepts used in decision trees"


Where $Ent(D^v)$ is the entropy of set points in D for which dimension $a$ is equal to $v$, $|D|$ is the total number of points in dataset $D$. $|D^V|$ is the total number of points in dataset $D$ for which dimension $a$ is equal to $v$.

For all aggregated dimensions, we calculate the information for each dimension. The greater the reduction in this uncertainty, the more information is gained about D from dimension $a$.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D should be in dollar signs?

{
public static class LocalizeRootCause
{
private static string AGG_SYMBOL = "##SUM##";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is AGG_SYMBOL for here? I notice that on line 19, you have both AGG_SYMBOL and AggregateType.Sum, and that some of the points have AGG_SYMBOL passed in instead of strings like "DC1."

Can you add a few comments explaining what AGG_SYMBOL is and why it is used?

@@ -143,9 +147,53 @@ public static class TimeSeriesCatalog
/// </format>
/// </example>
public static SrCnnAnomalyEstimator DetectAnomalyBySrCnn(this TransformsCatalog catalog, string outputColumnName, string inputColumnName,
int windowSize=64, int backAddWindowSize=5, int lookaheadWindowSize=5, int averageingWindowSize=3, int judgementWindowSize=21, double threshold=0.3)
int windowSize = 64, int backAddWindowSize = 5, int lookaheadWindowSize = 5, int averageingWindowSize = 3, int judgementWindowSize = 21, double threshold = 0.3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's spelled "averaging" (without an "e"). If this is an easy change to make, would be a good way to keep our code base looking high quality.

/// </summary>
/// <param name="catalog">The anomaly detection catalog.</param>
/// <param name="src">Root cause's input. The data is an instance of <see cref="Microsoft.ML.TimeSeries.RootCauseLocalizationInput"/>.</param>
/// <param name="beta">Beta is a weight parameter for user to choose. It is used when score is calculated for each root cause item. The range of beta should be in [0,1]. For a larger beta, root cause point which has a large difference between value and expected value will get a high score. On the contrary, for a small beta, root cause items which has a high relative change will get a high score.</param>
Copy link
Contributor

@gvashishtha gvashishtha May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say "on the contrary," but the two scenarios you describe don't seem to be opposites. One is about "relative change" and one is about difference between expected value and actual value. Can you make this explanation clearer?

Additionally, you mention "score," but it's not clear what score is, exactly.

@suxi-ms suxi-ms mentioned this pull request Jun 2, 2020
4 tasks
@suxi-ms suxi-ms mentioned this pull request Jun 9, 2020
4 tasks
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add root cause localization algorithm