False alarm categories and solutions #165

tylergu · 2022-08-17T01:51:00Z

This issue documents the categories of false alarms we encountered when applying Acto to the operators, and the solutions we use to reduce them.

1. Field doesn't map to system state

Example

The field ['spec', 'allowUnsafeConfigurations'] configures the operator to allow some unsafe configuration options. Changing this field changes the operator's behavior, but the value of the field does not map to any field in the system state.

Solution

We use taint analysis to find the fields that do not directly map to the system state.
We first find the mapping between the CR fields and the variables in the program source code. We then run taint analysis on these program variables to see whether they taint any k8s client calls. If a variable does not taint any k8s client call, the value of its field does not show up in the system state. We collect all the fields that do not flow into k8s client calls and ignore them when running the oracle.
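
A minimal sketch of the filtering step, assuming the data flow has already been extracted into a simple graph. The real analysis runs on the operator's source code; here the taint check is reduced to plain graph reachability, and all names (`fields_reaching_sinks`, the variable names, `client.Update` as a sink label) are hypothetical.

```python
from collections import deque


def fields_reaching_sinks(field_to_var, flows, sinks):
    """Return the CR fields whose variable can reach a sink via data-flow edges."""
    reaching = set()
    for field, var in field_to_var.items():
        seen, queue = {var}, deque([var])
        while queue:
            node = queue.popleft()
            if node in sinks:
                reaching.add(field)
                break
            for nxt in flows.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return reaching


# Hypothetical data: CR field -> program variable, and data-flow edges.
field_to_var = {
    ("spec", "image"): "v_image",
    ("spec", "allowUnsafeConfigurations"): "v_unsafe",
}
flows = {
    "v_image": ["v_container"],
    "v_container": ["client.Update"],
    "v_unsafe": ["v_validation_flag"],
}
sinks = {"client.Update"}  # stands in for the k8s client calls

reaching = fields_reaching_sinks(field_to_var, flows, sinks)
# Fields that never flow into a client call are skipped by the oracle.
ignored_fields = set(field_to_var) - reaching
```

In this toy graph only `allowUnsafeConfigurations` ends up in `ignored_fields`, which mirrors the example above.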

2. Field dependency

Example

The field ['spec', 'replsets', 0, 'storage', 'inMemory', 'engineConfig', 'inMemorySizeRatio'] is only effective if the field ['spec', 'replsets', 0, 'storage', 'storageType'] is inMemory.

Solution

We use dependency analysis to infer the dependencies between fields.
We reuse the mapping between the CR fields and the program variables from the taint analysis above. We then run dominator and post-dominator analysis to find the dominating if branches and the code they dominate, and identify the program variables that are dominated by the if branch conditions. We only handle simple, direct if conditions, because we have to resolve the condition, e.g.

if spec.Condition == someConstant {
    // do something
}

We are not able to resolve conditions that go through complex function calls, e.g.

if version.CompareVersion(spec.version, "1.12.0") < 0 {
    // do something
}
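
Once the dependencies are known, the oracle can check them before expecting a change. The following is only a sketch under assumptions: the `dependencies` table format, the helper names, and the `wiredTiger` value are made up for illustration and are not Acto's actual data structures.

```python
def get_path(doc, path):
    """Follow a path of dict keys / list indices; return None if anything is missing."""
    cur = doc
    for key in path:
        try:
            cur = cur[key]
        except (KeyError, IndexError, TypeError):
            return None
    return cur


def is_effective(cr, field, dependencies):
    """A field is effective only if every controlling field has its required value."""
    for controller, required in dependencies.get(field, []):
        if get_path(cr, controller) != required:
            return False
    return True


SIZE_RATIO = ("spec", "replsets", 0, "storage", "inMemory",
              "engineConfig", "inMemorySizeRatio")
STORAGE_TYPE = ("spec", "replsets", 0, "storage", "storageType")

# Hypothetical output of the dependency analysis.
dependencies = {SIZE_RATIO: [(STORAGE_TYPE, "inMemory")]}

cr_other = {"spec": {"replsets": [{"storage": {"storageType": "wiredTiger"}}]}}
cr_in_memory = {"spec": {"replsets": [{"storage": {"storageType": "inMemory"}}]}}

skip_oracle = not is_effective(cr_other, SIZE_RATIO, dependencies)
run_oracle = is_effective(cr_in_memory, SIZE_RATIO, dependencies)
```

Fields without an entry in `dependencies` are treated as always effective, so the check only suppresses alarms for fields the analysis could actually resolve.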

3. Format

Example

Some fields are mapped to the system state, but their format is changed. For example, the CR field spec.resources.request.cpu is set to 0.1, but in the system state it is formatted as 100m.

Solution

We first convert the values to the same format, then do the comparison.
We use the same utility function from the k8s library that performs this formatting to convert both the CR field and the system state field to a common format before comparing them.
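
To illustrate the idea, here is a simplified stand-in for that normalization. It is not the k8s library function: it only handles a small, assumed subset of quantity suffixes, and the names `canonical_quantity` and `same_quantity` are hypothetical.

```python
from decimal import Decimal

# Assumed subset of quantity suffixes; the real k8s parser supports more.
SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(1000),
    "M": Decimal(10**6),
    "G": Decimal(10**9),
    "Ki": Decimal(1024),
    "Mi": Decimal(1024**2),
    "Gi": Decimal(1024**3),
}


def canonical_quantity(value):
    """Convert a quantity such as 0.1, "100m" or "1Gi" to a plain Decimal."""
    text = str(value).strip()
    # Try longer suffixes first so that "Mi" is never mistaken for "M".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * SUFFIXES[suffix]
    return Decimal(text)


def same_quantity(cr_value, state_value):
    """Compare two quantities after normalizing both sides."""
    return canonical_quantity(cr_value) == canonical_quantity(state_value)


cpu_matches = same_quantity(0.1, "100m")
memory_matches = same_quantity("1Gi", "1024Mi")
```

Normalizing both sides before comparing avoids having to guess which representation the operator or the API server will choose.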

4. Ineffective input generation

Example

Acto sometimes generates ineffective input by changing a field from null to its default value.
For example, the field spec.replicas is changed from null to 2. Since the operator uses the default value when the field is set to null, changing the field from null to 2 causes no change.

Solution

We run static program analysis to extract the default values for fields. We use the pattern:

if cr.Spec.ImagePullPolicy == "" {
    cr.Spec.ImagePullPolicy = defaultImagePullPolicy
}

where the defaultImagePullPolicy is essentially a constant.
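
How the extracted defaults could then be used is sketched below. This is an assumption about the consuming side, not a description of Acto's code: the `defaults` table and `is_ineffective_change` are hypothetical names.

```python
def is_ineffective_change(field, prev, curr, defaults):
    """True if prev and curr are equal once null is replaced by the default."""
    if field not in defaults:
        return False
    default = defaults[field]
    effective_prev = default if prev is None else prev
    effective_curr = default if curr is None else curr
    return effective_prev == effective_curr


# Hypothetical defaults extracted by the static analysis.
defaults = {("spec", "replicas"): 2}

null_to_default = is_ineffective_change(("spec", "replicas"), None, 2, defaults)
null_to_other = is_ineffective_change(("spec", "replicas"), None, 3, defaults)
```

A change flagged this way should not be expected to produce any delta in the system state.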

5. Invalid input

Example

Acto sometimes generates invalid input, which is rejected by the operator and has no effect on the system state. However, Acto expects a corresponding change and thus reports a false alarm.
For example, when Acto changes the field spec.secretName to some random string, the operator finds that there is no secret with that name and rejects the input. Acto expects a change, so it reports a false alarm.

Solution

We try to recognize whether an input is invalid by collecting feedback from the operator.
We capture the error messages in the operator log and analyze them. We check whether an error message contains the field name or the value of the input change. If it does, Acto determines that the previous input is invalid and skips running the oracle.
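
A minimal sketch of this check, assuming the error messages have already been extracted from the operator log. The function name, the sample message, and the random value are hypothetical.

```python
def input_rejected(error_messages, field_path, new_value):
    """True if any error message mentions the changed field name or its new value."""
    field_name = str(field_path[-1])
    value_text = "" if new_value is None else str(new_value)
    for message in error_messages:
        if field_name in message:
            return True
        # Guard against empty values, which would match every message.
        if value_text and value_text in message:
            return True
    return False


# Hypothetical error message from an operator log.
errors = ['failed to reconcile: Secret "vqkxz" not found']

rejected = input_rejected(errors, ("spec", "secretName"), "vqkxz")
unrelated = input_rejected(errors, ("spec", "replicas"), 3)
```

When `input_rejected` returns True, the oracle is skipped for that step instead of expecting a state change.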

6. null != default

Example

Acto sometimes changes a field from a value to null. However, the operator can have a default value for the field when it is null. When comparing values in the oracle, Acto considers null != default value and reports a false alarm.
E.g. the field spec.image is changed from rabbitmq:1.2 to null, but the corresponding system state changes from rabbitmq:1.2 to rabbitmq:1.1, where rabbitmq:1.1 is the default value. Since Acto considers null != rabbitmq:1.1, it reports an alarm.

Solution

When doing the comparison, Acto treats a null value as a wildcard that matches any value.
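
The wildcard rule is small enough to sketch directly. The function names and the (prev, curr) tuple representation are assumptions for illustration.

```python
def values_match(input_value, state_value):
    """A null input value matches any system state value."""
    if input_value is None:
        return True
    return input_value == state_value


def delta_matches(input_delta, state_delta):
    """Compare (prev, curr) pairs element by element with the wildcard rule."""
    return all(values_match(i, s) for i, s in zip(input_delta, state_delta))


# The example above: the input goes to null, the state falls back to the default.
no_alarm = delta_matches(("rabbitmq:1.2", None), ("rabbitmq:1.2", "rabbitmq:1.1"))
```

The trade-off is that a null input can no longer detect a wrong value in the system state, since anything matches.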

7. string matching

Example

Sometimes the input field value appears as a substring in the system state.
E.g. the value of the field ['spec', 'terminationGracePeriodSeconds'] appears in the container startup command:

"command": [
      "/bin/bash",
      "-c",
      "if [ ! -z \"$(cat /etc/pod-info/skipPreStopChecks)\" ]; then exit 0; fi; rabbitmq-upgrade await_online_quorum_plus_one -t 1024; rabbitmq-upgrade await_online_synchronized_mirror -t 1024; rabbitmq-upgrade drain -t 1024"
]

where the terminationGracePeriodSeconds == 1024

Solution

We also do substring matching if the value is not too simple (e.g. 1).
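
A sketch of the substring rule follows. What counts as "too simple" is not specified above, so the `min_length` threshold here is purely an assumption, as is the function name.

```python
def value_appears_in(input_value, state_value, min_length=2):
    """Exact match, or substring match when the input value is not too short."""
    needle = str(input_value)
    haystack = str(state_value)
    if needle == haystack:
        return True
    if len(needle) < min_length:
        # Values like "1" would match almost any string, so skip them.
        return False
    return needle in haystack


command = "rabbitmq-upgrade drain -t 1024"
found = value_appears_in(1024, command)
too_simple = value_appears_in(1, command)
```

A length threshold is the crudest possible filter; it is used here only to show where such a guard would sit.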

8. Wrong matching

Example

Acto matches fields by comparing field names: it finds the longest matching field in the system state. For example, spec.image.tag would be matched with pod.template.container[0].image.tag, because their paths share the last two components (image.tag), giving a matching length of 2. But sometimes the input field name is too general and gets matched with the wrong field in the system state.

Solution

There are only a few very generic field names, e.g. name, selector. We avoid field-name matching for these fields; instead, we directly look for a field in the system state that has the same prev and curr values as the input field.
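
Both strategies can be sketched together. This assumes that "matching length" means the length of the common path suffix, and the set of generic names, the delta format, and all function names are hypothetical.

```python
GENERIC_NAMES = {"name", "selector"}  # assumed list of overly generic field names


def common_suffix_length(a, b):
    """Number of trailing path components that two field paths share."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def match_field(input_path, input_delta, state_deltas):
    """Pick the state field for an input field; state_deltas maps path -> (prev, curr)."""
    if input_path[-1] in GENERIC_NAMES:
        # Generic name: fall back to matching on identical prev and curr values.
        for path, delta in state_deltas.items():
            if delta == input_delta:
                return path
        return None
    best, best_len = None, 0
    for path in state_deltas:
        n = common_suffix_length(input_path, path)
        if n > best_len:
            best, best_len = path, n
    return best


state_deltas = {
    ("pod", "template", "container", 0, "image", "tag"): ("1.0", "1.1"),
    ("service", "metadata", "name"): ("a", "a"),
    ("pod", "metadata", "name"): ("old", "new"),
}

by_name = match_field(("spec", "image", "tag"), ("1.0", "1.1"), state_deltas)
by_value = match_field(("spec", "name"), ("old", "new"), state_deltas)
```

Here `spec.name` would tie between two state fields under suffix matching, while value matching picks the one whose delta actually corresponds to the input change.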

9. Transient/Benign Error log

Example

Acto reports an alarm if it sees an error in the operator log. However, this causes a lot of false alarms, because most of the errors in the operator log are benign or transient. For example, at the beginning of cluster formation the operator cannot connect to the cluster yet, so it logs a connection refused error. This error is transient and does not indicate an actual problem.

Solution

We removed this error log oracle, since it hasn't been able to find any real bugs.

10. Application internal

Example

Some CR fields do have an effect on the application state, but the state is not reflected at the Kubernetes level. For example, the redis-operator communicates with the redis cluster directly through a redis client library and uses client calls to set configuration values. These configuration changes are not reflected at the Kubernetes level, so Acto cannot capture them.

Solution

This problem is solved as a byproduct of the analysis we do for category 1 (Field doesn't map to system state). Since these fields do not flow into k8s client calls, they are ignored when running the oracle.

11. deployment restarts pod with different name

Example

Deployment is a kind of Kubernetes resource that is usually used for stateless applications. The pods under a Deployment are interchangeable, so they have no order. The pod names contain a randomly generated hash, so a restarted pod gets a different name. This makes the delta look weird. For example, when Acto changes the CR field spec.image from image:1 to image:2, the operator restarts the pods with the updated image. The system state shows that one pod is deleted and a new pod is created. However, since the pod names are now different, we see one pod's image change from image:1 to null and a new pod's image change from null to image:2. This makes it hard for Acto to do the matching.

Solution

We treat every pod in Deployment equally, and match them by index rather than name.
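
A minimal sketch of index-based matching. The snapshot format, the pod names, and the choice to order pods by name before indexing are assumptions; any stable order works since the pods are interchangeable.

```python
def index_pods(pods):
    """Replace pod names with positional indices (ordered by name for determinism)."""
    return {i: pods[name] for i, name in enumerate(sorted(pods))}


def image_delta(prev_pods, curr_pods):
    """Per-index (prev, curr) image pairs between two snapshots of a Deployment."""
    prev, curr = index_pods(prev_pods), index_pods(curr_pods)
    return {
        i: (prev.get(i, {}).get("image"), curr.get(i, {}).get("image"))
        for i in sorted(set(prev) | set(curr))
    }


# Hypothetical snapshots: the pod is restarted under a new random name.
before = {"app-7d9f-abcde": {"image": "image:1"}}
after = {"app-5c8b-xyz12": {"image": "image:2"}}

delta = image_delta(before, after)
```

With name-based matching the same snapshots would produce two deltas (image:1 to null and null to image:2); with indices there is a single image:1 to image:2 change that lines up with the input.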


tianyin · 2022-08-17T03:31:15Z

Thanks for writing up the categories.

I think the key question is the quantitative results, which will essentially be our evaluation section. I don't have a good picture of the FP rate for each operator :( or of how we plan to reduce the rates that are high.

tylergu · 2022-08-17T03:55:24Z

We are updating the aggregated alarm sheet. I have finished the rabbitmq, Cassandra, zookeeper, percona-mongodb, and redis operators. I will help Yuxuan and Kunle since they are traveling recently.

I think for the evaluation section, we will need the number of false alarms in each category without any solutions, and the number of false alarms with each solution applied. Right now the static analysis is always applied after all tests have finished, so it is easy to compare the number of false alarms with and without static analysis, but some of the solutions are hard-coded in the oracle and applied every time. I will modify Acto to make these solutions configurable so that we can run Acto in different settings and compare the numbers.

tianyin · 2022-08-17T03:59:33Z

Look forward to the sheet.

> I will help Yuxuan and Kunle since they are traveling recently.

Thanks, @tylergu !

> we will need the number of false alarms in each category without any solutions, and number of false alarms with each solution applied

Exactly. And, the former is even more important than the latter.

Some of the recent numbers reported in #166 are quite worrisome, and we should look into them more.

tylergu added the documentation label Aug 17, 2022

tianyin closed this as completed Jul 4, 2023
