This issue documents the false alarm categories we encountered when applying Acto to the operators, and the solutions we use to reduce them.
1. Field doesn't map to system state
Example
The field ['spec', 'allowUnsafeConfigurations'] configures the operator to allow some unsafe configuration options. Changing this field changes the operator's behavior, but its value does not map to any field in the system state.
Solution
We use taint analysis to find the fields that do not directly map to a system state.
We first find the mapping between the CR fields and the variables in the program source code. Then we run taint analysis on these program variables to see whether they taint the k8s client calls. If a variable does not taint any k8s client call, the value of its field does not show up in the system state. We collect all the fields that do not flow into k8s client calls and ignore them when running the oracle.
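The filtering step can be pictured as a reachability check over data-flow edges. The sketch below is only an illustration: the variable names, the `flows` graph, and the sink name are hypothetical, and the real analysis works on the operator's source code rather than on a hand-written dictionary.

```python
from collections import deque

def fields_reaching_sinks(flows, field_vars, sinks):
    """Return the CR fields whose variables can flow into any sink.

    flows: dict mapping a variable to the variables it flows into.
    field_vars: dict mapping a CR field path (tuple) to its source variable.
    sinks: set of variables that are passed to k8s client calls.
    """
    reaching = set()
    for field, start in field_vars.items():
        seen, queue = {start}, deque([start])
        while queue:
            var = queue.popleft()
            if var in sinks:
                reaching.add(field)
                break
            for nxt in flows.get(var, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return reaching

# Hypothetical data-flow facts, for illustration only.
flows = {
    "cr.Spec.Image": ["podSpec.Image"],
    "podSpec.Image": ["client.Update.arg"],
    "cr.Spec.AllowUnsafeConfigurations": ["validateFlag"],
}
field_vars = {
    ("spec", "image"): "cr.Spec.Image",
    ("spec", "allowUnsafeConfigurations"): "cr.Spec.AllowUnsafeConfigurations",
}
sinks = {"client.Update.arg"}

reaching = fields_reaching_sinks(flows, field_vars, sinks)
# Fields that never reach a k8s client call are skipped by the oracle.
ignored = set(field_vars) - reaching
```

Here `ignored` ends up containing only the `allowUnsafeConfigurations` field, which is the kind of field this category is about.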
2. Field dependency
Example
The field ['spec', 'replsets', 0, 'storage', 'inMemory', 'engineConfig', 'inMemorySizeRatio'] is only effective if the field ['spec', 'replsets', 0, 'storage', 'storageType'] is inMemory.
Solution
We use dependency analysis to figure out the dependency information.
We reuse the mapping between the CR fields and the program variables from the previous taint analysis. Then we run dominator and post-dominator analysis to find the dominating if branches and their dominees, i.e. the program variables that are dominated by the if branch conditions. We only handle simple, direct if conditions, because we have to resolve the if condition, e.g.
if spec.Condition == some constant:
// do something
We are not able to resolve conditions that involve complex function calls, e.g.
if version.CompareVersion(spec.version, "1.12.0") < 0:
// do something
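To make the dominator idea concrete, here is a minimal sketch of the textbook iterative dominator computation on a tiny control-flow graph. The graph and node names are hypothetical, and this is not Acto's actual analysis code. A node dominated by the `then` block of a branch can only run when that branch's condition holds, which is how a dependency between two fields would show up.

```python
def dominators(cfg, entry):
    """Compute the dominator set of every node with the iterative algorithm.

    cfg: dict mapping each node to the list of its successors.
    """
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# Hypothetical CFG: "use_ratio" only runs when the storageType branch is taken.
cfg = {
    "entry": ["if_storage_type"],
    "if_storage_type": ["then", "exit"],
    "then": ["use_ratio"],
    "use_ratio": ["exit"],
    "exit": [],
}
dom = dominators(cfg, "entry")
```

In this graph `then` dominates `use_ratio` but not `exit`, so only the use of the ratio depends on the branch condition.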
3. Format
Example
Some fields are mapped to system state, but their format is changed. For example, the CR field spec.resources.request.cpu is set to 0.1, but in the system state it is formatted as 100m.
Solution
We first convert the values to the same format, then do the comparison.
We use the same utility function from the k8s library that is used for formatting to convert both the CR field and the system state field to the same format, and then do the comparison.
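As a rough illustration of comparing after normalization, the sketch below uses a small hand-written parser as a stand-in for the k8s library utility. The suffix table covers only a subset of quantity suffixes, and the function names are hypothetical.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; an assumption for illustration.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(1000),
    "M": Decimal(1000) ** 2,
    "G": Decimal(1000) ** 3,
    "Ki": Decimal(1024),
    "Mi": Decimal(1024) ** 2,
    "Gi": Decimal(1024) ** 3,
}

def canonical_quantity(value):
    """Convert a quantity such as 0.1, "100m" or "1Gi" to a plain Decimal."""
    text = str(value).strip()
    # Try two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)

def quantities_equal(cr_value, state_value):
    """Compare a CR value and a system state value after normalization."""
    return canonical_quantity(cr_value) == canonical_quantity(state_value)
```

With this, `quantities_equal("0.1", "100m")` holds, so the change from 0.1 in the CR to 100m in the system state is no longer treated as a mismatch.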
4. Ineffective input generation
Example
Acto sometimes generates ineffective input, changing a field from null to its default value.
For example, the field spec.replicas is changed from null to 2. Since the operator uses the default value when the field is null, changing the field from null to 2 causes no change.
Solution
We run static program analysis to extract the default values for fields. We use the pattern:
where the defaultImagePullPolicy is essentially a constant.
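Once the default values are known, the check itself is simple. The following is a minimal sketch; the `defaults` table is a hypothetical stand-in for the output of the static analysis, and the function name is made up for illustration.

```python
def is_ineffective_change(field_path, prev, curr, defaults):
    """True if the change only makes an implicit default explicit."""
    return (
        prev is None
        and field_path in defaults
        and defaults[field_path] == curr
    )

# Hypothetical result of the default-value extraction.
defaults = {("spec", "replicas"): 2}
```

A change of spec.replicas from null to 2 is flagged as ineffective here, while a change from null to 3 is not.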
5. Invalid input
Example
Acto generates invalid input, which gets rejected by the operator and has no effect on the system state. However, Acto expects a corresponding change, and thus reports a false alarm.
For example, when Acto changes the field spec.secretName to some random string, the operator finds that there is no secret with the corresponding name and rejects the input. Acto expects a change, so it reports a false alarm.
Solution
We try to recognize whether an input is invalid by collecting feedback from the operator.
We capture the error messages in the operator log and analyze them. We check whether an error message contains the field name or the value of the input change. If it does, Acto determines that the previous input is invalid and skips running the oracle.
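A minimal sketch of this log check is shown below. It assumes that error messages can be recognized by the word "error" in the log line, which is a simplification; the function name and the log lines are hypothetical.

```python
def looks_rejected(log_lines, field_path, new_value):
    """Guess whether the operator rejected the last input change.

    An error line that mentions the changed field's name or its new
    value is taken as a sign that the input was invalid.
    """
    field_name = str(field_path[-1])
    value_text = str(new_value)
    for line in log_lines:
        # Assumption: error messages contain the word "error".
        if "error" not in line.lower():
            continue
        if field_name in line or (value_text and value_text in line):
            return True
    return False

# Hypothetical operator log, for illustration only.
log = [
    "INFO reconciling cluster",
    'ERROR failed to get secret "xk2p9": not found',
]
```

For this log, a change of spec.secretName to xk2p9 is classified as rejected, so the oracle would be skipped for that step.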
6. null != default
Example
Acto sometimes changes a field from a value to null. However, the operator can have a default value for the field when it is null. When doing the value comparison in the oracle, Acto thinks null != default value and reports a false alarm.
e.g. The field spec.image is changed from rabbitmq:1.2 to null, but the corresponding system state is changed from rabbitmq:1.2 to rabbitmq:1.1, where rabbitmq:1.1 is the default value. Since Acto thinks null != rabbitmq:1.1, it reports an alarm.
Solution
When doing the comparison, Acto treats null values as wildcards that can be matched with any value.
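In code, the wildcard rule amounts to a one-line special case. This sketch assumes the wildcard applies to a null on the input side only; the function name is hypothetical.

```python
def values_match(input_value, state_value):
    """Compare an input value with a system state value.

    A null (None) input is treated as a wildcard, because the operator
    may substitute a default that Acto does not know about.
    """
    if input_value is None:
        return True
    return input_value == state_value
```

With this rule, a null input compared against rabbitmq:1.1 in the system state counts as a match instead of raising an alarm.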
7. string matching
Example
Sometimes the input field value appears as a substring in the system state.
e.g. The value of the field ['spec', 'terminationGracePeriodSeconds'] appears in the container startup command, where terminationGracePeriodSeconds == 1024.
Solution
We also do substring matching if the value is not too simple (e.g. 1).
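A minimal sketch of the substring rule follows. The length threshold used to decide that a value is "too simple" is an assumption, as are the function name and the example command.

```python
def substring_match(input_value, state_value, min_length=2):
    """Check whether the input value appears inside a system state value.

    Very short values such as "1" would match almost anything, so they
    are rejected. The threshold of 2 characters is an assumption.
    """
    needle = str(input_value)
    if len(needle) < min_length:
        return False
    return needle in str(state_value)
```

For example, `substring_match(1024, "sleep 1024")` succeeds, while `substring_match(1, "sleep 10")` is rejected because the value is too short to be meaningful.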
8. Wrong matching
Example
Acto does matching by comparing field names: it finds the field in the system state with the longest match. For example, spec.image.tag would be matched with pod.template.container[0].image.tag, because these two fields have a matching length of 2. But sometimes the input field name is too general and gets matched with the wrong field in the system state.
Solution
There are only a few very generic field names, e.g. name, selector. We avoid using field name matching for these fields. Instead, we directly try to find a field in the system state that has the same prev and curr values as the input field.
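The two matching strategies can be sketched together as below. This assumes that "matching length" means the length of the common path suffix, which fits the spec.image.tag example; the function names and the shape of `state_deltas` are hypothetical.

```python
GENERIC_NAMES = {"name", "selector"}

def common_suffix_length(a, b):
    """Number of trailing path components shared by two field paths."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n

def find_match(input_path, prev, curr, state_deltas):
    """Find the system state field that corresponds to an input change.

    state_deltas: dict mapping a state field path (tuple) to (prev, curr).
    """
    if input_path[-1] in GENERIC_NAMES:
        # Generic names: match on the changed values instead of the name.
        for path, change in state_deltas.items():
            if change == (prev, curr):
                return path
        return None
    best = max(
        state_deltas,
        key=lambda p: common_suffix_length(input_path, p),
        default=None,
    )
    if best is None or common_suffix_length(input_path, best) == 0:
        return None
    return best

# Hypothetical system state delta, for illustration only.
state_deltas = {
    ("pod", "template", "container", 0, "image", "tag"): ("1.0", "1.1"),
    ("pod", "metadata", "name"): ("a", "a"),
    ("service", "metadata", "name"): ("old", "new"),
}
```

In this example a change to a `name` field from old to new is matched to the service's name by value, rather than to whichever `name` field happens to come first.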
9. Transient/Benign Error log
Example
Acto reports an alarm if it sees an error log in the operator log. However, this causes a lot of false alarms, because most of the error logs in the operator log are benign or transient. For example, at the beginning of the cluster formation, the operator cannot connect to the cluster yet, so it writes a connection refused error. This error is transient and does not indicate an actual problem.
Solution
We removed this error log oracle, since it hasn't been able to find any real bugs.
10. Application internal
Example
Some CR fields do have an effect on the application state, but the state is not reflected at the Kubernetes level. For example, the redis-operator communicates with the redis cluster directly using a redis client library, and uses client calls to set the configuration values. These configuration changes are not reflected at the Kubernetes level, so Acto cannot capture them.
Solution
This problem is solved as a byproduct of the analysis we do for category 1 (Field doesn't map to system state). Since these fields do not flow into k8s client calls, they are ignored when running the oracle.
11. deployment restarts pod with different name
Example
Deployment is a kind of resource in Kubernetes which is usually used for stateless applications. The pods under a Deployment are interchangeable, so they do not have an order. The pod names contain a randomly generated hash, so if a pod is restarted, it gets a different name. This makes the delta look weird. For example, when Acto changes the CR field spec.image from image:1 to image:2, the operator restarts the pods with the updated image. The system state will show that a pod is deleted and a new pod is created. However, since the pod names are different now, we will see one pod's image changed from image:1 to null, and a new pod's image changed from null to image:2. This makes it hard for Acto to do the matching.
Solution
We treat every pod in Deployment equally, and match them by index rather than name.
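A minimal sketch of index-based matching is shown below. It reduces each pod to its image for brevity and orders pods by sorted name to obtain an index, both of which are assumptions made for illustration; the pod names are hypothetical.

```python
def diff_deployment_pods(prev_pods, curr_pods):
    """Diff Deployment pods by position instead of by name.

    prev_pods / curr_pods: dicts mapping a pod name to its image.
    Returns (index, before, after) for every position that changed.
    """
    prev_images = [prev_pods[name] for name in sorted(prev_pods)]
    curr_images = [curr_pods[name] for name in sorted(curr_pods)]
    return [
        (index, before, after)
        for index, (before, after) in enumerate(zip(prev_images, curr_images))
        if before != after
    ]

# Hypothetical pod names with different hashes before and after the restart.
prev = {"app-7d9f-abcde": "image:1", "app-7d9f-fghij": "image:1"}
curr = {"app-5c6b-klmno": "image:2", "app-5c6b-pqrst": "image:2"}
delta = diff_deployment_pods(prev, curr)
```

The resulting delta shows image:1 changing to image:2 at each index, instead of a pair of changes to and from null.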
I think the key question is the quantitative results, which essentially will be our evaluation section. I don't have a good picture of the FP rate for each operator :( and of how we plan to reduce the rates that are high.
We are updating the aggregated alarm sheet. I have finished the rabbitmq, Cassandra, zookeeper, percona-mongodb, and redis operators. I will help Yuxuan and Kunle, since they have been traveling recently.
I think for the evaluation section, we will need the number of false alarms in each category without any solutions, and the number of false alarms with each solution applied. Right now the static analysis is always applied after all tests have finished, so it is easy to compare the number of false alarms with and without static analysis, but some of the solutions are hard-coded in the oracle and applied every time. I will modify Acto to make these solutions configurable, so that we can run Acto in different settings and compare the numbers.