[mdb-community] A summary of evaluation results for mongodb-community operator #166

Closed
taham0 opened this issue Aug 17, 2022 · 6 comments
Contributor

taham0 commented Aug 17, 2022

Evaluation Results

  • Acto ran 153 test cases over 77 fields
  • It produced about 120 alarms in the initial evaluation: 108 false alarms and 12 true alarms
  • It produced 61 alarms in the most recent evaluation: 48 false alarms and 3 true alarms
  • The remaining 10 alarms were produced without any output from the state oracle
  • The reduction in alarms can be largely attributed to the inclusion of secrets in Acto
False Alarms | 48
True Alarms | 3

False Alarm Category | Subtotal
Ineffective input generation | 17
Invalid input | 16
Input applied despite operator crash | 7
Field doesn't map to a state in application | 5
Null != default | 2
Inconsistent operator state | 1
Grand Total | 48

True Alarms

The initial evaluation reproduced 4 bugs. In the most recent evaluation, a total of 3 true alarms reproduced 2 of these bugs.

  1. Operator crashes in the face of an incomplete TLS configuration mongodb/mongodb-kubernetes-operator#1054
    • 2 of the alarms reproduced this bug.
  2. Operator crashes when the spec.security.modes list is empty mongodb/mongodb-kubernetes-operator#1055
    • 1 of the alarms reproduced this bug.
  3. MongoDB system is down and unable to recover when the featureCompatibilityVersion is not specified and changed to an invalid value mongodb/mongodb-kubernetes-operator#1072
    • None of the alarms reproduced this bug.
    • I have yet to find out why this was not reproduced.
  4. Changing scramCredentialsSecretName causes a resource leak mongodb/mongodb-kubernetes-operator#1074
    • None of the alarms reproduced this bug.
    • As discussed in Observability #160, this is because changing the scramCredentialsSecretName creates a new secret.
    • Now that secrets are included in Acto, the system delta matches the input delta, so no alarm is raised.
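
The effect described for bug 4 can be illustrated with a toy oracle. This is only a sketch under stated assumptions, not Acto's actual matching logic: the snapshot layout, the function names, and the secret names `scram-a` and `scram-b` are all hypothetical. It raises an alarm when the new input value appears nowhere in the changed system state, and shows how adding secrets to the tracked kinds turns the alarm off.

```python
from typing import Any, Dict, Set

# Hypothetical snapshot layout: resource kind -> {object name: object state}
Snapshot = Dict[str, Dict[str, Any]]


def changed_values(before: Snapshot, after: Snapshot, tracked_kinds: Set[str]) -> Set[str]:
    """Collect names and states that are new or different in `after`, for tracked kinds only."""
    changed: Set[str] = set()
    for kind in tracked_kinds:
        old_objs = before.get(kind, {})
        new_objs = after.get(kind, {})
        for name, state in new_objs.items():
            if old_objs.get(name) != state:
                changed.add(name)
                changed.add(str(state))
    return changed


def raises_alarm(new_input_value: str, before: Snapshot, after: Snapshot,
                 tracked_kinds: Set[str]) -> bool:
    """Alarm when the input changed but no tracked system state reflects the new value."""
    return new_input_value not in changed_values(before, after, tracked_kinds)


# Hypothetical states before and after changing scramCredentialsSecretName to "scram-b"
before = {"statefulset": {"mdb": "replicas=3"}, "secret": {"scram-a": "opaque"}}
after = {"statefulset": {"mdb": "replicas=3"},
         "secret": {"scram-a": "opaque", "scram-b": "opaque"}}

alarm_without_secrets = raises_alarm("scram-b", before, after, {"statefulset"})
alarm_with_secrets = raises_alarm("scram-b", before, after, {"statefulset", "secret"})
```

In this toy model the leaked old secret (`scram-a`) never contributes to the delta, which is consistent with the leak going unnoticed once the new secret produces a match.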

False Alarms

The initial evaluation had 108 false alarms. The number of false alarms dropped to 48 in the most recent evaluation.

  • 17 of these alarms occurred because of ineffective input generation
    • All of these alarms occurred because a non-nullable field was set to null, so the CR failed to be applied due to a validation error
  • 16 of these alarms occurred because of invalid input
    • 9 of these alarms occurred because process names were invalid.
      The spec.automationConfig contains an array of processes. Each process in the input specifies fields and values that override the matching process in the operator-created automationConfig by merging. Specifically, the operator searches the current processes for the process names given in the input. Since the input process names are invalid, no match is found, the input is not merged, and the automationConfig remains unchanged.
    • 7 of these alarms occurred because an invalid resource reference was provided.
      To enable TLS, a caCertificateSecret (or a caConfigMap) and a certificateKeySecret are required. None of these objects exist, so Acto provides an invalid reference to one or more of them, which the operator identifies and reports as a warning.
  • 7 of these alarms occurred because a new configuration was applied even though the operator had crashed due to a previous mutation
  • 5 of these alarms occurred because the field did not map to a state in the application
  • 2 of these alarms occurred because a null field was changed to its default value
  • 1 of these alarms occurred because of an inconsistent operator state caused by a previously applied configuration
    A previously applied configuration left the operator in an inconsistent state. Since the agent version does not match the goal state, the replicaSet is not ready. Consequently, the operator is unable to proceed with creating or updating the connectionStringSecret.
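
The merge-by-name behavior behind the 9 invalid-process-name alarms can be sketched as follows. This is a simplified Python illustration, not the operator's code: the function name, the shallow merge, and the process fields are assumptions made for the example.

```python
from typing import Any, Dict, List

Process = Dict[str, Any]


def merge_process_overrides(current: List[Process], overrides: List[Process]) -> List[Process]:
    """Merge each override into the current process with the same name.

    Overrides whose name matches no current process are skipped, so the
    resulting configuration is identical to the current one.
    """
    by_name = {p["name"]: dict(p) for p in current}  # copy so inputs stay untouched
    for override in overrides:
        target = by_name.get(override.get("name"))
        if target is None:
            continue  # unknown process name: nothing to merge into
        target.update(override)  # shallow merge, an assumption of this sketch
    return [by_name[p["name"]] for p in current]


# Hypothetical process entries
current = [{"name": "mdb-0", "disabled": False}, {"name": "mdb-1", "disabled": False}]

merged_valid = merge_process_overrides(current, [{"name": "mdb-0", "disabled": True}])
merged_invalid = merge_process_overrides(current, [{"name": "no-such-process", "disabled": True}])
```

With a generated name that matches nothing, `merged_invalid` equals `current`: the input changed but the system state did not, which is exactly the pattern that produces an alarm.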

Evaluation Result 28/08/22

The following changes were made:

  • Warn-level messages in the operator log are now included to eliminate alarms caused by invalid input
  • Seed CR changed
  • Format issue resolved for resource units (mapping 0.2 to 200m)
  • Failure to apply a configuration is now detected from the CLI output
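
The last change, detecting a failed apply from the CLI output, could look roughly like the sketch below. It is an assumption-laden illustration rather than Acto's implementation: the function name is hypothetical, and the two stand-in commands use the Python interpreter so the example runs without a cluster.

```python
import subprocess
import sys
from typing import List


def apply_was_rejected(cmd: List[str]) -> bool:
    """Run a CLI command and report whether it failed or wrote anything to stderr."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode != 0 or bool(result.stderr.strip())


# Stand-in commands that mimic a successful and a rejected apply
ok_cmd = [sys.executable, "-c", "print('configured')"]
bad_cmd = [sys.executable, "-c",
           "import sys; sys.stderr.write('validation error'); sys.exit(1)"]

ok_rejected = apply_was_rejected(ok_cmd)
bad_rejected = apply_was_rejected(bad_cmd)
```

Against a real cluster the command would be something like `["kubectl", "apply", "-f", "cr.yaml"]`, where `cr.yaml` is a placeholder file name. Treating any stderr output as a rejection may be too strict, since a CLI can also print non-fatal warnings there; checking the exit code alone is the more lenient alternative.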

True Alarms | 5
False Alarms | 13

False Alarm Category | Subtotal
Field doesn't map to a state in application | 4
CR applied over crashed operator | 4
Input value is the same as previous value / default value (in impact) | 2
Inconsistent operator state | 2
Invalid input | 1

Evaluation Result 31/08/22

True Alarms | 5
False Alarms | 4

False Alarm Category | Subtotal
Field doesn't map to a state in application | 2
Input value is the same as previous value / default value (in impact) | 1
Ineffective input generation | 1
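
For reference, the false-positive rate across the three evaluations can be computed from the counts above. The snippet is a small worked calculation; the labels are mine, and the 10 alarms that had no state-oracle output in the first evaluation are left out, so the rate is false alarms over classified alarms.

```python
def fp_rate(true_alarms: int, false_alarms: int) -> float:
    """Fraction of classified alarms that are false alarms."""
    return false_alarms / (true_alarms + false_alarms)


# (true alarms, false alarms) as reported in this issue
evaluations = {
    "first report": (3, 48),
    "28/08/22": (5, 13),
    "31/08/22": (5, 4),
}

rates = {label: round(100 * fp_rate(t, f), 1) for label, (t, f) in evaluations.items()}

for label, rate in rates.items():
    print(f"{label}: {rate}% false positives")
```

This gives roughly 94.1%, 72.2%, and 44.4% for the three evaluations.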

Member

tianyin commented Aug 17, 2022

Thanks @taham0 for the detailed explanation! The high FP rate is currently the key challenge for Acto. In this case, the FP rate is about 94% (48/51), which makes it hard to make the case for a usable tool. @tylergu has been working hard on reducing FPs, and the FPs have indeed been reduced by more than 2x. However, that does not seem to be sufficient at this point. I don't know what the results for the other operators look like. My guess is that their numbers are low, as the FP reduction was designed based on our understanding of the other operators (we likely have overfitting issues).

@tylergu @Essoz can you look into the FP?

Member

tylergu commented Aug 17, 2022

Thanks for the write-up. Let's set up a synchronous meeting to discuss the results.

Some of the false alarms seem to be true alarms, e.g. the 7 alarms caused because operator crashed.

@taham0, did you analyze the results directly after running Acto? These are the results without the static analysis support; I think some of the false alarms should be gone after applying the static analysis.

A lot of the false alarms seem to be due to invalid input. If the invalid input is also indicated at the warning level, then we should also try to capture the warning-level log.

The 17 false alarms caused by the input getting rejected are due to a bug in Acto that we discovered only recently. Acto should recognize an input as invalid if any error message appears in kubectl's stderr.

I am more curious about why the number of true alarms has decreased so much. Which true alarms can no longer be reproduced?

Contributor Author

taham0 commented Aug 17, 2022

@tylergu

  • As I mentioned, True Alarms # 3 and # 4 were not reproduced.
  • Yes, I realized we could eliminate the 17 FAs by fixing the bug; it's great that it has been fixed
  • I did apply the static analysis support; let's discuss it in the sync meeting
  • Let me clarify that the 7 alarms are not directly because the operator crashed (in that case they would be TAs). They are because the operator crashed in a previous mutation (due to one of the mentioned TAs) and then Acto applied a new configuration over the crashed operator
  • I am available for a meeting right now, or let me know whatever time is suitable for you

Contributor Author

taham0 commented Aug 17, 2022

@tianyin @tylergu
Since a large number of alarms were due to the Acto bug and to preceding operator crashes, I think the results can improve a lot. I am currently running the newest Acto and will report the new results as soon as the run completes, after our meeting later today.

Member

tianyin commented Aug 17, 2022

That's awesome!! Thank you for all the hard work @taham0 !

Contributor Author

taham0 commented Aug 31, 2022

The FP rate for mongodb-community-operator has dropped to 44.4% (4 of 9 alarms are FAs) after some improvements, and the latest two evaluation results have been added above. All bugs were reproduced and a by-product bug was found.

taham0 closed this as completed Aug 31, 2022