Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/zenko 4912 #2160

Merged
merged 3 commits into from
Oct 9, 2024
Merged

Bugfix/zenko 4912 #2160

merged 3 commits into from
Oct 9, 2024

Conversation

benzekrimaha
Copy link
Contributor

@benzekrimaha benzekrimaha commented Oct 3, 2024

Issue : ZENKO-4912

@bert-e
Copy link
Contributor

bert-e commented Oct 3, 2024

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@scality scality deleted a comment from bert-e Oct 3, 2024
@bert-e
Copy link
Contributor

bert-e commented Oct 3, 2024

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

@benzekrimaha benzekrimaha marked this pull request as ready for review October 3, 2024 12:56
@benzekrimaha benzekrimaha force-pushed the bugfix/ZENKO-4912 branch 2 times, most recently from 6b3899b to 9592b40 Compare October 7, 2024 19:36
@@ -1,4 +1,4 @@
VERSION="2.10.1"
VERSION="2.10.2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to rebase: this should be the last commit of the PR (where we will make the release)

@benzekrimaha benzekrimaha force-pushed the bugfix/ZENKO-4912 branch 2 times, most recently from aa04696 to 700ccd0 Compare October 8, 2024 10:38
@@ -183,10 +183,10 @@ groups:

- alert: MongoDbRSNotSynced
expr: |
sum by (rs_nm) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1)
group by(rs_nm)(count by (rs_nm, pod)(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1) )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent to improve readabiltiy...

Suggested change
group by(rs_nm)(count by (rs_nm, pod)(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1) )
group by(rs_nm) (
count by(rs_nm, pod) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"})
!= (${replicas} - 1)
)

monitoring/mongodb/alerts.yaml Outdated Show resolved Hide resolved
- series: mongodb_rs_members_state{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-2", member_state="SECONDARY", rs_nm="data-db-mongodb-sharded-shard-0", member_idx="data-db-mongodb-sharded-shard0-data-1.data-db-mongodb-sharded-headless.zenko.svc.cluster.local:27017"}
values: 2x10
- series: mongodb_rs_members_state{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-2", member_state="(not reachable/healthy)", rs_nm="data-db-mongodb-sharded-shard-0", member_idx="data-db-mongodb-sharded-shard0-data-2.data-db-mongodb-sharded-headless.zenko.svc.cluster.local:27017"}
values: 2 _ _ _ _ _ _ _ _ _
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when this serie disappears, the other series will actually change as well: as the remaining pods will not report on its status anymore... and the state reporte looks strange :/

  • On "startup", we should have series : everything is running fine (3 pods having the status of 3 members)
  • Then one pod "crashes" : the 3 associated series stop
  • Then the other pods update their "vision" of the cluster : they either report a different member state (not reachable?) or drop the member completely (not sure exactly what happens)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from what I saw the state is (not reachable/healthy) , the test was updated accordingly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every metric from pod="data-db-mongodb-sharded-shard0-data-2" should be removed when we simulate the pod crashing...

so i guess time series should look something like this:

# First pod
- series: member_state{pod="data-0", state="PRIMARY", member_idx="data-0"}
   values: 1x10 1x10
- series: member_state{pod="data-0", state="SECONDARY", member_idx="data-1"}
   values: 2x10 2x10
- series: member_state{pod="data-0", state="SECONDARY", member_idx="data-2"}
   values: 2x10 stale      # `stale` (i.e. not present) after the 10s sample
- series: member_state{pod="data-0", state="(not reachable/healthy)", member_idx="data-2"}
   values: _x10 8x10      # appears from the 11s sample
# Second pod
- series: member_state{pod="data-1", state="PRIMARY", member_idx="data-0"}
   values: 1x10 1x10
- series: member_state{pod="data-1", state="SECONDARY", member_idx="data-1"}
   values: 2x10 2x10
- series: member_state{pod="data-1", state="SECONDARY", member_idx="data-2"}
   values: 2x10 stale      # `stale` (i.e. not present) after the 10s sample
- series: member_state{pod="data-1", state="(not reachable/healthy)", member_idx="data-2"}
   values: _x10 8x10      # appears from the 11s sample
# Third pod : stops responding after 10th sample
- series: member_state{pod="data-2", state="PRIMARY", member_idx="data-0"}
   values: 1x10 stale
- series: member_state{pod="data-2", state="SECONDARY", member_idx="data-1"}
   values: 2x10 stale
- series: member_state{pod="data-2", state="SECONDARY", member_idx="data-2"}
   values: 2x10 stale

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Providing this : series: member_state{pod="data-0", state="SECONDARY", member_idx="data-2"}
values: 2x10 stale # stale (i.e. not present) after the 10s sample

will cause the expression to return 2 => expected number of secondaries as it's not taking the state="(not reachable/healthy)" metric

Copy link
Contributor

@francoisferrand francoisferrand Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, stale removes the metric series, which should be what really happens:

  • the serie with state label equal to secondary (and value 2) stops
  • another serie starts with state label equal to (not reachable/healthy) and value 8

(since one label value changes, this is really another serie : even if [as human] we think it is the same one...)

@benzekrimaha benzekrimaha force-pushed the bugfix/ZENKO-4912 branch 3 times, most recently from 6098f81 to 36a0a30 Compare October 8, 2024 13:53
Copy link
Contributor

@francoisferrand francoisferrand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but alert test fail...

Copy link
Contributor

@francoisferrand francoisferrand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor test improvements : please rebase/cleanup before approving

MongoDbRSNotSynced firing when it shouldn't because today we sum up the members state ,
as we are calculating the number of secondaries, we end up with a higher value |
than the expected one to have the right value an additional filtering based
on the instance have been introduced as well
Issue: ZENKO-4912
@benzekrimaha
Copy link
Contributor Author

/approve

@bert-e
Copy link
Contributor

bert-e commented Oct 8, 2024

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.10

The following branches will NOT be impacted:

  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@scality scality deleted a comment from bert-e Oct 9, 2024
@scality scality deleted a comment from bert-e Oct 9, 2024
@bert-e
Copy link
Contributor

bert-e commented Oct 9, 2024

I have successfully merged the changeset of this pull request
into targetted development branches:

  • ✔️ development/2.10

The following branches have NOT changed:

  • development/2.5
  • development/2.6
  • development/2.7
  • development/2.8
  • development/2.9

Please check the status of the associated issue ZENKO-4912.

Goodbye benzekrimaha.

@bert-e bert-e merged commit e0d7729 into development/2.10 Oct 9, 2024
14 checks passed
@bert-e bert-e deleted the bugfix/ZENKO-4912 branch October 9, 2024 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants