kubevirt,presubmits: update storage lanes to run with etcd in mem #3626

Merged 2 commits on Sep 19, 2024

@brianmcarey
Member

brianmcarey commented Sep 4, 2024

What this PR does / why we need it:

The storage e2e lanes see a large number of failures due to etcd timeouts [1]. Running these lanes with etcd in memory should stop the timeouts, as we have seen with the other e2e lanes.

We saw instability when running with the default tmpfs size of 512M: the tmpfs was filling up and causing etcd to restart.

{"level":"fatal","ts":"2024-09-18T12:16:27.580544Z","caller":"etcdserver/raft.go:224","msg":"failed to save Raft hard state and entries","error":"no space left on device","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*raftNode).start.func1\n\tgo.etcd.io/etcd/server/v3/etcdserver/raft.go:224"}

Increasing the tmpfs size to 1G greatly improves stability.

[1] https://search.ci.kubevirt.io/?search=etcdserver%3A+request+timed+out&maxAge=336h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
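
For triaging similar failures, here is a minimal Python sketch of how the fatal log entry quoted above could be detected in a build log. The function name `is_out_of_space` is hypothetical and not part of any KubeVirt tooling; the sketch only assumes that etcd emits one JSON object per log line, as in the quoted entry.

```python
import json

# The fatal log line quoted in the PR description, verbatim.
# A raw string keeps the \n and \t escapes intact for the JSON parser.
LOG_LINE = r'''{"level":"fatal","ts":"2024-09-18T12:16:27.580544Z","caller":"etcdserver/raft.go:224","msg":"failed to save Raft hard state and entries","error":"no space left on device","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*raftNode).start.func1\n\tgo.etcd.io/etcd/server/v3/etcdserver/raft.go:224"}'''


def is_out_of_space(line: str) -> bool:
    """Return True if a JSON log line is a fatal 'no space left on device' entry."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return False  # not a structured log line
    if not isinstance(entry, dict):
        return False
    is_fatal = entry.get("level") == "fatal"
    out_of_space = "no space left on device" in str(entry.get("error", ""))
    return is_fatal and out_of_space


print(is_out_of_space(LOG_LINE))  # True
```

Running a check like this over a lane's log would help separate a full tmpfs from the request timeouts tracked in [1].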

@kubevirt-bot kubevirt-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2024
@kubevirt-bot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Sep 4, 2024
@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage

You can trigger rehearsal for all jobs by commenting either /rehearse or /rehearse all
on this PR.

For a specific PR you can comment /rehearse {job-name}.

For a list of jobs that you can rehearse you can comment /rehearse ?.

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.31-sig-storage

@akalenyu
Contributor

akalenyu commented Sep 9, 2024

@brianmcarey how is this looking? I think we're open minded about trying in mem etcd for a while to alleviate failures
@mhenriks wdyt

@brianmcarey
Member Author

Not 100% sure on the stability of this - I want to run a few more rehearsals to see first. I checked the memory usage and it looks ok for the test pod with this enabled.

@akalenyu
Contributor

akalenyu commented Sep 9, 2024

Correct me if I'm wrong, but the test job definitions are also uncapped on memory size (only a request is specified)

@brianmcarey
Member Author

Yes, they do not have any memory limits set.
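
To illustrate the distinction being discussed, here is a small Python sketch that checks whether a container spec sets a memory limit or only a request. The dictionaries follow the standard Kubernetes `resources.requests` / `resources.limits` layout; the container name and the `8Gi` values are made up for illustration and are not taken from the actual job definitions.

```python
def has_memory_limit(container: dict) -> bool:
    """Return True if the container spec sets resources.limits.memory."""
    resources = container.get("resources") or {}
    limits = resources.get("limits") or {}
    return "memory" in limits


# Hypothetical container spec: only a memory request is specified, no limit.
request_only = {
    "name": "example-test-container",
    "resources": {"requests": {"memory": "8Gi"}},
}

# The same hypothetical spec with an explicit cap added, for comparison.
capped = {
    "name": "example-test-container",
    "resources": {"requests": {"memory": "8Gi"}, "limits": {"memory": "8Gi"}},
}

print(has_memory_limit(request_only))  # False
print(has_memory_limit(capped))        # True
```

With only a request set, the spec itself places no cap on the container's memory use, which is the situation confirmed above.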

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.31-sig-storage

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.31-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.31-sig-storage

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 11, 2024
@brianmcarey brianmcarey force-pushed the storage-etcd-in-mem branch 2 times, most recently from 75affff to 505a0ea Compare September 12, 2024 12:54
@kubevirt-bot kubevirt-bot removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XS labels Sep 12, 2024
@kubevirt-bot kubevirt-bot removed the request for review from iholder101 September 19, 2024 09:17
@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.31-sig-storage

@akalenyu
Contributor

akalenyu left a comment

Would it make sense to also apply this on 1.3/1.2 lanes?

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2024
@kubevirt-bot kubevirt-bot added size/M and removed lgtm Indicates that a PR is ready to be merged. size/S labels Sep 19, 2024
@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage-1.3
rehearsal-pull-kubevirt-e2e-k8s-1.28-sig-storage-1.3
rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage-1.3
rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.31-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.30-sig-storage
rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage-1.2
rehearsal-pull-kubevirt-e2e-k8s-1.28-sig-storage-1.2
rehearsal-pull-kubevirt-e2e-k8s-1.27-sig-storage-1.2

@dhiller
Contributor

dhiller left a comment

/approve

Thank you @brianmcarey !

Release

/hold

after the rehearsals are good.

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2024
@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2024
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhiller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2024
@akalenyu
Contributor

/lgtm

@xpivarc
Member

xpivarc commented Sep 19, 2024

@dhiller I think we are good to unhold

@akalenyu
Contributor

@brianmcarey: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| rehearsal-pull-kubevirt-e2e-k8s-1.29-sig-storage-1.2 | ead44d6 | link | unknown | /test pull-kubevirt-e2e-k8s-1.29-sig-storage-1.2 |
| rehearsal-pull-kubevirt-e2e-k8s-1.28-sig-storage-1.2 | ead44d6 | link | unknown | /test pull-kubevirt-e2e-k8s-1.28-sig-storage-1.2 |

I don't think the 1.2 lanes are failing because of anything etcd- or storage-related. It looks like an error occurs in guest-console-log.

@brianmcarey
Member Author

/hold cancel

Rehearsals look good.

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2024
@kubevirt-bot kubevirt-bot merged commit 79224ba into kubevirt:main Sep 19, 2024
14 checks passed
@kubevirt-bot
Contributor

@brianmcarey: Updated the job-config configmap in namespace kubevirt-prow at cluster default using the following files:

  • key kubevirt-presubmits-1.2.yaml using file github/ci/prow-deploy/files/jobs/kubevirt/kubevirt/kubevirt-presubmits-1.2.yaml
  • key kubevirt-presubmits-1.3.yaml using file github/ci/prow-deploy/files/jobs/kubevirt/kubevirt/kubevirt-presubmits-1.3.yaml
  • key kubevirt-presubmits.yaml using file github/ci/prow-deploy/files/jobs/kubevirt/kubevirt/kubevirt-presubmits.yaml

akalenyu added a commit to akalenyu/project-infra that referenced this pull request Nov 21, 2024
this slipped but is definitely desired:
kubevirt#3626

Signed-off-by: Alex Kalenyuk <[email protected]>
kubevirt-bot pushed a commit that referenced this pull request Nov 21, 2024
this slipped but is definitely desired:
#3626

Signed-off-by: Alex Kalenyuk <[email protected]>