NO-ISSUE: Add filtered controller logs when timeout during installation. #2587

Open
wants to merge 1 commit into
base: master
from

Conversation

bkopilov
Contributor

@bkopilov bkopilov commented Dec 3, 2024

During CI runs we hit timeout failures that report no root cause. The default timeout is 3600 seconds.

We are going to increase the number of masters and installed operators, so we will probably hit timeouts and the current timers may not be enough.

Added support for filtering the last messages from the assisted controller log when a timeout occurs. If controller logs exist at timeout, we extract the last error/warning messages and report them as part of the raised exception.

We expect to see this additional information when the timeout exception bubbles up, with more details available from the raised exception.

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 3, 2024
@bkopilov
Contributor Author

bkopilov commented Dec 3, 2024


openshift-ci bot commented Dec 3, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bkopilov
Once this PR has been reviewed and has the lgtm label, please assign avishayt for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 3, 2024

openshift-ci bot commented Dec 3, 2024

Hi @bkopilov. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danmanor
Contributor

danmanor commented Dec 4, 2024

@bkopilov Do we have a way to reproduce the issue and check that it works?

@bkopilov
Contributor Author

bkopilov commented Dec 4, 2024

@bkopilov Do we have a way to reproduce the issue and check that it works?

Yes, I changed the timeout to 10 minutes.
The flow is: on a timeout exception (or any other raised exception), try to get the assisted-controller.log:

  • if it is not available, catch the exception and return []
  • if it is available, extract and filter the messages; if an error occurs, the exception should be caught and the result set to []
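
The flow above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `filter_controller_log`, the `last_messages` parameter, and the `level=error` / `level=warning` matching are assumptions based on the log lines quoted in this thread. Returning the newest line first is also an assumption.

```python
import re
from pathlib import Path

# Assumption: controller log lines carry a logfmt-style "level=..." field.
LEVEL_PATTERN = re.compile(r"level=(error|warning)")


def filter_controller_log(log_path, last_messages=3):
    """Return the last error/warning lines (newest first), or [] on failure."""
    try:
        text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    except OSError:
        # The log was not collected or cannot be read: nothing to report.
        return []
    try:
        lines = text.splitlines(keepends=True)
        matches = [line for line in lines if LEVEL_PATTERN.search(line)]
        # Keep only the most recent entries and put the newest one first.
        return matches[-last_messages:][::-1]
    except Exception:
        # Filtering must never mask the original timeout, so fall back to [].
        return []
```

The helper swallows its own errors on purpose, so a missing or malformed log can never replace the original timeout as the reported failure.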

@bkopilov
Contributor Author

bkopilov commented Dec 4, 2024

Example from a run:

      raise Exception(e.__class__.__name__, e.__dict__)

E Exception: ('TimeoutExpired', {'_timeout_seconds': 60, '_what': "Monitored ['builtin'] operators to be in of the statuses ['available']", 'notes': ["Failed to deploy the following operators ['console', 'cvo']"], 'filter_message': '['time="2024-12-04T10:36:39Z" level=error msg="Failed to get list of nodes from k8s client" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:256" error="the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)" request_id=a1b43260-4693-49f8-9c1f-ec8b1cd08044\n']'})

src/assisted_test_infra/test_infra/helper_classes/cluster.py:706: Exception
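
For readers unfamiliar with the pattern in the traceback, the sketch below shows how a `raise Exception(e.__class__.__name__, e.__dict__)` line can surface extra attributes such as `filter_message`. The `TimeoutExpired` class here is a stand-in defined only for illustration, and `raise_with_log_context` is a hypothetical helper; the actual code in cluster.py may differ.

```python
class TimeoutExpired(Exception):
    """Illustrative stand-in for the timeout exception shown in the traceback."""

    def __init__(self, timeout_seconds, what):
        super().__init__(timeout_seconds, what)
        self._timeout_seconds = timeout_seconds
        self._what = what


def raise_with_log_context(error, filtered_lines):
    """Attach filtered log lines to the error, then re-raise it generically."""
    # Assumption: the lines are stored as the string form of the list,
    # which matches the quoted 'filter_message' value in the output above.
    error.filter_message = str(filtered_lines)
    # Instance attributes live in __dict__, so the new field is reported too.
    raise Exception(error.__class__.__name__, error.__dict__)


try:
    raise_with_log_context(
        TimeoutExpired(60, "operators to be available"),
        ['level=error msg="Failed to get list of nodes"\n'],
    )
except Exception as exc:
    name, details = exc.args
    print(name, sorted(details))
```

Because the re-raised exception carries the whole `__dict__`, any attribute set on the original error before the re-raise shows up in the CI report without further changes to the reporting code.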

@danmanor
Contributor

danmanor commented Dec 4, 2024

@bkopilov when the PR is ready, let's run the tests

@bkopilov bkopilov force-pushed the contoller_timeout_last branch from 0ef023b to e7c61ba Compare December 4, 2024 11:52
@bkopilov
Contributor Author

bkopilov commented Dec 4, 2024

@bkopilov when the PR is ready, let's run the tests

It's ready, I will start running the full CI with it.

@bkopilov bkopilov marked this pull request as ready for review December 4, 2024 11:54
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 4, 2024
@openshift-ci openshift-ci bot requested review from avishayt and rccrdpccl December 4, 2024 11:55
@bkopilov
Contributor Author

Hi, here is an example of a test where the installation failed:
https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-assisted-installer-virt-tf/21297/

@eifrach
Contributor

eifrach commented Dec 11, 2024

/retitle NO-ISSUE: Add filtered controller logs when timeout during installation.

@openshift-ci openshift-ci bot changed the title Add filtered controller logs when timeout during installation. NO-ISSUE: Add filtered controller logs when timeout during installation. Dec 11, 2024
@openshift-ci-robot

@bkopilov: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 11, 2024
@bkopilov
Contributor Author

Tested in CI when there is a timeout:
https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-assisted-installer-virt-tf/21346//testReport/junit/test_lvms_multi/TestLVMSMulti/test_lvms_multi_with_mce_only_52000_20_3_36000_12_/

Exception: ('TimeoutExpired', {'_timeout_seconds': 3600, '_what': "Nodes to be in of the statuses ['installed']", 'filter_message': '[\'time="2024-12-12T14:43:11Z" level=error msg="Failed to get logs from kube-api, reading from file" func=github.com/openshift/assisted-installer/src/common.GetControllerPodLogs file="/remote-source/app/src/common/common.go:116" error="etcdserver: request timed out"\\n\', \'time="2024-12-12T14:42:18Z" level=error msg="Failed to get list of nodes from k8s client" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:256" error="the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)" request_id=56ae929a-e118-454d-860f-08f9ab09f1b9\\n\', \'time="2024-12-12T14:42:17Z" level=error msg="Failed to get list of CSRs." func=github.com/openshift/assisted-installer/src/k8s_client.k8sClient.ListCsrs file="/remote-source/app/src/k8s_client/k8s_client.go:367" error="the server was unable to return a response in the time allotted, but may still be processing the request (get certificatesigningrequests.certificates.k8s.io)"\\n\']'})

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 21, 2024
@openshift-merge-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants