Shuffle hardware inventory for tinkerbell before reservation #8264

Merged: 2 commits, Jul 23, 2024

rahulbabu95
Member

Description of changes:
Shuffle the hardware inventory before reserving hardware for Tinkerbell E2E tests. As we run quick E2E tests more frequently, the boot lists fill up quickly, leading to an error when there is no space left to add a new boot entry. Ideally we should have automation that periodically removes the boot entries on the BMCs, but until then we should reserve hardware in random order for quick tests so that the first few machines do not accumulate all the boot entries. Randomizing the order also reduces the likelihood of repeatedly picking up a faulty machine across repeated quick E2E runs.

Testing (if applicable):
Kicked off a run against my branch and verified that the hardware reserved for the test was different from the usual hardware (eksa-ci01 to eksa-ci12) that gets reserved at present.

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot eks-distro-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 8, 2024

codecov bot commented Jun 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.48%. Comparing base (d485120) to head (85a4cae).
Report is 112 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8264      +/-   ##
==========================================
+ Coverage   73.42%   73.48%   +0.06%     
==========================================
  Files         578      578              
  Lines       36054    36489     +435     
==========================================
+ Hits        26471    26814     +343     
- Misses       7905     7956      +51     
- Partials     1678     1719      +41     

Member

@g-gaston g-gaston left a comment

overall lgtm, only nit comments

Have we already planned the work to clean up the boot entries? I'm totally ok merging this, it's a good patch, but it doesn't guarantee the problem won't happen again. In fact, if I'm understanding this correctly, it will 100% happen, it will just take longer. And it doesn't seem like an easy issue to diagnose.

@@ -592,3 +596,10 @@ func logTinkerbellTestHardwareInfo(conf *instanceRunConf, action string) {
}
conf.Logger.V(1).Info(action+" hardware for TestRunner", "hardwarePool", strings.Join(hardwareInfo, ", "))
}

func shuffleHardwareInventory(invCatalogue *hardwareCatalogue) {
Member

nit: why the use of inventory and catalogue? aren't they representing the same thing?

@@ -217,6 +218,9 @@ func RunTests(conf instanceRunConf, inventoryCatalogue map[string]*hardwareCatal
} else {
hardwareCatalogue = inventoryCatalogue[nonAirgappedHardware]
}
conf.Logger.Info("Shuffling hardware inventory for tinkerbell")
// shuffle hardware to introduce randomness during hardware reservation.
Member

nit: I would expand more on why randomness is desired. We don't do this to introduce randomness, we do this to avoid picking up the same machines on every run. Randomness is just the mechanism to achieve that goal.

@@ -592,3 +596,10 @@ func logTinkerbellTestHardwareInfo(conf *instanceRunConf, action string) {
}
conf.Logger.V(1).Info(action+" hardware for TestRunner", "hardwarePool", strings.Join(hardwareInfo, ", "))
}

func shuffleHardwareInventory(invCatalogue *hardwareCatalogue) {
Member

Why not make this a method on hardwareCatalogue? It's manipulating the internal structure, so it seems like a good idea to abstract that in a method instead of exposing it like this.

In fact, don't you need to use the mutex? If I'm not mistaken, the hardwareCatalogue is shared between runner threads and all of them are going to try to call this method concurrently.
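
A rough sketch of what this suggestion could look like, under stated assumptions: the fields `mu` and `hws` and the methods `shuffle` and `reserve` are hypothetical, and the real `hardwareCatalogue` in the repository may be shaped differently. The point is only that the shuffle and the reservation take the same lock, so concurrent runners never see the slice mid-swap.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// hardware is a hypothetical stand-in for one inventory entry.
type hardware struct {
	Hostname string
}

// hardwareCatalogue is a simplified, assumed shape: a mutex guarding a slice.
type hardwareCatalogue struct {
	mu  sync.Mutex
	hws []hardware
}

// shuffle randomizes the remaining inventory while holding the lock.
func (c *hardwareCatalogue) shuffle() {
	c.mu.Lock()
	defer c.mu.Unlock()
	rand.Shuffle(len(c.hws), func(i, j int) {
		c.hws[i], c.hws[j] = c.hws[j], c.hws[i]
	})
}

// reserve removes n entries from the front of the inventory and returns them.
func (c *hardwareCatalogue) reserve(n int) ([]hardware, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if n > len(c.hws) {
		return nil, fmt.Errorf("requested %d hardware, only %d available", n, len(c.hws))
	}
	// Copy so callers do not share a backing array with the catalogue.
	reserved := make([]hardware, n)
	copy(reserved, c.hws[:n])
	c.hws = c.hws[n:]
	return reserved, nil
}

func main() {
	catalogue := &hardwareCatalogue{}
	for i := 1; i <= 12; i++ {
		catalogue.hws = append(catalogue.hws, hardware{Hostname: fmt.Sprintf("eksa-ci%02d", i)})
	}

	// Four runners shuffle and reserve concurrently, as the test runners might.
	var wg sync.WaitGroup
	for r := 0; r < 4; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			catalogue.shuffle()
			if _, err := catalogue.reserve(2); err != nil {
				fmt.Println("reserve failed:", err)
			}
		}()
	}
	wg.Wait()

	fmt.Println("remaining:", len(catalogue.hws)) // prints "remaining: 4"
}
```

Keeping the slice unexported and reachable only through methods means callers cannot forget the lock, which is the encapsulation benefit raised in the comment above.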

Member Author

Good catch! I should have seen this.

@rahulbabu95
Member Author

> overall lgtm, only nit comments
>
> Have we already planned the work to clean up the boot entries? I'm totally ok merging this, it's a good patch, but it doesn't guarantee the problem won't happen again. In fact, if I'm understanding this correctly, it will 100% happen, it will just take longer. And it doesn't seem like an easy issue to diagnose.

I think long term, we were looking into a solution to automate the cleanup of hardware. Jacob has a runbook on how to do it manually, and we still have to figure out some details there before it can be fully automated. I can create an issue on the CI board to see whether that needs to be tracked.

Member

@g-gaston g-gaston left a comment

/lgtm

@rahulbabu95
Member Author

/approve

@eks-distro-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rahulbabu95

@eks-distro-bot eks-distro-bot merged commit 7036dcb into aws:main Jul 23, 2024
12 checks passed
@sp1999
Member

sp1999 commented Oct 24, 2024

/cherry-pick release-0.20

@eks-distro-pr-bot
Contributor

@sp1999: new pull request created: #8898

Labels
approved lgtm size/S Denotes a PR that changes 10-29 lines, ignoring generated files.