Garbage collection of NodeFeature objects #1305
This is a "mega-PR" also containing some refactoring. I could try to split out some smaller chunks if that would help the review.
/assign @ArangoGutierrez @PiotrProkop
Rebased, still depends on #1311 but otherwise ready
Looks good to me, but I will leave it for others to review as well. Once you think it has gone through enough reviews, please tag me and I will add the lgtm label.
// Handle NodeFeature objects
nfs, err := n.nfdClient.NfdV1alpha1().NodeFeatures("").List(context.TODO(), metav1.ListOptions{})
if errors.IsNotFound(err) {
	klog.V(2).InfoS("NodeFeature CRD does not exist")
nit: Perhaps should be CR instead of CRD?
In this case it is indeed the CRD. If you cannot list the resource (and the apiserver returns NotFound), it means the whole API resource type does not exist.
This is preparation for making it a generic garbage collector for all nfd-managed api objects.
Now all dependencies (i.e. PRs) have been merged and this is ready for review. Keeping in hold status until a few pairs of eyes have looked at this (thanks @fmuyassarov for yours)
Hook into the same logic already exercised for NodeResourceTopology objects: GC watches for node delete events and immediately drops stale objects (NRT and now also NF). In addition there is a periodic resync to catch any missed node deletes, once every hour by default.
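The two paths described in this commit message (immediate cleanup on a node delete event, plus a periodic resync that catches missed deletes) can be sketched roughly as follows. This is not the nfd-gc implementation: `gcStore` is a hypothetical in-memory stand-in for the API server, and the mapping from object name to owning node is an assumption made for the example. A real collector would call `resync` from a timer (hourly by default, per the message above) and `onNodeDelete` from a node informer.

```go
package main

import (
	"fmt"
	"sort"
)

// gcStore is a hypothetical in-memory stand-in for the API server.
// It maps an object name (NodeFeature or NodeResourceTopology) to the
// name of the node the object belongs to.
type gcStore struct {
	objects map[string]string
}

// onNodeDelete models the event-driven path: every object that belongs to
// the deleted node is dropped right away. It returns the removed names.
func (s *gcStore) onNodeDelete(node string) []string {
	var removed []string
	for name, owner := range s.objects {
		if owner == node {
			delete(s.objects, name)
			removed = append(removed, name)
		}
	}
	sort.Strings(removed)
	return removed
}

// resync models the periodic path: every object whose node is not in
// liveNodes is dropped, which catches node deletes that were missed.
func (s *gcStore) resync(liveNodes []string) []string {
	live := make(map[string]bool, len(liveNodes))
	for _, n := range liveNodes {
		live[n] = true
	}
	var removed []string
	for name, owner := range s.objects {
		if !live[owner] {
			delete(s.objects, name)
			removed = append(removed, name)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	s := &gcStore{objects: map[string]string{
		"nf-a": "node-a", "nrt-a": "node-a", "nf-b": "node-b", "nf-c": "node-c",
	}}
	fmt.Println(s.onNodeDelete("node-a"))     // [nf-a nrt-a]
	fmt.Println(s.resync([]string{"node-b"})) // [nf-c]
	fmt.Println(len(s.objects))               // 1
}
```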
Good job!
LGTM label has been added. Git tree hash: e08c6bb458dca6273aebe743002f86f4f4923d94
Rename the old "topology-gc" to just "gc". Simplify the setup a bit by including the RBAC rules in the "gc" base. Note: we don't enable nfd-gc in the default overlay, yet, as the NodeFeature API isn't enabled (gc is not needed).
Rename files and parameters. Drop the container security context parameters from the Helm chart. There should be no reason to run the nfd-gc with other than the minimal privileges. Also updates the documentation.
I made one change: don't enable gc in the default kustomize overlay (yet), as the NodeFeature API is not enabled there, so there is no work for gc to do.
/lgtm
Thanks @marquiz
Feel free to remove the hold.
LGTM label has been added. Git tree hash: d14f109ea91adb0a78b71768fc8573d89b2bda87
@@ -450,7 +450,7 @@ topologyUpdater:
   affinity: {}
   podSetFingerprint: true

-topologyGC:
+gc:
   enable: true
I think we should tie this to enableNodeFeatureApi: false (line 13 of the same file). Why have it set to true if enableNodeFeatureApi currently defaults to false?
We can't do any dynamic "templating" in the values file. It's sort of tied to that value in nfd-gc.yaml (and some others) so that gc is not deployed if neither NodeFeature API nor topology-updater is enabled.
Any more concrete suggestions how to do this differently?
In nfd-gc.yaml we could do
{{- if and .Values.enableNodeFeatureApi (or .Values.topologyUpdater.enable) -}}
Something like that could work?
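Since Helm charts are rendered with Go templates, the kind of guard discussed in this thread can be tried out with the standard `text/template` package. The sketch below is an assumption-laden illustration, not the chart's actual template: the condition combines the value names mentioned in the thread (`gc.enable`, `enableNodeFeatureApi`, `topologyUpdater.enable`) into "deploy gc only if it is enabled and at least one of its two consumers is enabled", and `renderGuard` and `gcGuard` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// gcGuard is a hypothetical guard: gc is deployed only when gc.enable is set
// AND at least one of NodeFeature API or topology-updater is enabled.
const gcGuard = `{{- if and .Values.gc.enable (or .Values.enableNodeFeatureApi .Values.topologyUpdater.enable) -}}deploy{{- else -}}skip{{- end -}}`

// renderGuard evaluates the guard against the given values and returns
// "deploy" or "skip".
func renderGuard(gcEnable, nodeFeatureAPI, topologyUpdater bool) (string, error) {
	tmpl, err := template.New("guard").Parse(gcGuard)
	if err != nil {
		return "", err
	}
	// Mimic the shape of Helm's .Values with nested maps.
	data := map[string]any{
		"Values": map[string]any{
			"enableNodeFeatureApi": nodeFeatureAPI,
			"gc":                   map[string]any{"enable": gcEnable},
			"topologyUpdater":      map[string]any{"enable": topologyUpdater},
		},
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, c := range [][3]bool{
		{true, false, false}, // gc enabled but nothing to collect
		{true, true, false},  // NodeFeature API enabled
		{true, false, true},  // topology-updater enabled
		{false, true, true},  // gc explicitly disabled
	} {
		out, err := renderGuard(c[0], c[1], c[2])
		if err != nil {
			panic(err)
		}
		fmt.Printf("gc=%v nodeFeatureApi=%v topologyUpdater=%v -> %s\n", c[0], c[1], c[2], out)
	}
}
```

With a guard of this shape, the values file can keep `gc.enable: true` as a static default while the template decides whether anything is actually deployed.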
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ArangoGutierrez, marquiz
/unhold
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
## master #1305 +/- ##
==========================================
+ Coverage 28.85% 29.95% +1.10%
==========================================
Files 55 55
Lines 7348 7597 +249
==========================================
+ Hits 2120 2276 +156
- Misses 5001 5083 +82
- Partials 227 238 +11
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [node-feature-discovery](https://github.com/kubernetes-sigs/node-feature-discovery) | minor | `0.13.4` -> `0.14.0` |

---

### Release Notes

<details>
<summary>kubernetes-sigs/node-feature-discovery (node-feature-discovery)</summary>

### [`v0.14.0`](https://github.com/kubernetes-sigs/node-feature-discovery/releases/tag/v0.14.0)

[Compare Source](kubernetes-sigs/node-feature-discovery@v0.13.4...v0.14.0)

#### What's new

##### NodeFeature API

The [NodeFeature](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/custom-resources.html#nodefeature) API is now enabled by default. The new CRD-based API replaces the previous gRPC-based communication between nfd-master and nfd-worker, reducing network traffic and allowing changes in NodeFeatureRules to take effect immediately (independent of the sleep-interval of nfd-worker). NodeFeature API can also be used to implement 3rd party extensions, see [customization guide](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/customization-guide#nodefeature-custom-resource) for more details.

Garbage collection of stale NodeFeature objects was added in the form of the nfd-gc daemon.

The gRPC API is now deprecated and will be removed in a future release. The related command-line flags are also deprecated (and don't have any effect when NodeFeature API is in use):

- nfd-master: `-ca-file`, `-cert-file`, `-key-file`, `-port`, `-verify-node-name`
- nfd-worker: `-ca-file`, `-cert-file`, `-key-file`, `-server`, `-server-name-override`

##### Metrics

NFD now provides Prometheus metrics for better observability. Also, the Helm and kustomize deployments support enabling metrics collection with the [Prometheus operator](https://github.com/prometheus-operator/prometheus-operator). See the [documentation](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/deployment/metrics.html) for more information about the available metrics and deployment instructions.

##### Hooks disabled by default

The deprecation of nfd-worker [hooks](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/customization-guide.html#hooks) continues, disabling them by default in v0.14. Potential users of hooks are encouraged to switch to use the NFD CRDs ([NodeFeature](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/customization-guide.html#nodefeature-custom-resource) and [NodeFeatureRule](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/customization-guide.html#nodefeaturerule-custom-resource)) or [feature files](https://kubernetes-sigs.github.io/node-feature-discovery/master/usage/customization-guide.html#feature-files). Hooks can still be enabled with the [`sources.local.hooksEnabled`](https://kubernetes-sigs.github.io/node-feature-discovery/master/reference/worker-configuration-reference.html#sourceslocalhooksenabled) configuration option.

##### Feature files

**Expiry time:** NFD now supports specifying an expiry time for the features specified in a feature file, providing better lifecycle management for the feature labels. See the [documentation](https://kubernetes-sigs.github.io/node-feature-discovery/master/usage/customization-guide.html#input-format) for more details.

**Size limit:** There is now a 64kB size limit for feature files.

##### Miscellaneous

##### NodeFeatureRule API

Dynamic values for labels are now supported by using the `@` notation, see [documentation](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/usage/customization-guide.html#labels) for more details.

##### NFD-Master

- support for leader election was added, enabling high-availability deployments with multiple replicas of nfd-master (with the NodeFeature API enabled)
- dynamically configurable logging parameters via the config file
- configurable resync period for the CRD controller
- parallelized node updates, speeding up simultaneous updates of a large number of nodes (e.g. update in NodeFeatureRules in a big cluster), can be controlled with the [`-nfd-api-parallelism`](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/reference/master-commandline-reference.html#-nfd-api-parallelism) flag

##### CPU features

Detection of Intel TDX guests is now supported.

##### Logging

The project was migrated to structured logging, making log messages more consistent and easier to parse by machine, and enabling future improvements in logging.

##### Support policy

The project now officially documents its supported versions and deprecation policy, see the [documentation](https://kubernetes-sigs.github.io/node-feature-discovery/v0.14/reference/versions.html) for details.

#### List of PRs - test/e2e: use proper context ([#​1154](kubernetes-sigs/node-feature-discovery#1154)) - deps: Update kubernetes to v1.27.1 ([#​1155](kubernetes-sigs/node-feature-discovery#1155)) - generate: update k8s code-generator to v0.27.1 ([#​1156](kubernetes-sigs/node-feature-discovery#1156)) - generate: update protoc to v22.3 ([#​1157](kubernetes-sigs/node-feature-discovery#1157)) - generate: update controller-gen to v0.11.3 ([#​1158](kubernetes-sigs/node-feature-discovery#1158)) - generate: update mockery to v2.25.1 ([#​1159](kubernetes-sigs/node-feature-discovery#1159)) - nfd-master: support noPublish with -prune ([#​1161](kubernetes-sigs/node-feature-discovery#1161)) - nfd-master: fix -prune ([#​1160](kubernetes-sigs/node-feature-discovery#1160)) - nfd-master: don't create emtpy annotations ([#​1166](kubernetes-sigs/node-feature-discovery#1166)) - nfd-master: fix a crash when processing NodeFeatureRules ([#​1173](kubernetes-sigs/node-feature-discovery#1173)) - pkg/nfd-master/nfd-master.go: Fix typo ([#​1171](kubernetes-sigs/node-feature-discovery#1171)) - nfd-master: reject malformed extended resource dynamic capacity assignment ([#​1169](kubernetes-sigs/node-feature-discovery#1169)) - go.mod: update deps ([#​1178](kubernetes-sigs/node-feature-discovery#1178)) - OWNERS: add ArangoGutierrez as an approver ([#​1180](kubernetes-sigs/node-feature-discovery#1180)) - feat: add master resync period configurability ([#​1139](kubernetes-sigs/node-feature-discovery#1139)) - nfd-topology-updater: fix wrong kubelet_internal_checkpoint path and compare basename to full path ([#​1167](kubernetes-sigs/node-feature-discovery#1167)) - docs: add missing .md suffix to internal references ([#​1189](kubernetes-sigs/node-feature-discovery#1189)) - nfd-master: log node name when processing NodeFeatureRules ([#​1191](kubernetes-sigs/node-feature-discovery#1191)) - scripts/test-infra: provide PR info to codecov ([#​1194](kubernetes-sigs/node-feature-discovery#1194)) - Match 
usage and example for prepare-release.sh ([#​1196](kubernetes-sigs/node-feature-discovery#1196)) - apis/nfd: add unit tests for Feature type ([#​1190](kubernetes-sigs/node-feature-discovery#1190)) - Update README to v0.13.1 ([#​1197](kubernetes-sigs/node-feature-discovery#1197)) - scripts/test-infra: provide PR base SHA to codecov ([#​1199](kubernetes-sigs/node-feature-discovery#1199)) - codecov: drop required minimum coverage ratio of a commit to 0% ([#​1200](kubernetes-sigs/node-feature-discovery#1200)) - codecov: drop required minimum coverage ratio at patch level ([#​1201](kubernetes-sigs/node-feature-discovery#1201)) - nfd-master: refactor api-controller object handling ([#​1198](kubernetes-sigs/node-feature-discovery#1198)) - nfd-master: refactor filtering of labels, taints and ERs ([#​1202](kubernetes-sigs/node-feature-discovery#1202)) - helm: fix mount for nfd-master config ([#​1204](kubernetes-sigs/node-feature-discovery#1204)) - nfd-master: fix resync period config option ([#​1185](kubernetes-sigs/node-feature-discovery#1185)) - deployment/helm: fix default for kubeletStateDir parameter ([#​1207](kubernetes-sigs/node-feature-discovery#1207)) - deployment/kustomize: drop pod-resources mount for topology-updater ([#​1208](kubernetes-sigs/node-feature-discovery#1208)) - test/e2e: refactor matching of node properties ([#​1184](kubernetes-sigs/node-feature-discovery#1184)) - deployment/helm: avoid overlapping mount paths on topology-updater ([#​1212](kubernetes-sigs/node-feature-discovery#1212)) - deployment/helm: user dedicated serviceaccount for topology-updater ([#​1213](kubernetes-sigs/node-feature-discovery#1213)) - deployment/helm: improve handling of topologyUpdater.kubeletStateFiles ([#​1211](kubernetes-sigs/node-feature-discovery#1211)) - topology-updater: use node IP in the default configz URI ([#​1218](kubernetes-sigs/node-feature-discovery#1218)) - e2e: delete CRs only if found ([#​1221](kubernetes-sigs/node-feature-discovery#1221)) - Add leader 
election for nfd-master ([#​1219](kubernetes-sigs/node-feature-discovery#1219)) - Fixed typo in Header under deployment/kustomize.md ([#​1222](kubernetes-sigs/node-feature-discovery#1222)) - nfd-master: use close for stop channel ([#​1227](kubernetes-sigs/node-feature-discovery#1227)) - scripts/test-infra: bump golangci-lint to v1.52.2 ([#​1230](kubernetes-sigs/node-feature-discovery#1230)) - nfd-master: add validation of label names and values ([#​1228](kubernetes-sigs/node-feature-discovery#1228)) - Migrate to structured logging ([#​1223](kubernetes-sigs/node-feature-discovery#1223)) - scripts/test-infra: add logcheck to verify script ([#​1235](kubernetes-sigs/node-feature-discovery#1235)) - Update README to v0.13.2 ([#​1238](kubernetes-sigs/node-feature-discovery#1238)) - github: update new-release issue template ([#​1239](kubernetes-sigs/node-feature-discovery#1239)) - feat: support dynamic values for labels in the NodeFeatureRule ([#​1226](kubernetes-sigs/node-feature-discovery#1226)) - feat: parallelize nodes update ([#​1133](kubernetes-sigs/node-feature-discovery#1133)) - cpu: Discover TDX guests based on cpuid information ([#​1240](kubernetes-sigs/node-feature-discovery#1240)) - deployment/kustomize: use a named port for nfd gRPC service ([#​1243](kubernetes-sigs/node-feature-discovery#1243)) - Fix missing apostrophe for jq ([#​1245](kubernetes-sigs/node-feature-discovery#1245)) - Fix a typo on nfd-master cmd ([#​1244](kubernetes-sigs/node-feature-discovery#1244)) - Removal of the bases field as it is deprecated by kustomize ([#​1246](kubernetes-sigs/node-feature-discovery#1246)) - Docs: Fix typo on customization-guide ([#​1247](kubernetes-sigs/node-feature-discovery#1247)) - hooks: disable hooks by default from v0.14 ([#​1182](kubernetes-sigs/node-feature-discovery#1182)) - Remove pkg's imported twice ([#​1248](kubernetes-sigs/node-feature-discovery#1248)) - fix typo in helm chart ([#​1253](kubernetes-sigs/node-feature-discovery#1253)) - Stop ticker in 
time to avoid memory leak ([#​1255](kubernetes-sigs/node-feature-discovery#1255)) - nfd-master: check for nil references in nfdAPIUpdateAllNodes ([#​1258](kubernetes-sigs/node-feature-discovery#1258)) - cpu: Take cgroupsv1 into account when reading misc.capacity ([#​1265](kubernetes-sigs/node-feature-discovery#1265)) - go.mod: update kubernetes to v1.27.4 ([#​1268](kubernetes-sigs/node-feature-discovery#1268)) - github: update assignees in new-release issue template ([#​1274](kubernetes-sigs/node-feature-discovery#1274)) - Enable metrics via prometheus operator ([#​1242](kubernetes-sigs/node-feature-discovery#1242)) - README: update to v0.13.3 ([#​1276](kubernetes-sigs/node-feature-discovery#1276)) - docs: document version and deprecation policy ([#​1279](kubernetes-sigs/node-feature-discovery#1279)) - docs: fix toc of topology-updater and topology-gc reference ([#​1278](kubernetes-sigs/node-feature-discovery#1278)) - docs: remove useless TOCs ([#​1280](kubernetes-sigs/node-feature-discovery#1280)) - Add optional labels to the podmonitor ([#​1282](kubernetes-sigs/node-feature-discovery#1282)) - docs: describe supported Kubernetes versions ([#​1277](kubernetes-sigs/node-feature-discovery#1277)) - docs: deprecation policy for Helm chart params ([#​1283](kubernetes-sigs/node-feature-discovery#1283)) - Fix Topology Manager policy and scope not being updated after NRT creation ([#​1256](kubernetes-sigs/node-feature-discovery#1256)) - generate: bump tools to their latest versions ([#​1284](kubernetes-sigs/node-feature-discovery#1284)) - Improve metrics ([#​1288](kubernetes-sigs/node-feature-discovery#1288)) - docs: align metrics documentation with latest changes on naming ([#​1289](kubernetes-sigs/node-feature-discovery#1289)) - docs: unify formatting of NOTEs ([#​1292](kubernetes-sigs/node-feature-discovery#1292)) - source/local: trim whitespace from input ([#​1293](kubernetes-sigs/node-feature-discovery#1293)) - source/local: support comments in input 
([#​1294](kubernetes-sigs/node-feature-discovery#1294)) - nfd-master: use term node update instead of labeling ([#​1291](kubernetes-sigs/node-feature-discovery#1291)) - docs: document -metrics flag in command line reference ([#​1296](kubernetes-sigs/node-feature-discovery#1296)) - fix empty hugepages in some numa nodes caused no such file or directory errors ([#​1287](kubernetes-sigs/node-feature-discovery#1287)) - scripts/test-infra: update logcheck tool to v0.6.0 ([#​1299](kubernetes-sigs/node-feature-discovery#1299)) - scripts/test-infra: bump golangci-lint to v1.54.0 ([#​1300](kubernetes-sigs/node-feature-discovery#1300)) - Update kubernetes to v1.28.0 ([#​1302](kubernetes-sigs/node-feature-discovery#1302)) - docs: update github-pages gem to v228 ([#​1303](kubernetes-sigs/node-feature-discovery#1303)) - topology-gc: fix Stop ([#​1306](kubernetes-sigs/node-feature-discovery#1306)) - topology-gc: rename run() ([#​1309](kubernetes-sigs/node-feature-discovery#1309)) - topology-gc: rename runGC to garbageCollect() ([#​1310](kubernetes-sigs/node-feature-discovery#1310)) - nfd-topology-updater: add metrics support ([#​1295](kubernetes-sigs/node-feature-discovery#1295)) - topology-gc: refactor unit tests ([#​1307](kubernetes-sigs/node-feature-discovery#1307)) - topology-gc: move initial GC out of startNodeInformer() ([#​1308](kubernetes-sigs/node-feature-discovery#1308)) - topology-gc: simplify listing of node objects ([#​1311](kubernetes-sigs/node-feature-discovery#1311)) - metrics: additional metrics for nfd-master ([#​1290](kubernetes-sigs/node-feature-discovery#1290)) - Garbage collection of NodeFeature objects ([#​1305](kubernetes-sigs/node-feature-discovery#1305)) - topology-updater: make -version always runnable ([#​1297](kubernetes-sigs/node-feature-discovery#1297)) - go.mod: update kubernetes to v1.28.1 ([#​1315](kubernetes-sigs/node-feature-discovery#1315)) - Makefile: increase golangci-lint timeout to 10min 
([#​1320](kubernetes-sigs/node-feature-discovery#1320)) - docs: use ruby docker image for building docs ([#​1319](kubernetes-sigs/node-feature-discovery#1319)) - README: update to v0.13.4 ([#​1324](kubernetes-sigs/node-feature-discovery#1324)) - test: add node updater pool unit tests ([#​1252](kubernetes-sigs/node-feature-discovery#1252)) - docs: nfd-updater: clarify accounting ([#​1321](kubernetes-sigs/node-feature-discovery#1321)) - nfd-updater: events: enable timer-only flow ([#​1325](kubernetes-sigs/node-feature-discovery#1325)) - docs: demote hooks in the customization guide ([#​1326](kubernetes-sigs/node-feature-discovery#1326)) - Feat: add expiry date for feature files ([#​1285](kubernetes-sigs/node-feature-discovery#1285)) - Dockerfile: bump grpc-health-probe to v0.4.19 ([#​1327](kubernetes-sigs/node-feature-discovery#1327)) - e2e/test: make the nfd-gc test pass on one-node cluster ([#​1328](kubernetes-sigs/node-feature-discovery#1328)) - Enable NodeFeature API by default ([#​1329](kubernetes-sigs/node-feature-discovery#1329)) - tls.md: Add note ([#​1332](kubernetes-sigs/node-feature-discovery#1332)) - nfd_gc_test.go: fix multiple import of same pkg ([#​1333](kubernetes-sigs/node-feature-discovery#1333)) - feat: add feature file size limit ([#​1335](kubernetes-sigs/node-feature-discovery#1335)) - sources/custom: convert static rules to new format ([#​1336](kubernetes-sigs/node-feature-discovery#1336)) - nfd-master: add config file options for logging ([#​1338](kubernetes-sigs/node-feature-discovery#1338)) - Deprecate gRPC API ([#​1334](kubernetes-sigs/node-feature-discovery#1334)) - Helm: conditionally add annotations if defined ([#​1331](kubernetes-sigs/node-feature-discovery#1331)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. 
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://git.home/nrdufour/home-ops/pulls/78
Co-authored-by: Renovate
Co-committed-by: Renovate
This PR implements garbage collection for NodeFeature objects. It achieves this by renaming the nfd-topology-gc daemon to the more generic nfd-gc and extending it to clean up stale NodeFeatures.
Fixes #1304