wicket logs / EventReports after running multiple mupdates are enormous #3102

Closed
jgallagher opened this issue May 12, 2023 · 2 comments · Fixed by #3126
Labels
wicket Operator Interaction Via Technician Port

Comments

@jgallagher
Contributor

Today when trying to update 6 gimlets in the dogfood rack simultaneously, I noticed a few oddities:

  • After it had been running for a while, wicket became sluggish
  • Within the switch zone, /tmp/wicket.log was several gigabytes, and growing quickly (~ 1GiB/minute, eyeballing it)
  • From skimming the logs, the vast majority consisted of debug prints of EventReports, many of which were extremely large (75,000+ lines) and repeated very frequently

I think there are at least two things to address here:

  1. Stop debug logging the full event reports
  2. Why are the event reports so large?

For item 2, the issue might be where we generate the report: should that call be `generate_report_since(self.last_reported)`?
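
For reference, here's a minimal sketch of what "only report what's new" could look like; the `Event`, `EventReport`, and `EventBuffer` types and the exact `generate_report_since` signature below are simplified stand-ins for illustration, not the real update-engine API:

```rust
// Hypothetical sketch of incremental report generation. The idea is that
// each generated report only contains events newer than the index we last
// reported, rather than the full history every time (which is what makes
// repeated reports balloon).

#[derive(Clone, Debug)]
struct Event {
    index: usize,
    message: String,
}

#[derive(Debug)]
struct EventReport {
    events: Vec<Event>,
}

#[derive(Default)]
struct EventBuffer {
    events: Vec<Event>,
    last_reported: Option<usize>,
}

impl EventBuffer {
    /// Return only the events strictly newer than `since`.
    fn generate_report_since(&mut self, since: Option<usize>) -> EventReport {
        let events: Vec<Event> = self
            .events
            .iter()
            .filter(|e| since.map_or(true, |s| e.index > s))
            .cloned()
            .collect();
        if let Some(last) = events.last() {
            self.last_reported = Some(last.index);
        }
        EventReport { events }
    }
}

fn main() {
    let mut buffer = EventBuffer::default();
    for index in 0..5 {
        buffer.events.push(Event { index, message: format!("step {index}") });
    }
    // First report carries everything; subsequent ones only carry new events.
    let first = buffer.generate_report_since(None);
    let last_seen = buffer.last_reported;
    buffer.events.push(Event { index: 5, message: "step 5".to_string() });
    let second = buffer.generate_report_since(last_seen);
    println!("first: {} events, second: {} events", first.events.len(), second.events.len());
}
```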

@jgallagher jgallagher added the wicket Operator Interaction Via Technician Port label May 12, 2023
@jgallagher jgallagher added this to the Manufacturing PVT1 milestone May 12, 2023
@davepacheco
Collaborator

Should we log these to disk rather than to an in-memory filesystem? I'm not sure how the cap on /tmp works, but if it lets you use all of the system's physical memory, that would explain the sluggishness (and that would reflect a pretty serious issue, I think). And/or maybe we don't have a swap cap on this zone?

@jgallagher
Contributor Author

I suspect the sluggishness was just from all the debug formatting? It was debug-formatting several hundred thousand lines of output every second; even if there was no real I/O, that can't be free.

jgallagher added a commit that referenced this issue May 16, 2023
1. wicketd: Log the hash of the extracted host phase 2 artifact (we
already log the hashes of the other artifacts, and at one point I wanted
to find this - cracking open the tuf repo and then untarring the
composite host artifact was annoying)
2. wicket: Only log the full EventReport we receive about once per
minute (a rough sketch of this throttling follows below). We were hitting
#3102, so these were huge, but even after that's fixed I don't think we
need to log these every second.
3. wicket: Remove the ESC/ENTER titles on the main panes. Switching
between panes now happens via TAB.
4. wicket: Fix ESC closing the ignition popup - previously it always
sent us back to the rack view, and there was no way to close the
ignition popup other than running one of the commands.
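
A minimal sketch of the once-per-minute throttling described in item 2 above; `ThrottledLogger` is a hypothetical helper for illustration, not the actual wicket change:

```rust
use std::time::{Duration, Instant};

// Illustrative sketch: only emit the expensive debug-formatted report
// roughly once per interval, since formatting a 75,000-line report every
// second is costly even when no real I/O happens.
struct ThrottledLogger {
    interval: Duration,
    last_logged: Option<Instant>,
}

impl ThrottledLogger {
    fn new(interval: Duration) -> Self {
        Self { interval, last_logged: None }
    }

    /// Call `log_fn` only if `interval` has elapsed since the last call.
    fn maybe_log(&mut self, log_fn: impl FnOnce()) {
        let now = Instant::now();
        let due = self
            .last_logged
            .map_or(true, |last| now.duration_since(last) >= self.interval);
        if due {
            self.last_logged = Some(now);
            log_fn();
        }
    }
}

fn main() {
    let mut throttle = ThrottledLogger::new(Duration::from_secs(60));
    // In wicket this would wrap the debug line that prints the full
    // EventReport; here we just print a placeholder.
    for tick in 0..3 {
        throttle.maybe_log(|| println!("full event report at tick {tick}"));
    }
}
```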
sunshowers added a commit that referenced this issue May 16, 2023
Deduplicate nested events coming from remote machines in a couple of spots:

1. When we're generating nested events, so that duplicate events don't make their way over the event channel. Maintain event buffers within `StepContext`s to handle that.
2. In `EventBuffer` itself, dedup events based on the leaf index rather than the root index. If two events have the same leaf event index and leaf execution ID, they can always be deduped (see the sketch after this message).

While testing this, I also realized it would be really useful to add a `root_execution_id` to every event report: do so, and depend on it (note that you'll have to rebuild installinator with this PR to pick up support for that).

Finally, address the review comment in #3081 -- do it here to avoid having to create a whole separate PR for this.

Fixes #3102.
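
A minimal sketch of the leaf-index dedup rule from item 2 of that commit; the types here are simplified stand-ins for the real `EventBuffer` and execution IDs:

```rust
use std::collections::HashSet;

// Illustrative sketch of the dedup rule: two events with the same leaf
// execution ID and leaf event index are treated as duplicates, regardless
// of which root report they arrived under.
type ExecutionId = u128; // stand-in for the real execution ID type

#[derive(Clone, Debug)]
struct NestedEvent {
    leaf_execution_id: ExecutionId,
    leaf_event_index: usize,
    payload: String,
}

#[derive(Default)]
struct EventBuffer {
    seen: HashSet<(ExecutionId, usize)>,
    events: Vec<NestedEvent>,
}

impl EventBuffer {
    /// Insert the event only if its (leaf execution ID, leaf event index)
    /// pair has not been seen before; returns true if it was new.
    fn add_event(&mut self, event: NestedEvent) -> bool {
        let key = (event.leaf_execution_id, event.leaf_event_index);
        if self.seen.insert(key) {
            self.events.push(event);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut buffer = EventBuffer::default();
    let event = NestedEvent {
        leaf_execution_id: 1,
        leaf_event_index: 7,
        payload: "installinator progress".to_string(),
    };
    assert!(buffer.add_event(event.clone()));
    // The same leaf event arriving again (e.g. under a different root
    // report) is dropped instead of growing the buffer.
    assert!(!buffer.add_event(event));
    println!(
        "buffered {} event(s); first payload: {}",
        buffer.events.len(),
        buffer.events[0].payload
    );
}
```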