Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track upgrade details #3527

Merged
merged 47 commits into from
Oct 19, 2023
Merged

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Oct 5, 2023

What does this PR do?

This PR allows the Agent to internally track most of the important states of the upgrade process. The two states not being tracked as part of this PR are UPG_SCHEDULING and UPG_ROLLBACK; tracking for these will be added in follow up PRs (see #3119 (comment) for rationale).

Implementation Details

This PR introduces a new package, github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/details. This package introduces two new types:

  • State: to represent the various upgrade states, e.g. UPG_REQUESTED, UPG_DOWNLOADING, etc.
  • Details: to encapsulate the upgrade state and other relevant upgrade details, e.g. the target version, any error upon failure, etc.

The Details type uses an Observer design pattern to notify any interested consumers of changes.

A new field, UpgradeDetails, and accompanying setter method, setUpgradeDetails is introduced on the coordinator.state struct. In the coordinator's Upgrade method, a new Details object is constructed and the new setter method is registered as an observer on this details object. As the agent progresses through (most of) the upgrade steps, the details object is updated. The Observer pattern then causes these updates to be reflected in coordinator.state.UpgradeDetails.

Why is it important?

To lay the foundations for increasing visibility into the Agent upgrade process.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Related issues

@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 5, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-10-19T19:51:43.746+0000

  • Duration: 26 min 19 sec

Test stats 🧪

Test Results
Failed 0
Passed 6577
Skipped 59
Total 6636

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Copy link
Contributor

mergify bot commented Oct 5, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b track-upgrade-details upstream/track-upgrade-details
git merge upstream/main
git push upstream track-upgrade-details

Comment on lines 78 to 87
d.mu.RLock()
defer d.mu.RUnlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use an RLock() here? Is it expected to have multiple concurrent NotifyObservers() calls (all the others functions guards are a normal Lock()) ?
If NotifyObservers() cannot be called concurrently then we don't need an RWMutex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be multiple observers registered. In such a case they would be called serially, in the loop inside the notifyObservers method.

However, even if there was only one single observer registered, the RLock call is needed to ensure that the internal state of d, e.g. d.State or d.Metadata, haven't changed between the caller calling NotifyObservers() and eventually the observer function being called inside the notifyObserver method. Also, I used an RLock instead of a Lock because I'm okay with multiple reads of d's internal state while NotifyObservers() is executing; I just don't want a write to sneak in there as explained earlier.

This comment was marked as duplicate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal I'm trying to achieve here is that for a given UpgradeDetails object, d, I want all it's observers to see the same value of d when d.NotifyObservers() is called. Because an RLock() would prevent concurrent writes to the fields of d, I think my goal is achieved.

You're right that a regular Lock() would also achieve that goal but it would be a stronger lock than I need. I don't necessarily care if there are two concurrent invocations of d.NotifyObservers() for the same d object; in fact, I would want both to be executed without the first one blocking the second one, which would happen if we used a Lock() here.

I think I am missing something here as I don't see other functions holding an RLock for reading state. If those reads happen without any lock at all, again there's no difference between RLock and a normal mutex.

Are you referring to notifyObservers() and notifyObserver()? I did not use any mutexes in there because otherwise we end up with a deadlock when d.SetState() is called. These are private methods and the public methods that call them acquire/release the mutex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple RLock() can be acquired simultaneously so in the current code the notification would run concurrently. Is this the desired result ?

Yes, as explained in the previous comment.

If that's the case, did we write down somewhere that the Observers' code MUST be thread safe ?

No, but you bring up an excellent point. Observer's should receive a (deep) copy of the details object, not a pointer to it. The idea is that the observer receives a "snapshot-in-time" of the upgrade details at the time of observation. I have fixed this now in 5bbb507.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just in the manipulation of the UpgradeDetails, it's a warning for the implementation of the observers to protect their own state against concurrent calls...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just in the manipulation of the UpgradeDetails, it's a warning for the implementation of the observers to protect their own state against concurrent calls...

Now I think I might be missing something 😄. Not a rhetorical question: why should we add such a warning? Isn't it generally true that any object that wants to support concurrent access may want to protect its own state?

Copy link
Contributor Author

@ycombinator ycombinator Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pchila and I chatted over Zoom about this. The decision was to provide a setter method on UpgradeDetails for setting the download percentage, as this was the only operation that was being externally synchronized with a call to the NotifyObservers() method. By providing this setter and implementing the synchronization internally within that setter, we no longer need or allow writers of UpgradeDetails to directly notify observers; that happens automatically under the hood when these writers update some state of UpgradeDetails via setters. This is a much cleaner and safer approach!

internal/pkg/agent/application/upgrade/step_download.go Outdated Show resolved Hide resolved
@mergify
Copy link
Contributor

mergify bot commented Oct 13, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b track-upgrade-details upstream/track-upgrade-details
git merge upstream/main
git push upstream track-upgrade-details

@ycombinator ycombinator force-pushed the track-upgrade-details branch from a2d4a68 to fa15191 Compare October 13, 2023 12:45
@ycombinator ycombinator force-pushed the track-upgrade-details branch from b02c38e to 7346a5e Compare October 13, 2023 19:00
@@ -182,6 +189,8 @@ func (u *Upgrader) Upgrade(ctx context.Context, version string, sourceURI string
return nil, err
}

det.SetState(details.StateWatching)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the agent restarts essentially right after this, whether not we ever seen any of these transitions in Fleet depends on if a checkin happens to occur before we restart.

We would be guaranteed to see this on the first checkin of the new agent version if it were persisted somewhere. I assume something like this will be done when we add support for the UPG_ROLLBACK state?

Although realistically if the upgrade completes quickly and without error then knowing anything about it isn't that interesting. It's the slow and failure cases that are interesting to see.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I haven't yet thought through the implementation of for UPG_ROLLBACK but I expect we'll use the upgrade marker file to persist this state so we can communicate it from the Upgrade Watcher process to the main Agent process. We may want to expand on that idea and use the upgrade marker file to persist UPG_WATCHING as well — in fact, it will probably be the initial state in the upgrade marker file when it's created.

Longer term it would be nice to persist all upgrade states in the upgrade marker file and use it to drive the upgrade state machine; that way if an Agent restarts anywhere in the middle of an upgrade, it should be able to pick up from where it left off (assuming every or almost every step of the upgrade process is idempotent).

internal/pkg/agent/application/coordinator/coordinator.go Outdated Show resolved Hide resolved
internal/pkg/agent/application/coordinator/coordinator.go Outdated Show resolved Hide resolved
@ycombinator
Copy link
Contributor Author

CI is failing because the TestCoordinatorInitiatesUpgrade unit test is timing out. Investigating...

@ycombinator
Copy link
Contributor Author

CI is now failing because of the following data race. Investigating...

[2023-10-13T20:45:58.475Z] === FAIL: internal/pkg/agent/application/upgrade/artifact/download/http TestDownloadLogProgressWithLength (2.27s)
[2023-10-13T20:45:58.475Z] ==================
[2023-10-13T20:45:58.475Z] WARNING: DATA RACE
[2023-10-13T20:45:58.475Z] Write at 0x00c0001402c0 by goroutine 15:
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*detailsProgressObserver).ReportCompleted()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_observer.go:117 +0x44
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*downloadProgressReporter).ReportComplete()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_reporter.go:108 +0x1cc
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).downloadFile()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:222 +0xdc8
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).download()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:150 +0x2cc
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).Download()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:111 +0x1c4
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.TestDownloadLogProgressWithLength()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader_test.go:126 +0x6c4
[2023-10-13T20:45:58.475Z]   testing.tRunner()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1576 +0x180
[2023-10-13T20:45:58.475Z]   testing.(*T).Run.func1()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1629 +0x40
[2023-10-13T20:45:58.475Z] 
[2023-10-13T20:45:58.475Z] Previous write at 0x00c0001402c0 by goroutine 22:
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*detailsProgressObserver).Report()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_observer.go:112 +0x48
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*downloadProgressReporter).Report.func1()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_reporter.go:84 +0x29c
[2023-10-13T20:45:58.475Z] 
[2023-10-13T20:45:58.475Z] Goroutine 15 (running) created at:
[2023-10-13T20:45:58.475Z]   testing.(*T).Run()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1629 +0x5b4
[2023-10-13T20:45:58.476Z]   testing.runTests.func1()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:2036 +0x80
[2023-10-13T20:45:58.476Z]   testing.tRunner()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1576 +0x180
[2023-10-13T20:45:58.476Z]   testing.runTests()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:2034 +0x6a8
[2023-10-13T20:45:58.476Z]   testing.(*M).Run()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1906 +0x8e0
[2023-10-13T20:45:58.476Z]   main.main()
[2023-10-13T20:45:58.476Z]       _testmain.go:57 +0x2b8
[2023-10-13T20:45:58.476Z] 
[2023-10-13T20:45:58.476Z] Goroutine 22 (running) created at:
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*downloadProgressReporter).Report()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_reporter.go:64 +0x238
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).downloadFile()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:215 +0xba0
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).download()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:150 +0x2cc
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).Download()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:111 +0x1c4
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.TestDownloadLogProgressWithLength()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader_test.go:126 +0x6c4
[2023-10-13T20:45:58.476Z]   testing.tRunner()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1576 +0x180
[2023-10-13T20:45:58.476Z]   testing.(*T).Run.func1()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1629 +0x40
[2023-10-13T20:45:58.476Z] ==================

@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 13, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.824% (84/85) 👍 0.014
Files 67.105% (204/304) 👍 0.109
Classes 66.19% (370/559) 👍 0.244
Methods 53.583% (1174/2191) 👍 0.341
Lines 39.807% (13771/34594) 👍 0.209
Conditionals 100.0% (0/0) 💚

@ycombinator ycombinator requested review from pchila and faec October 13, 2023 22:45
@ycombinator ycombinator force-pushed the track-upgrade-details branch from dcfec79 to 52987e7 Compare October 19, 2023 19:46
@elastic-sonarqube
Copy link

@ycombinator ycombinator merged commit 92acf08 into elastic:main Oct 19, 2023
@ycombinator ycombinator deleted the track-upgrade-details branch October 19, 2023 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip enhancement New feature or request skip-changelog Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants