-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. #13103
watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. #13103
Conversation
Interlocking based on time can result in confusion when using simulated time and without calls to advanceAndWait between calls to GuarddogImpl::forceCheckForTest. Signed-off-by: Antonio Vicente <[email protected]>
…t petting of the watchdog. Signed-off-by: Antonio Vicente <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good. Thanks for doing this.
source/server/guarddog_impl.cc
Outdated
: dog_(watch_dog), | ||
last_touch_count_(dog_->touchCount()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you should set this to 0 (the last touch count)
otherwise we'll fall through if (watched_dog->last_touch_count_ != touch_count)
on the first iteration of the new watchdog in step
. We'll end up using const auto ltt = watched_dog->last_touch_time_;
without it being explicitly set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice line 179-180 above, where the WatchedDog is created and touched in the following line to guarantee that the first guarddog step after creation goes down the "watched_dog->last_touch_count_ != touch_count" branch.
source/server/watchdog_impl.h
Outdated
} | ||
|
||
private: | ||
const Thread::ThreadId thread_id_; | ||
TimeSource& time_source_; | ||
std::atomic<std::chrono::steady_clock::duration> latest_touch_time_since_epoch_; | ||
std::atomic<uint64_t> touch_count_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we support 32 bit ARM? Will this work there?
I sort of expect the std::atomic vs absl::Mutex discussion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide more context? I assume ARM supports atomic ops. Are they particularly expensive in ARM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be acceptable to make this a 32 bit type, wrap around behavior would be acceptable since we're just checking that the numbers are different. Another implementation option would be to have the guarddog reset the counter to 0 when stepping through which should prevent overflow.
run format Signed-off-by: Antonio Vicente <[email protected]>
Signed-off-by: Antonio Vicente <[email protected]>
source/server/watchdog_impl.h
Outdated
} | ||
// Accessors for use by GuardDogImpl to tell if the watchdog was touched recently and reset the | ||
// count back to 0 when it has. | ||
uint64_t touchCount() const { return touch_count_.load(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're going with uint32_t internally, perhaps we should return that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, fixed.
source/server/guarddog_impl.cc
Outdated
if (watched_dog->dog_->touchCount() > 0) { | ||
// Watchdog was touched since guarddog last checked; update time and reset the count. | ||
watched_dog->dog_->resetTouchCount(); | ||
watched_dog->last_touch_time_ = now; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we'll lose some accuracy since the last touch time is set here instead of when the watchdog is actually touched. Is this ok?
Should we then rename it to something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is still a timestamp from which the most recent touch was detected. I agree that there's a change in precision; watchdog miss/megamiss/kill/multikill would be delayed by up to min(miss_timeout / 2, megamiss_timeout / 2, kill_timeout / 2, multikill_timeout / 2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to last_checkin_ for consistency after your recent change
Signed-off-by: Antonio Vicente <[email protected]>
Signed-off-by: Antonio Vicente <[email protected]>
Signed-off-by: Antonio Vicente <[email protected]>
Signed-off-by: Antonio Vicente <[email protected]>
/retest |
Retrying Azure Pipelines, to retry CircleCI checks, use |
Can you merge main and see if that fixes CI? Not sure what is broken here. /wait |
Signed-off-by: Antonio Vicente <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a few optional drive by comments.
Thread::LockGuard guard(mutex_); | ||
if (!run_thread_) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why this lock now needs to be held for the entire function. Can you possibly add some more comments? Should there by GUARDED_BY annotations to make this more clear? Maybe this is just to simplify the logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment. This is part of the fix to test interlock; we need a full run of the step function after forceCheckForTest. Performance shouldn't suffer since the use of mutex_ in the step() function should never be contended; the other uses of this mutex are GuardDogImpl::start, stop and forceCheckForTest.
source/server/guarddog_impl.cc
Outdated
if (watched_dog->dog_->touchCount() > 0) { | ||
// Watchdog was touched since the guard dog last checked; update time and reset the count. | ||
watched_dog->dog_->resetTouchCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perf nit: I think unless you do a relaxed load it's probably faster to just always do a reset and return the previous count, and then branch on that. Up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the watchdog API so it provides a single method that returns if the dog was touched recently and reset touch state. I also changed the type to bool since it's more appropriate and switched to a looser memory order for performance.
Signed-off-by: Antonio Vicente <[email protected]>
Signed-off-by: Antonio Vicente <[email protected]>
e9db878
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with small comment but needs format fix.
/wait
source/server/watchdog_impl.h
Outdated
void touch() override { | ||
// Set touched_ if not already set. | ||
bool expected = false; | ||
touched_.compare_exchange_strong(expected, true, std::memory_order_release, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an atomics expert, but I'm pretty sure you can use store
with memory_order_relaxed
on this one, and use memory_order_relaxed
on the exchange above.
The whole thing is eventually consistent so I'm not sure any ordering matters and that would probably be faster?
The only thing I'm not sure of is whether this would effect test determinism, however, if you are using locks to serialize the tests that should force consistency of all previous operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm no expert either. After thinking about it some more, I think you're totally right - relaxed memory order is sufficient here. Switched to that.
Fix spelling Signed-off-by: Antonio Vicente <[email protected]>
Oops, spellcheck issue. Need to remember to run that before pushing changes. |
Signed-off-by: Antonio Vicente <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry one more comment otherwise LGTM.
/wait
|
||
// Server::WatchDog | ||
void startWatchdog(Event::Dispatcher& dispatcher) override; | ||
void touch() override { | ||
// Set touched_ if not already set. | ||
bool expected = false; | ||
touched_.compare_exchange_strong(expected, true, std::memory_order_release, | ||
std::memory_order_acquire); | ||
touched_.compare_exchange_strong(expected, true, std::memory_order_relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need the CAS here. I think you can just do a relaxed store. It doesn't matter if it's already set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. I'm using CAS to avoid write in cases where touched_ is already true, since atomic reads are cheaper than atomic writes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Signed-off-by: Antonio Vicente <[email protected]>
Signed-off-by: Antonio Vicente <[email protected]>
Merge main once #13598 merges. Thanks! /wait |
Signed-off-by: Antonio Vicente <[email protected]>
Signed-off-by: Antonio Vicente <[email protected]>
Merged master, I see an old approval from you. Let me know if you have any newer comments or I should just merge in the PR. |
Go ahead and merge! |
* master: (22 commits) ci: various improvements (envoyproxy#13660) dns: fix defunct fd bug in apple resolver (envoyproxy#13641) build: support ppc64le with wasm (envoyproxy#13657) [fuzz] Added random load balancer fuzz (envoyproxy#13400) dependencies: compute and check release dates via GitHub API. (envoyproxy#13582) mac ci: try ignoring update failure (envoyproxy#13658) watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. (envoyproxy#13103) typos: fix a couple 'enovy' mispellings (envoyproxy#13645) lua: Expose stream info downstreamLocalAddress and downstreamDirectRemoteAddress for Lua filter (envoyproxy#13536) tap: fix upstream streamed transport socket taps (envoyproxy#13638) Revert "delay health checks until transport socket secrets are ready. (envoyproxy#13516)" (envoyproxy#13639) Watchdog: use abort action as a default if killing is enabled. (envoyproxy#13523) [fuzz] Fixed divide by zero bug (envoyproxy#13545) wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621) fix: record recovered local address (envoyproxy#13581) docs: fix incorrect compressor filter doc (envoyproxy#13611) docs: clean up docs for azp migration (envoyproxy#13558) wasm: fix building Wasm example. (envoyproxy#13619) test: Refactor flood tests into a separate test file (envoyproxy#13556) wasm: re-enable tests with precompiled modules. (envoyproxy#13583) ... Signed-off-by: Michael Puncel <[email protected]>
Commit Message:
watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog.
Additional Description:
Use an atomic counter to track if the watchdog was touched since the last time the guarddog checked. This avoids expensive calls to get the current monotonic time every time the watchdog is touched, and thus reduces the overhead incurred by touching the watchdog after every event loop callback.
This PR also includes pre-requisite fixes to the Guarddog's test interlock mechanism so it operates correctly in cases where simulated monotonic time does not move forward between calls to GuardDogImpl::forceCheckForTest()
Risk Level: low, slight change to how watchdog timeouts are computed.
Testing: existing tests.
Docs Changes: n/a
Release Notes: n/a
Issue #11391