Checkpointing race condition #18901
Comments
This is interesting, but I'm not sure how it can actually happen. Two lines above the code block referenced above, the […]

Could we see the Vector configuration to verify the details of the setup here?
That is even more interesting and puzzling. I am not aware of anything that changed between 0.33.0 and 0.34.0 in the area of the file source other than some trivial reorganizations (that didn't even touch actual code). Thanks for the update. I look forward to hearing the results of your continued testing.
After bumping threads back up, the issue has still not come back since rolling out 0.34.0, so going to close this. Not necessarily a satisfying conclusion, since it's still a bit mysterious, but I think it's probably not worth digging into too much.
Problem
I was experiencing a low rate of checkpointing errors while using a multi-threaded agent with a (very) aggressive `glob_minimum_cooldown_ms` (1s):

[chart: checkpointing error rate over time; note the log scale of the y-axis]
Here is a corresponding log event for one of these errors:
A vast majority of checkpoints are successful, and I've confirmed that the checkpoint file is being updated regularly.
The problem seems to be related to this block, specifically the `fs::rename` call. Since a static temp file name is used, there could be a race condition between multiple threads running this block concurrently: both write data to the temp checkpoint (and one of them "wins"), then they both go to rename the file... and one of them wins, and the other emits an error.

NOTE: I am not sure about the threading model; I haven't gotten that far, so at least part of this is conjecture. However, I was able to confirm the behavior with inotify:
In this specific example, the two attempts to write the checkpoints file were sufficiently separated to not trigger the condition, but it highlights how things are working.
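
To make the suspected pattern concrete, here is a minimal, self-contained Rust sketch of what I think is happening. This is not Vector's actual code; the file names (`checkpoints.new`, `checkpoints.json`) and the two-thread setup are only illustrative. Every writer serializes to the same temp path and then renames it over the stable checkpoints file, so a second writer can find the temp file already gone when it calls `fs::rename`.

```rust
use std::{fs, io::Write, path::Path, thread};

// Sketch of the suspected write-then-rename pattern with a temp file name
// that is shared by all writers (names are illustrative, not Vector's).
fn write_checkpoints(dir: &Path, contents: &str) -> std::io::Result<()> {
    let tmp = dir.join("checkpoints.new"); // static temp name shared by every writer
    let stable = dir.join("checkpoints.json");

    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents.as_bytes())?;
    file.sync_all()?;

    // If another writer renamed `checkpoints.new` away between our write and
    // this call, the rename fails (ENOENT) even though a checkpoint was persisted.
    fs::rename(&tmp, &stable)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("checkpoint-race-demo");
    fs::create_dir_all(&dir)?;

    // Spawn two writers that race on the shared temp file name. The race may
    // or may not manifest on any given run; it depends on interleaving.
    let handles: Vec<_> = (0..2)
        .map(|i| {
            let dir = dir.clone();
            thread::spawn(move || write_checkpoints(&dir, &format!("writer {i}")))
        })
        .collect();

    for handle in handles {
        // One writer may report the rename failure when it loses the race.
        if let Err(err) = handle.join().unwrap() {
            eprintln!("checkpoint rename failed: {err}");
        }
    }
    Ok(())
}
```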
I have since increased the `glob_minimum_cooldown_ms` (which should reduce the likelihood of seeing this problem) and, for now, changed my agents to use a single thread (which, if my theory is correct, should completely eliminate the error going forward), but I don't like either of these workarounds, particularly reducing to a single thread.

What concerns me possibly even more are the implications of this behavior beyond what the errors actually show. If there are multiple threads, each trying to write their portion of the checkpoint information, is it possible that some updates are being stomped on? I haven't gotten around to chasing this down yet, either.
An "easy" solution might just be to generate a fully-unique (or thread-unique) temp file name on every write, so each thread has its own file to write & move. But if there is some stomping problem where near-simultaneous writes might cause one threads' updates to be written but not another, maybe a more sophisticated solution is in order (e.g. w/ locking & partial updates). For that, I'll defer to someone a bit more familiar with the codebase.
Configuration
Version
vector 0.33.0 (x86_64-unknown-linux-gnu 89605fb 2023-09-27 14:18:24.180809939)
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response