
Checkpointing race condition #18901

Closed
srstrickland opened this issue Oct 20, 2023 · 4 comments
Labels
source: file (anything `file` source related)
source: kubernetes_logs (anything `kubernetes_logs` source related)
type: bug (a code related bug)

Comments

srstrickland (Contributor) commented Oct 20, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I was experiencing a low rate of checkpointing errors while using a multi-threaded agent with a (very) aggressive glob_minimum_cooldown_ms (1s):
[Screenshot (2023-10-20): graph of checkpointing error rate over time]

(note the log scale of the y-axis)

Here is a corresponding log event for one of these errors:

{
  "error": "No such file or directory (os error 2)",
  "error_code": "writing_checkpoints",
  "error_type": "writer_failed",
  "host": "vector-agent-hf742",
  "internal_log_rate_limit": true,
  "message": "Failed writing checkpoints.",
  "metadata": {
    "kind": "event",
    "level": "ERROR",
    "module_path": "vector::internal_events::file::source",
    "target": "vector::internal_events::file::source"
  },
  "pid": 1,
  "source_type": "internal_logs",
  "stage": "receiving"
}

The vast majority of checkpoint writes succeed, and I've confirmed that the checkpoint file is being updated regularly.

The problem seems to be related to this block, specifically the fs::rename call. Since a static temp file name is used, there could be a race condition between multiple threads running this block concurrently: both write data to the temp checkpoint file (and one of them "wins"), then both try to rename it; one rename succeeds and the other fails with the error above.
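
For reference, here is roughly the pattern as I understand it (a hand-written sketch with placeholder names, not the actual Vector code):

use std::fs;
use std::io::Write;
use std::path::Path;

// Sketch only: `state` and the file names are placeholders.
fn write_checkpoints(data_dir: &Path, state: &[u8]) -> std::io::Result<()> {
    // Every writer uses the *same* temp file name.
    let tmp = data_dir.join("checkpoints.new.json");
    let dst = data_dir.join("checkpoints.json");

    let mut f = fs::File::create(&tmp)?;
    f.write_all(state)?;
    f.sync_all()?;

    // If two writers reach this point at nearly the same time, the first
    // rename moves the temp file away and the second fails with
    // "No such file or directory (os error 2)".
    fs::rename(&tmp, &dst)?;
    Ok(())
}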

NOTE: I am not sure about the threading model; I haven't gotten that far. So at least part of this is conjecture. However, I was able to confirm the behavior with inotify:

root@vector-agent-88skl:/# inotifywait -m -r -e create -e close -e move --timefmt '%F %T' --format '%T %w%f %e' /vector-data-dir/kubernetes_logs/
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json CREATE
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json CLOSE_WRITE,CLOSE
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json MOVED_FROM
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.json MOVED_TO
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json CREATE
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json CLOSE_WRITE,CLOSE
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json MOVED_FROM
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.json MOVED_TO

In this specific example, the two attempts to write the checkpoints file were sufficiently separated to not trigger the condition, but it highlights how things are working.

I have since increased glob_minimum_cooldown_ms, which should reduce the likelihood of hitting this problem, and for now I have changed my agents to use a single thread (which, if my theory is correct, should eliminate the error entirely). But I don't like either of these workarounds, particularly dropping to a single thread.

What concerns me possibly even more are the implications of this behavior beyond what the errors actually show. If there are multiple threads, each trying to write its own portion of the checkpoint information, is it possible that some updates are being stomped on? I haven't gotten around to chasing this down yet, either.

An "easy" solution might just be to generate a fully-unique (or thread-unique) temp file name on every write, so each thread has its own file to write and move (see the sketch below). But if there is a stomping problem where near-simultaneous writes cause one thread's updates to be written but not another's, maybe a more sophisticated solution is in order (e.g. with locking and partial updates). For that, I'll defer to someone more familiar with the codebase.
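
To illustrate the first idea, a minimal sketch (the uniqueness scheme here, pid plus a timestamp, is just an example):

use std::fs;
use std::io::Write;
use std::path::Path;
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch only: give each write its own temp file so concurrent writers never
// rename the same temp file out from under each other.
fn write_checkpoints_unique(data_dir: &Path, state: &[u8]) -> std::io::Result<()> {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or_default();
    let tmp = data_dir.join(format!("checkpoints.{}.{}.json.tmp", process::id(), nanos));
    let dst = data_dir.join("checkpoints.json");

    let mut f = fs::File::create(&tmp)?;
    f.write_all(state)?;
    f.sync_all()?;

    // rename() is atomic within a filesystem, so readers always see a complete
    // checkpoints.json; the "No such file" error goes away because each writer
    // owns its temp file.
    fs::rename(&tmp, &dst)?;
    Ok(())
}

Note that this only removes the rename failure; near-simultaneous writes would still be last-writer-wins on checkpoints.json, which is why the stomping question above still matters.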

Configuration

kubernetes_logs:
  type: kubernetes_logs
  glob_minimum_cooldown_ms: 1000

Version

vector 0.33.0 (x86_64-unknown-linux-gnu 89605fb 2023-09-27 14:18:24.180809939)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@srstrickland srstrickland added the type: bug A code related bug. label Oct 20, 2023
bruceg (Member) commented Nov 21, 2023

This is interesting, but I'm not sure how it can actually happen. Two lines above the code block referenced above, the last value is locked (with a Mutex) and the lock is held for the duration of the write sequence. This means it should not be possible to have concurrent writes to that file. Not only that, but there is a single checkpoint writer task spawned per file server source component, of which there should only be one per file or kubernetes source instance.
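
Roughly, the structure there is something like this (a sketch of the shape, not the actual code):

use std::fs;
use std::io::Write;
use std::path::PathBuf;
use std::sync::Mutex;

// Sketch: field and type names are placeholders.
struct Checkpointer {
    last: Mutex<Vec<u8>>, // serialized checkpoint state
    tmp: PathBuf,         // path to checkpoints.new.json
    dst: PathBuf,         // path to checkpoints.json
}

impl Checkpointer {
    fn write_checkpoints(&self, state: Vec<u8>) -> std::io::Result<()> {
        // The guard is held until this function returns, so two callers on the
        // same Checkpointer can never be inside the write + rename at once.
        let mut last = self.last.lock().unwrap();
        *last = state;

        let mut f = fs::File::create(&self.tmp)?;
        f.write_all(&last[..])?;
        f.sync_all()?;
        fs::rename(&self.tmp, &self.dst)
    }
}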

Could we see the Vector configuration, to verify the details of the setup here?

@bruceg bruceg added source: file Anything `file` source related source: kubernetes_logs Anything `kubernetes_logs` source related labels Nov 21, 2023
srstrickland (Contributor, Author) commented

Yeah, it's definitely possible that my diagnosis is incorrect. I'm new to Rust.

As for config, I am using a very basic kubernetes_log config:

kubernetes_logs:
  type: kubernetes_logs
  glob_minimum_cooldown_ms: 10000
  ingestion_timestamp_field: metadata.time.ingested
  use_apiserver_cache: true

The rest of the helm chart & config are pretty vanilla as well.

When I changed the launch params to include --threads=1, the errors diminished significantly, but did not disappear:
[Screenshot (2023-11-22): error rate graph showing reduced, but nonzero, checkpointing errors]

However, since upgrading from 0.33.0 to 0.34.0, the errors have completely disappeared:
[Screenshot (2023-11-22): error rate graph showing the errors dropping to zero]

I will remove the thread pinning and see what happens. Maybe this was fixed by virtue of something else.

bruceg (Member) commented Nov 22, 2023

That is even more interesting and puzzling. I am not aware of anything that changed between 0.33.0 and 0.34.0 in the area of the file source other than some trivial reorganizations (that didn't even touch actual code). Thanks for the update. I look forward to hearing the results of your continued testing.

srstrickland (Contributor, Author) commented

After bumping the thread count back up, the issue still has not come back since rolling out 0.34.0, so I'm going to close this. It's not an entirely satisfying conclusion, since it's still a bit mysterious, but I don't think it's worth digging into much further.
