
Checkpointing race condition #18901

Closed
srstrickland opened this issue Oct 20, 2023 · 4 comments
Labels
source: file (anything `file` source related)
source: kubernetes_logs (anything `kubernetes_logs` source related)
type: bug (a code related bug)

Comments

srstrickland (Contributor) commented Oct 20, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I was experiencing a low rate of checkpointing errors while using a multi-threaded agent with a (very) aggressive glob_minimum_cooldown_ms (1s):
[Screenshot (2023-10-20): graph of checkpointing error rate over time]

(note the log scale of the y-axis)

Here is a corresponding log event for one of these errors:

{
  "error": "No such file or directory (os error 2)",
  "error_code": "writing_checkpoints",
  "error_type": "writer_failed",
  "host": "vector-agent-hf742",
  "internal_log_rate_limit": true,
  "message": "Failed writing checkpoints.",
  "metadata": {
    "kind": "event",
    "level": "ERROR",
    "module_path": "vector::internal_events::file::source",
    "target": "vector::internal_events::file::source"
  },
  "pid": 1,
  "source_type": "internal_logs",
  "stage": "receiving"
}

The vast majority of checkpoint writes succeed, and I've confirmed that the checkpoint file is being updated regularly.

The problem seems to be related to this block, specifically the fs::rename call. Since a static temp file name is used, there could be a race condition between multiple threads running this block concurrently: both write data to the temp checkpoint file (and one of them "wins"), then both try to rename it; one rename succeeds and the other fails with the error above.
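
For reference, here is roughly the pattern as I understand it (a hand-written sketch with placeholder names, not the actual Vector code):

use std::fs;
use std::io::Write;
use std::path::Path;

// Sketch only: `state` and the file names are placeholders.
fn write_checkpoints(data_dir: &Path, state: &[u8]) -> std::io::Result<()> {
    // Every writer uses the *same* temp file name.
    let tmp = data_dir.join("checkpoints.new.json");
    let dst = data_dir.join("checkpoints.json");

    let mut f = fs::File::create(&tmp)?;
    f.write_all(state)?;
    f.sync_all()?;

    // If two writers reach this point at nearly the same time, the first
    // rename moves the temp file away and the second fails with
    // "No such file or directory (os error 2)".
    fs::rename(&tmp, &dst)?;
    Ok(())
}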

NOTE: I am not sure about the threading model; I haven't gotten that far. So at least part of this is conjecture. However, I was able to confirm the behavior with inotify:

root@vector-agent-88skl:/# inotifywait -m -r -e create -e close -e move --timefmt '%F %T' --format '%T %w%f %e' /vector-data-dir/kubernetes_logs/
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json CREATE
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json CLOSE_WRITE,CLOSE
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json MOVED_FROM
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.json MOVED_TO
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json CREATE
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json CLOSE_WRITE,CLOSE
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.new.json MOVED_FROM
2023-10-20 19:57:58 /vector-data-dir/kubernetes_logs/checkpoints.json MOVED_TO

In this specific example, the two attempts to write the checkpoints file were sufficiently separated to not trigger the condition, but it highlights how things are working.

I have since increased glob_minimum_cooldown_ms, which should reduce the likelihood of hitting this problem, and for now I have changed my agents to use a single thread (which, if my theory is correct, should eliminate the error entirely). But I don't like either of these workarounds, particularly dropping to a single thread.

What concerns me possibly even more are the implications of this behavior beyond what the errors actually show. If there are multiple threads, each trying to write its own portion of the checkpoint information, is it possible that some updates are being stomped on? I haven't gotten around to chasing this down yet, either.

An "easy" solution might just be to generate a fully-unique (or thread-unique) temp file name on every write, so each thread has its own file to write and move (see the sketch below). But if there is a stomping problem where near-simultaneous writes cause one thread's updates to be written but not another's, maybe a more sophisticated solution is in order (e.g. with locking and partial updates). For that, I'll defer to someone more familiar with the codebase.
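
To illustrate the first idea, a minimal sketch (the uniqueness scheme here, pid plus a timestamp, is just an example):

use std::fs;
use std::io::Write;
use std::path::Path;
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch only: give each write its own temp file so concurrent writers never
// rename the same temp file out from under each other.
fn write_checkpoints_unique(data_dir: &Path, state: &[u8]) -> std::io::Result<()> {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or_default();
    let tmp = data_dir.join(format!("checkpoints.{}.{}.json.tmp", process::id(), nanos));
    let dst = data_dir.join("checkpoints.json");

    let mut f = fs::File::create(&tmp)?;
    f.write_all(state)?;
    f.sync_all()?;

    // rename() is atomic within a filesystem, so readers always see a complete
    // checkpoints.json; the "No such file" error goes away because each writer
    // owns its temp file.
    fs::rename(&tmp, &dst)?;
    Ok(())
}

Note that this only removes the rename failure; near-simultaneous writes would still be last-writer-wins on checkpoints.json, which is why the stomping question above still matters.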

Configuration

kubernetes_logs:
  type: kubernetes_logs
  glob_minimum_cooldown_ms: 1000

Version

vector 0.33.0 (x86_64-unknown-linux-gnu 89605fb 2023-09-27 14:18:24.180809939)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@srstrickland srstrickland added the type: bug A code related bug. label Oct 20, 2023
bruceg (Member) commented Nov 21, 2023

This is interesting, but I'm not sure how it can actually happen. Two lines above the code block referenced above, the last value is locked (with a Mutex) and the lock is held for the duration of the write sequence. This means it should not be possible to have concurrent writes to that file. Not only that, but there is a single checkpoint writer task spawned per file server source component, of which there should only be one per file or kubernetes source instance.
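
Roughly, the structure there is something like this (a sketch of the shape, not the actual code):

use std::fs;
use std::io::Write;
use std::path::PathBuf;
use std::sync::Mutex;

// Sketch: field and type names are placeholders.
struct Checkpointer {
    last: Mutex<Vec<u8>>, // serialized checkpoint state
    tmp: PathBuf,         // path to checkpoints.new.json
    dst: PathBuf,         // path to checkpoints.json
}

impl Checkpointer {
    fn write_checkpoints(&self, state: Vec<u8>) -> std::io::Result<()> {
        // The guard is held until this function returns, so two callers on the
        // same Checkpointer can never be inside the write + rename at once.
        let mut last = self.last.lock().unwrap();
        *last = state;

        let mut f = fs::File::create(&self.tmp)?;
        f.write_all(&last[..])?;
        f.sync_all()?;
        fs::rename(&self.tmp, &self.dst)
    }
}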

Could we see the Vector configuration, to verify the details of the setup here?

@bruceg bruceg added source: file Anything `file` source related source: kubernetes_logs Anything `kubernetes_logs` source related labels Nov 21, 2023
srstrickland (Contributor, Author) commented

Yeah, it's definitely possible that my diagnosis is incorrect. I'm new to Rust.

As for config, I am using a very basic kubernetes_log config:

kubernetes_logs:
  type: kubernetes_logs
  glob_minimum_cooldown_ms: 10000
  ingestion_timestamp_field: metadata.time.ingested
  use_apiserver_cache: true

The rest of the helm chart & config are pretty vanilla as well.

When I changed the launch params to include --threads=1, the errors diminished significantly, but did not disappear:
[Screenshot (2023-11-22): error rate graph showing reduced, but nonzero, checkpointing errors]

However, since upgrading from 0.33.0 to 0.34.0, the errors have completely disappeared:
[Screenshot (2023-11-22): error rate graph showing the errors dropping to zero]

I will remove the thread pinning and see what happens. Maybe this was fixed by virtue of something else.

bruceg (Member) commented Nov 22, 2023

That is even more interesting and puzzling. I am not aware of anything that changed between 0.33.0 and 0.34.0 in the area of the file source other than some trivial reorganizations (that didn't even touch actual code). Thanks for the update. I look forward to hearing the results of your continued testing.

srstrickland (Contributor, Author) commented

After bumping the thread count back up, the issue still has not come back since rolling out 0.34.0, so I'm going to close this. It's not an entirely satisfying conclusion, since it's still a bit mysterious, but I don't think it's worth digging into much further.
