Use atomic write when persisting cache #9981
Merged
Summary
Fixes #8147
I have proof that the cache gets corrupted because of a race between different ruff processes writing to the same cache file but with different content (e.g. format writes the cache with the "old" lint results and lint updates the lint results).

Writes from multiple processes to the same cache file can interleave because POSIX only guarantees that a single write call is atomic, but our implementation uses a BufWriter that chunks the data into multiple write calls if necessary.

This PR changes our persist implementation to write to a temporary file and rename it on success. Renaming is guaranteed to be atomic. This approach has the added benefit of preventing cache corruption if ruff dies while writing the cache data (SIGKILL, panic, the computer shuts down...).
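The write-then-rename approach described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the function name write_atomically, the temporary file naming scheme, and the counter are all hypothetical, and the real persist implementation may rely on a crate for temporary files instead. The sketch assumes the temporary file lives in the same directory as the target so that both are on the same filesystem, which is what lets the rename replace the target in one step.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};

// Per-process counter so that concurrent calls inside one process
// never pick the same temporary file name (hypothetical naming scheme).
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Hypothetical helper, not ruff's actual `persist`: writes `bytes` to a
/// temporary file next to `target`, then renames it over `target`.
/// Readers see either the old file or the complete new file.
fn write_atomically(target: &Path, bytes: &[u8]) -> io::Result<()> {
    // Keep the temporary file in the target's directory (same filesystem).
    let dir = target
        .parent()
        .filter(|p| !p.as_os_str().is_empty())
        .unwrap_or(Path::new("."));
    let file_name = target
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "target has no file name"))?;

    // Process id plus counter keeps names distinct across and within processes.
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    let tmp = dir.join(format!(
        ".{}.{}.{}.tmp",
        file_name.to_string_lossy(),
        process::id(),
        n
    ));

    let result = (|| -> io::Result<()> {
        let mut file = File::create(&tmp)?;
        // No BufWriter: the whole payload is handed over in one `write_all`.
        file.write_all(bytes)?;
        drop(file);
        // Replace the target only after the data was written successfully.
        fs::rename(&tmp, target)
    })();

    // On failure, try to clean up the temporary file; ignore cleanup errors.
    if result.is_err() {
        let _ = fs::remove_file(&tmp);
    }
    result
}
```

If the process dies before the rename, the worst case under these assumptions is a stray temporary file; the existing cache file is left untouched.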
This PR removes the BufWriter because I noticed that the implementation became slower when using a temporary file and the BufWriter (I added the recommended flush call to the BufWriter).

Removing the BufWriter gives us about the same performance for the CPython benchmark with the default rules but results in a ~2% speedup when selecting all rules (instead of a 5% slowdown due to the use of a tempfile).

Test Plan
I wrote a script to reproduce my theory. The script starts ten ruff instances in a loop with the default or all rules (coinflip). The instances must use different rules or all instances write the same cache file, which makes it impossible to show the race.

I had to patch up the cache to a) use the same cache regardless of the settings, and b) never return cached data but always write to the cache. This allowed me to reproduce the bug fairly consistently on main. I'm no longer able to reproduce the issue with the changes from this PR:
Benchmarks