-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Config: make writing to disk as atomic as possible #4607
Config: make writing to disk as atomic as possible #4607
Conversation
91714e0
to
6d08b19
Compare
Codecov Report
@@ Coverage Diff @@
## release/1.5.1 #4607 +/- ##
=================================================
+ Coverage 79.50% 79.50% +0.01%
=================================================
Files 482 482
Lines 35315 35320 +5
=================================================
+ Hits 28073 28078 +5
Misses 7242 7242
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sphuber , just some small comments
timestamp = os.path.getmtime(self.config_filepath) | ||
Config.from_file(self.config_filepath) | ||
|
||
# Sleep a second, because for some operating systems the time resolution is of the order of a second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, which OS has 1s time resolution?
It's interesting that this can work with processors that have clock cycles below ns ;-)
As far as I'm aware, tests should run "in the blink of an eye", i.e. 1s is already too long.. 0.1s would already be better..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l'histoire se répete
TLDR: blame MacOS :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure what type of time resolution @giovannipizzi was referring to, but on my MacOS it is quite substantially better
In [1]: from os.path import getmtime
In [5]: from time import sleep
In [7]: from pathlib import Path
In [8]: a = Path('a')
In [9]: for i in range(10):
...: sleep(0.1)
...: a.touch()
...: print(getmtime('a'))
...:
1606909606.250035
1606909606.350413
1606909606.452756
1606909606.556789
1606909606.658411
1606909606.758832
1606909606.864015
1606909606.968789
1606909607.069461
1606909607.171968
In [10]: for i in range(10):
...: sleep(0.01)
...: a.touch()
...: print(getmtime('a'))
...:
1606909614.754383
1606909614.765453
1606909614.775664
1606909614.785863
1606909614.796015
1606909614.806161
1606909614.817056
1606909614.827201
1606909614.837322
1606909614.848603
In [11]: for i in range(10):
...: sleep(0.001)
...: a.touch()
...: print(getmtime('a'))
...:
1606909621.740552
1606909621.741794
1606909621.74325
1606909621.744598
1606909621.745944
1606909621.747289
1606909621.748634
1606909621.749978
1606909621.751066
1606909621.752403
In [12]: for i in range(10):
...: sleep(0.0001)
...: a.touch()
...: print(getmtime('a'))
...:
1606909728.212414
1606909728.212609
1606909728.212802
1606909728.213136
1606909728.213311
1606909728.213494
1606909728.21371
1606909728.213929
1606909728.21422
1606909728.214399
In [13]: for i in range(10):
...: a.touch()
...: print(getmtime('a'))
...:
1606909932.093822
1606909932.093957
1606909932.094021
1606909932.094065
1606909932.094104
1606909932.094144
1606909932.094179
1606909932.094213
1606909932.094248
1606909932.094292
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this problem has been addressed in at least some of the MacOS versions, maybe it would be best to remove these sleeps. Note that this is not the only place where I am adding it. It is used for a lot of the verdi
tests that need to use the interactive writer. I would leave that for a separate PR though. I would be fine to remove it since the only problem it tackles is that certain unit tests will fail on certain MacOS systems. Probably not reason enough to slow down many of these tests all the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So:
- I have fixed the (testing) issue already upstream in click some time ago - this will appear in v8.0 (it's now in v8.0.0a1 according to GitHub)
- According to my old comments in that commit, the precision is 2 seconds (!!) on FAT32!
- According to my commit, at the time I was using Mac OS X 10.12. In 10.13 Apple switched to APFS that I guess fixed this and that must be why Leo does not see it anymore (BTW, it introduced a much more serious bug IMO with atomicity of file replacements, that I unveiled when hard-testing disk-objectstore! Still no reply from the developers, though). Since this is only for testing so it matters only to us developers, and I think nowadays probably everybody is developing on at least 10.13, probably we could do a PR and clean up these sleeps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to bring references: in the official Microsoft docs they say
[...] the resolution of create time on FAT is 10 milliseconds, while write time has a resolution of 2 seconds and access time has a resolution of 1 day, so it is really the access date. The NTFS file system delays updates to the last access time for a file by up to 1 hour after the last access.
And then you guys complain about the 1s resolution on older Macs? :-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @giovannipizzi for the info, that's good to know!
Does this mean that since we are now always creating a temporary file and renaming it, we actually have a time resolution of 10ms on FAT32 as well?
In general, one could also think of an alternative way of testing this: as long as we know where in our code the config file can be written (e.g. the atomic_write function), we could use pytest-mock
to mock this function and just check whether it is called. Then we don't need to rely on timing updates at the FS level.
The original implementation of the `Config.store` method, that writes the configuration from memory to disk, was using `shutil.copy` which is not an atomic operation. It is therefore possible that if the copy operation is interrupted the configuration is not written entirely and is left either empty or in a corrupted state. Making this an atomic operation in 100% of the cases is not trivial and even on the application level there are no guarantees as even on the file system level operations that should be atomic can be interrupted under very rare circumstances. However, we can make a significant improvement by replacing the use of `shutil.copy` with `os.rename`, which because it is a file move operation should be atomic and therefore less prone to being interrupted. The new algorithm for storing the config is as follows: 1. Create temporary file in same directory as target file 2. Write content of in memory config to the temporary file and flush 3. Use `os.rename` to move the temporary file to target destination This should prevent most common cases that could corrupt the existing configuration file. Finally, in `Config.from_method` now only `FileNotFoundError` is caught when trying to read the config file from disk. Originally both `IOError` and `OSError` were being caught, which can also be raised if an existing file could not be read. Simply overwriting in this case with a default config is not desirable and so in this case we simply let the exception bubble up.
6d08b19
to
4f2ab82
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, in that case good to merge from my side.
let's follow up on the timing issue though
Do you think this fixes issue #4604 and so I close it? Or do we leave it open? |
Let's close it and reopen in case it reoccurs |
Potentially addresses #4604
The original implementation of the
Config.store
method, that writesthe configuration from memory to disk, was using
shutil.copy
which isnot an atomic operation. It is therefore possible that if the copy
operation is interrupted the configuration is not written entirely and
is left either empty or in a corrupted state.
Making this an atomic operation in 100% of the cases is not trivial and
even on the application level there are no guarantees as even on the
file system level operations that should be atomic can be interrupted
under very rare circumstances.
However, we can make a significant improvement by replacing the use of
shutil.copy
withos.rename
, which because it is a file moveoperation should be atomic and therefore less prone to being
interrupted. The new algorithm for storing the config is as follows:
os.rename
to move the temporary file to target destinationThis should prevent most common cases that could corrupt the existing
configuration file.
Finally, in
Config.from_method
now onlyFileNotFoundError
is caughtwhen trying to read the config file from disk. Originally both
IOError
and
OSError
were being caught, which can also be raised if an existingfile could not be read. Simply overwriting in this case with a default
config is not desirable and so in this case we simply let the exception
bubble up.