-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing config file lock mechanism #352
Conversation
Hi @sverdlov93 I created a short reproducer script:
This typically triggers the corruption within a few seconds up to one Minute or so (be sure to back up your jfrog-cli.conf ;) ):
|
Hi @HWoidt, I am currently working on the fix, and have some more changes and tests to add. |
Hi, @HWoidt.
If you want to test it on your machine you can pull JFrog CLI PR and run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sverdlov93 ,
Sorry for letting you wait. I had other stuff on my plate.
I have now tested your changes on our Ubuntu 20.04 machine with the bash script I provided earlier:
- When running tasks concurrently the command often fails because the Stat-call (lock.go:169) on an already deleted lock-file causes the command to fail instead of being retried. I changed this locally and made the following observations with the Stat error ignored.
- There sometimes is a deadlock, both processes are waiting until their lock-retry loop runs out after ~2 minutes and then return with a failure. I suspect this to be due to two files having the same timestamp. This tie can occur with higher frequency than before because the filesystem timestamps have lower resolution.
- The corruption can still be observed, although with vastly lower probability. This occurred maybe 2 times in the 3h that I left it running with 40 concurrent processes. I fear that we are running into the same conceptual problem as with the previous solution: The file-creation and timestamp generation are not within the same critical section. Probably the filesystem first prepares all file meta-data (including the timestamp) and only then commits the changes to disk, allowing for a time window in which another process can run through the whole sequence of creating the file and checking all timestamps. Since this occurs within the same system call, the time between timestamp-creation and visibility of the new file is shorter, resulting in a lower probability of hitting this issue.
With your changes I see much reduced probability for this to occur. We initially ran into this after about 1 Month, while starting three concurrent jf-config processes at the same time once a day. Considering that in the tests the error now happened after more than 30min compared to the 10-20s it was before and with 4 times as many threads, we can assume at least a factor of 100x reduction in probability. On our CI server we have circumvented this issue now by assigning individual config files per worker, using the JFROG_CLI_HOME_DIR environment variable.
For correctness it would probably be best to use a mechanism explicitly designed for file locking (i.e. LockFileEx on windows and flock on linux). There is already a cross platform go implementation used for exactly the same usecase of protecting configfiles in the "go" commandline tool: go.internal.lockedfile. Unfortunately this is an internal package. There is a project containing useful internal packages from the go standard library which exports this though: https://github.com/rogpeppe/go-internal
} | ||
// Last element is the timestamp. | ||
time, err := strconv.ParseInt(splitted[4], 10, 64) | ||
fileInfo, err := os.Stat(path) | ||
if err != nil { | ||
return nil, errorutils.CheckError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Between getting the filesList and calling stat on the file, the other process may well be already done. In that case the file does not exist anymore which is normal and no cause for concern. As such, this error should be ignored here:
return nil, errorutils.CheckError(err) | |
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @HWoidt,
Actually, when I worked on the fix I tried using LockFileEx/flock/lockedfile and some others.
Some worked for specific OS only and some were deprecated or unmaintained.
BTW, I saw that Go has already accepted exporting part of the lockedfile code but it's not merged yet.
I improved the Stat section and I will try to look again at the timing issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…prove-lock � Conflicts: � common/commands/config.go
case Clear: | ||
return cc.clear() | ||
default: | ||
return fmt.Errorf("Not supported config command type: " + string(cc.cmdType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not supported... --> Unsupported...
@@ -211,7 +234,7 @@ func (cc *ConfigCommand) prepareConfigurationData() ([]*config.ServerDetails, er | |||
return configurations, err | |||
} | |||
|
|||
/// Returning the first non empty value: | |||
/// Returning the first non-empty value: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// --> //
Fix config commands locking mechanism.
Make all config commands that write to config file - atomic.
Also, get lock's file modified time from os, instead of using time.Now() (as suggested by @HWoidt)
should fix #350