FileCache assertion on previous copy attempt age triggered #1643
Comments
Now that #1644 has been merged, please check whether this problem still occurs.
This has not occurred again on the same problematic node.
Another instance of this error has occurred with RETURNN 8441d07, on a node where the local disk is close to full:
Furthermore, again it seems a lock file was removed by another process:
Do we log when we remove the lock file? E.g. log why it is being removed, the recent mtime, etc.
Yes:
We don't log the mtime, but we have set the lock timeout in the file cache to 20s by default.
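To make the logging suggestion concrete, here is a minimal sketch of a stale-lock check that logs the mtime and age before removing the lock file. This is not the RETURNN code: `remove_lock_if_stale`, `LOCK_TIMEOUT_SECS` and the logger name are hypothetical, and only the 20s default is taken from the comment above.

```python
import logging
import os
import time

logger = logging.getLogger("file_cache_sketch")  # hypothetical logger name

LOCK_TIMEOUT_SECS = 20.0  # the default lock timeout mentioned above


def remove_lock_if_stale(lock_path: str, timeout: float = LOCK_TIMEOUT_SECS) -> bool:
    """Remove lock_path if its mtime is older than timeout. Return True if removed."""
    try:
        mtime = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False  # the lock is already gone
    age = time.time() - mtime
    if age <= timeout:
        return False  # the owner still appears to be alive
    # Log why the lock is removed, including the observed mtime.
    logger.warning(
        "Removing stale lock %s: mtime=%.3f, age=%.1fs > timeout=%.1fs",
        lock_path, mtime, age, timeout,
    )
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        return False  # another process removed it first
    return True
```

With the age in the log, a removal at 20.5s (owner probably just slow) can be told apart from one at several minutes (owner probably crashed).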
WDYT on adding another check to
See #1663
I don't like this. But also, I don't really understand the issue. How does this actually happen? Why is the mtime not reliable? How can that be? We should be able to make this reliable, shouldn't we? Then no other weird workaround would be needed.
A (very) slow FS could be one reason the mtime is unreliable, no? In that case I don't know if there is anything we can do to make it reliable besides increasing the lock timeout. That, in turn, hurts usability in the good case, where the FS is not slow but a crashed process has left stale lock files behind.
But we are writing to a local disk here? I would expect it to be busy for a few milliseconds at most. That cannot explain why the mtime is unreliable. Maybe some weird FS flags used when mounting the disk could cause this? But I don't think so. So I highly doubt that this is the problem. To me it looks like a bug somewhere. Before we add any further hacks/workarounds, we should understand what goes wrong here.
Another part of the error message is
I think this is because the _TouchFilesThread runs I/O operations while iterating the […]. This is probably the issue here. The touch files thread crashes and no longer updates the mtimes. 🤷🏼‍♂️
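If that diagnosis is right, one common way to avoid it is to copy the set of paths under a lock and do the I/O on the copy, so that concurrent add/remove calls cannot disturb the iteration. A rough sketch under that assumption follows. `TouchFilesSketch` and all its members are hypothetical and do not mirror the actual `_TouchFilesThread` implementation.

```python
import os
import threading
from typing import Set


class TouchFilesSketch(threading.Thread):
    """Hypothetical stand-in for a background thread that keeps mtimes fresh."""

    def __init__(self, interval: float = 1.0):
        super().__init__(daemon=True)
        self.interval = interval
        self._paths: Set[str] = set()
        self._paths_lock = threading.Lock()
        self._stop_event = threading.Event()

    def add(self, path: str) -> None:
        with self._paths_lock:
            self._paths.add(path)

    def remove(self, path: str) -> None:
        with self._paths_lock:
            self._paths.discard(path)

    def touch_all(self) -> None:
        # Snapshot under the lock, then do the (possibly slow) I/O without it.
        with self._paths_lock:
            paths = list(self._paths)
        for path in paths:
            try:
                os.utime(path, None)  # set atime/mtime to the current time
            except OSError:
                pass  # a missing file must not kill the thread

    def run(self) -> None:
        # Event.wait returns False on timeout, True once stop() was called.
        while not self._stop_event.wait(self.interval):
            self.touch_all()

    def stop(self) -> None:
        self._stop_event.set()
```

Catching `OSError` per file matters as much as the snapshot: a single failed touch should not end the loop and silently stop all mtime updates.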
We also get:
One process acquires the lock file but then seems to fail to update the mtime frequently enough for the others to wait. Another process then deletes the lock file and starts copying by itself as well, leading to the assertion triggering because the other process is still copying the data.
This is probably due to FS slowness. I'm not sure what to best do here. We can
In any case we can add the lock file to the touch files thread immediately after acquiring it so that the gap between acquiring the lock and trying to update its mtime is as small as possible and includes no FS operations in between that could cause additional delays.
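As a sketch of that last point, assuming the lock is taken by exclusive file creation: the lock file is handed to the mtime updater as the very first step after the create succeeds. `acquire_lock_and_register` and the `register` callback are hypothetical names for illustration, not the existing file cache API.

```python
import os
from typing import Callable


def acquire_lock_and_register(lock_path: str, register: Callable[[str], None]) -> bool:
    """Try to create lock_path exclusively; on success register it right away.

    Returns False if another process already holds the lock.
    """
    try:
        # O_CREAT | O_EXCL fails with FileExistsError if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    # Register before anything else, so the window in which nobody
    # refreshes the mtime is as short as possible.
    register(lock_path)
    os.close(fd)
    return True
```

Usage could look like `acquire_lock_and_register(path, touch_thread.add)`, where `touch_thread` is whatever object keeps the mtimes updated.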