1.7.0-beta3 nightly: serious inconsistency and lost file when remote folder renamed while files uploaded to it #2296
Please also note that we have recently had a similar case from a user (albeit with older server/client versions, so I won't bother you with that). But the point is that this is a real problem.
I was not able to reproduce the problem here with the latest git from 1.7 on Linux. Regarding step 3: do you rename on the web interface or on the client's file system? Maybe you can provide the full log?
Step 3: rename the remote folder via the web page. I was able to reproduce it every time.
It is possible that we have should_update_etag set to true for files that we also need to propagate, in which case we must not write to the DB too early, as this could cause data loss. (cf: issue #2296)
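A toy sketch (hypothetical names, not the actual client code) of the ordering constraint mentioned above: the folder etag must only be written to the sync DB after propagation succeeded. If it were stored first and the upload then failed, the next run would see matching etags and never retry, silently losing the change.

```python
class SyncJournal:
    """Stand-in for the client's local sync database."""
    def __init__(self):
        self.etags = {}

def sync_folder(journal, path, server_etag, propagate):
    if journal.etags.get(path) == server_etag:
        return "skipped"            # etags match: nothing to do
    if not propagate():
        return "retry-next-run"     # etag NOT stored, so we try again
    journal.etags[path] = server_etag   # persist only after success
    return "synced"
```

With this ordering, a failed propagation leaves the journal untouched, so the folder is revisited on the next sync run instead of being skipped.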
I tried, and sometimes it works as expected and sometimes it does not. But there is still something fishy happening sometimes on the server, in which the server reports a 500 Internal Server Error and does not properly keep the file id, which leads to us duplicating some of the files (but we should not lose data).
This is log output in the server log that happens if there is a 500 server error:
@DeepDiver1975, @PVince81 can you guys please help?
Hello, I don't think that an occasional 500 from the server for a few files is the main problem for this issue (it is confusing for the user but should not critically affect the client logic). The upload starts and initially succeeds. Then the remote folder is moved, so the subsequent PUT requests from the client fail with 404. That's OK because the target path does not exist anymore. [[ At this point I would propose that after a few 404s for the PUT you check for the existence of the remote parent folder. Like this you can avoid many useless PUT requests into a remote tree that does not exist anymore. ]] However, on the next sync run, for some reason you do not detect that the folder was moved. If you correctly detected that, you'd then resume PUT requests to the new remote path and everything would be fine, even for those files for which you got 500 before. BTW, the log entries are confusing: they say "Error downloading..." for the "Up" direction. Could you please fix that? Or does that indicate another, deeper problem?
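The bracketed proposal above can be sketched as follows (all names are hypothetical, not actual client code): after a few consecutive 404 responses to PUT, probe the remote parent folder once instead of blindly PUTting the remaining files into a tree that may no longer exist.

```python
def upload_all(files, put, parent_exists, max_404=3):
    """Return (uploaded, skipped) file lists."""
    consecutive_404 = 0
    uploaded = []
    for i, name in enumerate(files):
        status = put(name)
        if status == 404:
            consecutive_404 += 1
            if consecutive_404 >= max_404 and not parent_exists():
                # Parent is gone (e.g. renamed remotely): stop early
                # and leave the rest for the next sync run.
                return uploaded, files[i + 1:]
        else:
            consecutive_404 = 0
            if status in (200, 201, 204):
                uploaded.append(name)
    return uploaded, []
```

The `parent_exists` probe would be a single PROPFIND on the parent collection, so the cost of the check is one request instead of one failing PUT per remaining file.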
I agree with @moscicki. But some of the warnings should also be looked into, like the invalid date format, and such cases should fail more gracefully without triggering those warnings. CC @schiesbn for the files_versions warning.
The client had a bug (fixed with commit 9b178c5) that led it to believe that a file had been uploaded successfully while there was an error, in some rare cases. This bug could have triggered the data loss that moscicki saw. But hopefully it is now fixed and there should no longer be data loss. However, we cannot detect a move if the file ids are not kept on the server.
(Regarding the "Error downloading" message, this string comes from Qt. We are trying to fix Qt to change the wording of the message: https://codereview.qt-project.org/83041/ )
Is this moving issue fixed in the latest nightly? Should I try my test again?
I tried against ownCloud 7.0.1 and client nightly build 1.7.0-20141020 on MacOSX. It does not work as expected. Very similar behaviour as originally described (including file loss). I also tried the same test against 1.6.3 -- the exact trace of operations is different but the result equally bad: lost files and full inconsistency. I guess there is something fundamentally broken here. Could you please check this thoroughly?
…ved on the server. Found while investigating #2296. The problem is that we should not remove a directory locally if it contains modified files. But the modification time of the directory does not necessarily change (so the instruction of the directory may still be NONE). We have to move the child_modified test a bit down to make it recursive.
One of the problems is that the server somehow seems to keep only some of the file ids; many file ids are new.
I think in this model you should not use file ids for anything but tracking remote moves. Otherwise you will be in trouble...
That's exactly what we do.
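A minimal sketch (hypothetical illustration, not actual client code) of "file ids only for tracking remote moves": if the server keeps the file id stable across a rename, the client can map the old path to the new one; if the server hands out a fresh id, the rename is undetectable, which is exactly the failure mode discussed in this thread.

```python
def detect_renames(old_listing, new_listing):
    """Both args map path -> file id; returns {old_path: new_path}."""
    old_by_id = {fid: path for path, fid in old_listing.items()}
    return {
        old_by_id[fid]: path
        for path, fid in new_listing.items()
        if fid in old_by_id and old_by_id[fid] != path
    }
```

For example, a listing change from `{"V": 1983}` to `{"V2": 1983}` yields the rename `V -> V2`, while `{"V2": 2001}` yields nothing, forcing the client to treat V as deleted and V2 as new.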
Same problem (file loss) on 1.7.0-beta4.
I thought I had fixed all the file loss problems.
@moscicki can you elaborate on this again, please? The 1.7.0 release is coming nearer...
Also, please provide the client log.
I was able to reproduce the file loss. I put files f_a.log to f_z.log into a directory and let them sync. After that, I touched all files in the dir source and renamed the directory on the server side. The files a_a.log to a_e.log got lost; all others ended up in the old directory called source. In the server log file I find these entries:
More client log files can be found on s3. |
The client log shows that the files are gone on the server. So it's the server that lost the files. Note that I was not able to reproduce this problem on my server (7.0.2, sqlite).
Exactly the same thing. I think you are also likely to experience it if you upload the files for the first time (as far as I remember).
@craigpg who can help here? @DeepDiver1975 or @PVince81? Thanks!
Yes, please work with @DeepDiver1975 and @PVince81. |
@moscicki did you rename files in all your test cases while uploading? Or are you saying that the simplest initial upload is causing the issues?
The linked ticket was mostly about downloading and renaming during download, so probably not directly related.
Trying to reproduce, here are my steps:
Result: there are now two folders on the server, "V" and "V2"
Regarding the wrong text in the error message ("downloading"), this is a bug in Qt: https://codereview.qt-project.org/83041/ |
Let's have a look at the access log:
After the rename all subsequent accesses get 404:
And here I see that the sync client is doing a MKCOL to recreate the "V" folder and then upload the files there:
Not sure, but @dragotin told me on IRC that the second MKCOL shouldn't happen. The sync client should find out that "V" was renamed to "V2" and use that instead. Maybe there's a bug in the order in which the discovery happens?
I'll do another test and check whether the etags of "V" and "V2" differ before and after the rename.
Ok, after renaming and before the second sync cycle, the etag has changed.
Third test run with Y->Y2: I put a breakpoint in … In that test run the etag of the folder did NOT change. This time the sync client could detect the rename and correctly uploaded all files into Y2. So it looks like if the flow is "perfect", in a way that the rename doesn't disturb any of the uploads, then the etag will stay the same. I'll do another run to confirm.
Ok, confirmed: when the flow is controlled and … Next step: find the part of …
The problem is that the fileid of V2 is no longer the fileid that V used to have (at least in the log I've seen).
@ogoffart never mind, we're past that already 😉 I need to dig into …
I'm observing a behavior I cannot explain. I added debug logging in many places, and what I see is that for some unknown reason the rename operation not only updates the cache entry of the folder (for the renaming + etag change), but it ALSO updates the cache entry for every file that is already correctly uploaded into that directory. It somehow seems that the "parent" property has changed. This strange update only happens if the upload process is running (race condition). If I rename the folder without uploading anything, only a single cache update is done. Needs further research.
I think I'm starting to understand what is happening. I did another test run and I observed that the rename operation that renames E to E2 will ALSO run … In a normal rename operation, the scanner is NEVER triggered. So the question is, why would the scanner be triggered in this case? One possible explanation is simply that the mtime of the folder has changed. If a separate process puts a file into that folder, it will automatically change the mtime of the folder on the FS level. A) Normally, if everything runs sequentially instead of in parallel, this is the upload process:
B) And when a rename is performed separately:
If we mix both, the initial checks (file_exists) will trigger code that notices that the FS mtime has changed, and this will trigger a file scan. If at this point the FS mtime of the folder "E" was not updated yet by step 7) from the upload, then the mtime is different and the code considers this a remote modification: trigger the file scanner. So if the check B) 1) happens between A) 5) and A) 7), it will wrongly trigger the file scanner... I'm still speculating and will dive even deeper to confirm this.
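The suspected race can be reduced to a toy model (not ownCloud code): the rename's initial check compares the folder's FS mtime with the cached mtime and treats any mismatch as an external modification, which triggers a rescan. If check B) 1) lands in the window where the upload has updated one of the two values but not yet the other, the mismatch is spurious and the scanner fires anyway.

```python
def rename_triggers_scan(fs_mtime, cache_mtime):
    """B) 1): mismatch -> assume external change -> trigger scanner."""
    return fs_mtime != cache_mtime

# Outside the race window, FS and cache agree: no spurious scan.
sequential = rename_triggers_scan(fs_mtime=100, cache_mtime=100)

# Inside the window, one side still holds the old value, so the
# rename wrongly concludes the folder was modified externally.
racy = rename_triggers_scan(fs_mtime=101, cache_mtime=100)
```

The fix direction implied by this model is to make the mtime bump and the cache update atomic with respect to the rename's check, or to skip the scan when an upload into the same folder is known to be in flight.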
@icewind1991 would it make sense to prevent folder scanning during a folder rename operation?
So, I debugged the sabre connector's …
So from what I see the mtime of "G" is changed twice: once when storing the part file and once when renaming it. The FS mtime values always matched the ones in the cache. "file_put_contents()" seems to trigger the file scanner (seen in my logs) to update the cache's mtime/etag based on the FS values. Not sure yet how it updates the parent's cache value. But I suspect that if there's a slow operation there, there might be a small time window where the FS mtime and cache mtime are not the same. Next step: debug into "file_put_contents()".
I debugged inside "file_put_contents()". There is a very short time window where FS mtime and cache mtime differ. But that window is so short that I doubt it is causing the issue I was able to reproduce easily several times. I'll add more debug statements and hope to find more clues...
How is that even possible? The file id of the folder changes after parallel upload+rename:
After:
I understand that the mtime or etag changes, but why the fileid?!
@PVince81: that's pretty hardcore debugging -- sorry I cannot help out much in PHP.
Here's a piece of the log, starting with the rename operation (request 544960902fa20):
But what happened with 1983 / the original "H" entry? Maybe one of the concurrent scan requests found it missing and deleted it? And another one found the new one and created it? It almost looks like we'd need some kind of scanner locking... avoid rescanning the same folder if another scanner is running, or something along those lines.
@moscicki I don't think that would help. Maybe it would just make the error messages from the client look less scary. The main problem is that the folder's etag changes on rename, which makes the client believe that the folder is NOT the same. The goal is to keep the etag after a rename (at least the latest version, the one that the client knows of) so that the client knows it's the same folder but with a different name. I don't see any other way to debug this except hard-core debugging. Race conditions are hard, and only logging makes it possible for me to see what happens... and every time it's not exactly the same thing that happens (I just got a run where the fileid was kept but the etag changed).
Ok, I can now confirm the reason for the fileid change (test run K->K2). Subdir entry before:
(table output not captured)
Here is the log, followed by my interpretation:
So basically this proves my theory about the changing fileid. It seems that having concurrently running scanners is a bad idea and likely to cause issues. I have the feeling that there are other possible concurrency cases that could go wrong. So we probably need to lock either at scanner level or at high-level file operation level (high-level as in close to the entry points). @icewind1991 I'd like to hear your input about how to improve this.
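The scanner-level locking floated above could look like this sketch (an assumed design, not actual ownCloud code): serialize scans of the same folder so two concurrent scanners cannot both rewrite the cache entry, which is how the fileid got replaced in the log above.

```python
import threading

class ScannerLocks:
    """Hands out one lock per folder path."""
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def scan(self, path, do_scan):
        # Look up (or lazily create) the lock for this folder.
        with self._guard:
            lock = self._locks.setdefault(path, threading.Lock())
        with lock:                  # one scanner per folder at a time
            return do_scan()
```

Note that in the real server the concurrent scans come from separate PHP requests rather than threads of one process, so the lock would have to live in shared state, for example a database-level lock on the cache row; the in-memory lock here only illustrates the serialization idea.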
Note: I had a case where the fileid stayed but the etag changed, so it's likely that there are other issues... it all depends on the order of all these operations...
@PVince81: I agree 409 would not help immediately, but it would be according to the WebDAV specs, and at some point maybe the client guys could profit from a difference between 404 and 409. For now, agreed, it's a rather cosmetic change. @ogoffart: I tried against our WebDAV backend on EOS (which apparently does not have this concurrency issue and keeps the fileid stable). So the first time I tried I thought something was messed up (no file loss, though). Then I retried 4 times (both with new and previously known files, see #2296 (comment)) and it seems to work! Unless I fell into a lucky race condition during the 4 retries, I would think that the issue has indeed been fixed in the client. For reference, here is the trace of operations with 1.7.0-beta4 on MacOSX:
I think this bug should not be a release blocker / gold ticket aka showstopper. Note that in my case I can reproduce the issue 90% of the time. So I suggest to remove the gold ticket / showstopper label but keep the ticket open to continue investigating. Locking the file scanner and/or file operations is a general thing that needs to be looked into.
I think it is indeed a server issue now. I can confirm that the client 1.7.0-beta4 works fine (as far as I can test it). Thanks to all for the help. I hope this is also useful to keep improving both client and server.
@ogoffart: FYI, I also checked the same scenario in the opposite direction, i.e. while files are downloaded from remote folder A to local folder A: renaming local folder A to B (on MacOSX) works very well! Everything ends up in B, and the remote folder also gets renamed to B. Well done, many thanks! 👍
(Note that mirall 1.4 and before relied on the 404 to create the directories)
I hope you do not rely on that anymore since 1.5 - I am giving you back 409 as per the specs...
I raised owncloud/core#11795 to follow up on the server. |
This is for the latest 1.7.0-beta3 nightly on RedHat 6, but I believe it is version- and platform-independent, and possibly server-independent.
I consider this use-case to be very important, especially for the folders shared between multiple users.
What I expected is: all my files uploaded to V2. What I get is a rather messy final state:
So out of 100 files, I was left with 82 files in V, 0 files in V2.
Here is the log of all the actions:
https://gist.github.com/moscicki/5e4b4c8bbfd66fd92958#file-oc-remote-folder-rename-sync-mess
Here is the log of all the actions in case of initially empty folder E on the server (renamed to E2):
https://gist.github.com/moscicki/fc148a5dc6fa3cef9fab#file-oc-remote-folder-rename-mess-empty-remote-folder
NOTES:
I consider this scenario very important if more than one user syncs the same directory at the same time (sharing). However, note that all actions are performed by only one sync client, on a normal oc server folder (not shared). I have a second client of the same user connected to this repo and passively "watching". The second sync client seems to be completely irrelevant here.