Avoid redownload by client of locally existing files (eg copied by rsync) #1383

Open
rainer042 opened this issue Aug 20, 2019 · 49 comments
Labels: confirmed bug · enhancement · feature: 🔄 sync engine · Performance 🚀

Comments

@rainer042

Expected behaviour

Files that have been copied via rsync into a client's local Nextcloud folder should not be re-downloaded by the Nextcloud client.

Actual behaviour

I copied a test file into a subfolder of the Nextcloud client's local folder that I had previously disabled for syncing in the client. Then I re-enabled syncing for this subfolder, and the file was downloaded again from the server instead of using the identical local copy.

Steps to reproduce

  1. Take a file from your Nextcloud server and copy it to a client outside of Nextcloud, placing it in the correct directory inside the local Nextcloud folder
  2. Watch what the Nextcloud client does when you enable syncing for this directory
  3. The client re-downloads the file, overwriting the identical existing one (the one that was copied "manually", e.g. via rsync)

Client configuration

Client version: 2.5.1

Operating system:
Linux, openSUSE Leap 15.1
OS language: German

Qt version used by client package (Linux only, see also Settings dialog):
Built from Git revision b37cbe using Qt 5.9.7, OpenSSL 1.1.0i-fips 14 Aug 2018
Client package (From Nextcloud or distro) (Linux only):
From distro
Installation path of client:
/usr/bin/nextcloud

Server configuration

Nextcloud version: 16.0.4

Storage backend (external storage):
Disk

Logs

I found issue #3422 from 2017, where this problem was already discussed. However, I have not found a solution so far, if there is any.

Personally what I want to do is what I described here:

avoid-complete-nextcloud-resync-to-client-if-data-are-already-on-the-client-via-rsync

Thanks in advance
Rainer


@realkot

realkot commented Oct 7, 2019

I have the same issue... the client re-downloads already existing files...

@rainer042
Author

I would still gladly welcome a new feature to add existing data into the Nextcloud sync process. For my use case I eventually chose the indirect path of removing my Nextcloud account and adding it again, telling Nextcloud to keep existing files. This works, but it's definitely a little cumbersome.

@FriendFX

FriendFX commented Jan 14, 2020

For my use case I eventually chose the indirect path of removing my Nextcloud account and adding it again, telling Nextcloud to keep existing files. This works, but it's definitely a little cumbersome.

@rainer042 this is interesting. Do you know whether those existing files will be synced again if they're changed later on?
I have a slightly different use case: I am installing a different OS on my client machine and plan to manually copy the (previously up-to-date) Nextcloud folder back from a backup once done, to avoid having to slowly re-download everything over the internet.
Another use case would be taking a USB flash drive ("USB stick") with all the Nextcloud files from one client machine to another before installing the Nextcloud client there, to avoid having to transfer everything from scratch.

@rainer042
Author

@FriendFX I think if you install your new OS, then create a new Nextcloud account connection and tell Nextcloud that its data directory is exactly the directory holding the copy of your Nextcloud root folder backup, and also tell it to keep existing data, this should work without any problems. In that case ideally no file should be synced initially, and if you change any of the files on the client or the server side afterwards, a sync is triggered by your Nextcloud client. For me at least this works perfectly.
The USB copy method would probably also work, because it does not matter how the files are copied to the new client, as long as the copy is one-to-one and you create a new Nextcloud account afterwards as described above. In this USB scenario you only have to take care that, for example, the filenames are kept exactly intact. If you copy your files directly onto the USB stick (not inside a tar or zip archive) and the stick uses FAT as its filesystem, the case of filenames like "ReAdMe" and "readme" might change, which could then lead to trouble when you start your first sync.
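
As an illustration of the FAT case issue mentioned above, a tree can be checked for names that differ only in case before copying it to a FAT-formatted stick. A minimal sketch (the folder path is just an example):

```python
import os
from collections import defaultdict

def find_case_collisions(root):
    """Report entries in the same directory whose names differ only in case."""
    for dirpath, dirnames, filenames in os.walk(root):
        seen = defaultdict(list)
        for name in dirnames + filenames:
            seen[name.lower()].append(name)
        for variants in seen.values():
            if len(variants) > 1:
                print(f"{dirpath}: {variants} would collide on a case-insensitive filesystem")

find_case_collisions(os.path.expanduser("~/Nextcloud"))  # example path
```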

@haraldkoch

I recently updated the desktop client to 3.0.1, and it disabled sync of three of my folders due to their size. They were being synced before I upgraded the client (which is a separate bug). When I re-enabled sync of those three folders, the Nextcloud client re-downloaded all 8 GB of data that was already on my local disk, logging "File has changed since discovery" messages. (Bandwidth isn't free, by the way.) So this is a sync defect that isn't only triggered by adding files with external software ...

@6pac

6pac commented Nov 11, 2020

+1, I have 2 TB of local data to sync up to the server folder, which is 99% identical. Re-downloading it all is exactly what I'm trying to avoid. I think I'm going to switch to Syncthing, which has a number of other advantages.

@PVince81
Member

checksum support ticket here: nextcloud/server#11138

that would make it possible for the client to detect that the files are identical and skip syncing

now, without this, maybe an expert option somewhere to tell the sync client to assume that all files in a folder are identical and just retrieve the etag/metadata and update it locally

@PVince81
Member

PVince81 commented Dec 12, 2020

I'm also struggling with this. Rsync tells me everything is identical, even timestamps, but the client still tries to re-download everything.
I am wondering whether it would be possible to manually insert all entries into the sync DB with etags retrieved from the backend (with an external script, as a one-shot operation).
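
For what it's worth, the "retrieve the etags from the backend" part of such a one-shot script is straightforward over WebDAV. Below is a minimal sketch that only fetches and prints path/etag pairs for one folder level (server URL, username and app password are placeholders); actually writing them into the client's sync journal is deliberately left out, since the journal schema is not a public, stable interface.

```python
# Sketch: list path + etag for files in a remote folder via WebDAV PROPFIND.
# BASE and AUTH are placeholders; Depth is 1, so subfolders would need recursion.
import xml.etree.ElementTree as ET
import requests

BASE = "https://cloud.example.com/remote.php/dav/files/USERNAME/"
AUTH = ("USERNAME", "APP_PASSWORD")

BODY = """<?xml version="1.0"?>
<d:propfind xmlns:d="DAV:">
  <d:prop><d:getetag/><d:getlastmodified/></d:prop>
</d:propfind>"""

resp = requests.request("PROPFIND", BASE, data=BODY, auth=AUTH,
                        headers={"Depth": "1", "Content-Type": "application/xml"})
resp.raise_for_status()

ns = {"d": "DAV:"}
for item in ET.fromstring(resp.content).findall("d:response", ns):
    href = item.find("d:href", ns).text
    etag = item.findtext("d:propstat/d:prop/d:getetag", namespaces=ns)
    print(href, etag)
```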

@6pac

6pac commented Dec 13, 2020

Yes, it looks like it's not the file dates/checksums, it's whether they are in the local DB. There doesn't appear to be any officially supported way of getting them there. You could possibly do that insert, but I suspect that by the time you do that, you could almost have created a PR ;-)

Anyway, I don't have time for that, so I am using Syncthing, which seems much better suited for a one-way sync and appears to be much more stable and mature.

@PVince81
Member

I actually managed to make it recognize existing files by deleting the local sync folder config while keeping the files. Then I added the sync folder back and pointed the config at the existing data. It seems to have recognized existing files by mtime/size, as it didn't re-sync all of them. So at least that is a possible workaround.

@er-vin
Member

er-vin commented Dec 14, 2020

I'm also struggling with this. Rsync tells me everything is identical, even timestamps, but the client still tries to re-download everything.

For the record, it checks mtime, etag and inode. So if the inode changes, it will assume something changed locally.
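
That explains why an rsync'd copy can look "changed" even when the bytes are identical: the copy is a new file on disk, so it has a new inode. A tiny local illustration (the paths are hypothetical):

```python
import os

a = os.stat("/backup/Nextcloud/report.pdf")                 # original file (example path)
b = os.stat(os.path.expanduser("~/Nextcloud/report.pdf"))   # rsync'd copy (example path)

# rsync -a preserves size and mtime, but the copy gets a new inode,
# so a (mtime, inode)-based check no longer considers it the same file.
print("size equal: ", a.st_size == b.st_size)
print("mtime equal:", int(a.st_mtime) == int(b.st_mtime))
print("inode equal:", a.st_ino == b.st_ino)   # typically False after a copy
```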

@NicolasGoeddel

The inode is a bad metric for that. I am experiencing the same issue right now after I rsync'd my files from my old PC to my new one. Now the Nextcloud client (v3.1.3) re-downloads all files again, although they are 100% identical. Well, at least I only have 32 GB of data. But it's still not necessary.

@mcakircali

Welcome to another absurd issue. Re-invent the wheel, but as a square; inode + assumption = genius. Prompt the user if you are not sure. This issue still exists after almost 2 years.

@6pac

6pac commented Apr 20, 2021

Another plug for Syncthing. I have been using it for nearly 6 months now and it's been fantastic. You can control it (e.g. timed pause/resume) from the command line via REST, you can set up one-way or two-way shares, and you only need to do port redirection at one end.

@Babyforce

This problem also happens when I add stuff to a synced folder on Windows and then reboot into Linux. The client on Linux re-downloads what I uploaded from Windows. This is extremely annoying, as my internet also isn't very fast. I tried deleting the database created by the app after exiting it, but it would still re-download after checking the files. I'm becoming tired of having to deal with this client. I heard some scary things about Syncthing, but I think I'll try it instead anyway. There is no way I'll keep re-downloading everything after each reboot.

@gpillay

gpillay commented May 9, 2021

Wow. I just came across this issue when looking to avoid unnecessarily downloading data that is already on a new Windows client. This is a major flaw and will likely force me to look at Syncthing to avoid this issue.

@Babyforce

Wow. I just came across this issue when looking to avoid unnecessarily downloading data that is already on a new Windows client. This is a major flaw and will likely force me to look at Syncthing to avoid this issue.

I started using Syncthing instead and it works much better than the Nextcloud client does. I recommend it.

@Bubsbilby

I started using Syncthing instead and it works much better than the Nextcloud client does. I recommend it.

How did you go about doing this, with issues like permissions and scanning?

@Babyforce

I started using Syncthing instead and it works much better than the Nextcloud client does. I recommend it.

How did you go about doing this, with issues like permissions and scanning?

I just use the SetGID (2xxx) option along with a chmod of 775 on the main folder, and I also changed the umask of the syncthing user to 775. You could do the same for the webserver user (www-data in my case), but I'm not a fan of that, so the only way I found is setting up a cron task that runs chmod 775 on the whole folder and its subfolders every once in a while...
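
A rough Python equivalent of that periodic chmod job, in case it helps anyone; the path is an example, and 0o2775 simply keeps the setgid bit on directories:

```python
# Re-apply group-writable permissions under a shared folder (example path).
import os

ROOT = "/srv/syncthing/shared"

for dirpath, _dirnames, filenames in os.walk(ROOT):
    os.chmod(dirpath, 0o2775)                         # rwxrwsr-x on directories (setgid kept)
    for name in filenames:
        os.chmod(os.path.join(dirpath, name), 0o775)  # rwxrwxr-x on files
```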

@JSchimmelpfennig

I'm having the same problem. My Nextcloud is 1 TB in size and is my offsite backup. I always have a local backup of my laptop (Veeam for Windows) and copied all files of my Nextcloud user to my new notebook. But the Nextcloud Desktop Client for Windows (version 3.2.2) wants to re-download everything, although everything is already there.

What is the status here? How can you transfer all user files in the Nextcloud user directory from one computer to another without re-downloading?

Thank you!

@FlexW

FlexW commented Jun 21, 2021

@unpairedliabilitylibrarian At the moment there is no other way than to download everything again.

@github-actions

This bug report did not receive an update in the last 4 weeks. Please take a look again and update the issue with new details, otherwise the issue will be automatically closed in 2 weeks. Thank you!

@Milokita
Contributor

Milokita commented Aug 13, 2021

According to my test, the latest owncloud desktop client has implemented this feature and it works with nextcloud server. @gpillay
So maybe it's time to catch up?

@juliusvonkohout

According to my test, the latest owncloud desktop client has implemented this feature and it works with nextcloud server. @gpillay
So maybe it's time to catch up?

Maybe, if you copy the whole folder including the hidden Nextcloud synchronization state files and use the same Nextcloud instance. I had to switch to another Nextcloud instance and it did not work.

@Milokita
Contributor

According to my test, the latest owncloud desktop client has implemented this feature and it works with nextcloud server. @gpillay
So maybe it's time to catch up?

Maybe if you copy the whole folder including the hidden nextcloud synchronization state files and use the same nextcloud instance. I had to switch to another nextcloud instance and it did not work.

In my test, I removed the sync connection in the NC client after it had synced with the server, and then added the sync connection using OC's client. OC would skip the files while NC does not.
The DB files are removed when the connection is removed, so the state files did not exist.
The best solution would be to fork the code from OC.

@juliusvonkohout

Ah sorry, I missed that you are using the ownCloud client too.

@mrmatteastwood

Hello Nextcloud team, I would really appreciate this feature since I am planning to sync over 500 GB of data between my desktop and laptop. All the data is already present on both computers. I'd very, very much prefer not to download all 500 GB to my laptop again after setting up NC on my desktop, when the local data is already identical.

Is anybody working on this issue? Is there a fix planned?

@mrmatteastwood

mrmatteastwood commented Aug 19, 2021

In the meantime, I have a question... Would a workaround be to copy the .sync_########.db files from PC A (already synced) to PC B (to be connected to NC, but already has all the data locally)? Are there any dangers associated with this?

EDIT: Tried. This seems to work.

WORKAROUND

Scenario: you have fully synced a few folders on computer A to a fresh NC instance (in my case, each of those folders had its own sync connection, screenshot: https://paste.pics/DL1EM). Now you have all the same data already locally on computer B and you want to connect it to the same NC server without it re-downloading all that data.

Steps (this is how I did it, from the beginning, with a brand new NC server):

  1. Set up NC client on computer A. Skip folder connections during setup.
  2. Open Settings in NC client and add new folder connection(s) for the folder(s) you want to sync (e.g. like this: https://paste.pics/DL1EM)
  3. Wait for all files to be synced to server
  4. Manually copy local folder from computer A to computer B
  5. Go to local folder on computer A and copy .sync_############.db, e.g. to a USB key
  6. Install NC client on computer B as in steps 1-2 above
  7. Wait for it to compare files and start syncing
  8. Pause sync ("..." icon next to folder connection) and close client
  9. Go to local folder on computer B and move .sync_############.db somewhere else for safekeeping/in case of problems
  10. Rename the .sync_############.db from computer A to the name of computer B's .sync_############.db and place it in computer B's folder
  11. Restart NC client and resume sync
  12. After a moment, it should find all files are identical and be happy.

Notes:
Step 10: in my case, the .sync_############.db files on computer A and B actually already had the same file name. YMMV. (A small scripted version of steps 5-10 is sketched at the end of this comment.)

Questions:
To anybody: This looks safe and seems to work perfectly for me. The folder I tried this with weighed 19 GB, and after overwriting the db on computer B with the one from computer A, it only synced a few files that I knew were different between both PCs (in my Thunderbird account). Since then, syncing new files and changed files has been working well.

BUT: again, am I missing any hidden dangers here?
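
For reference, here is a small scripted version of steps 5-10 above. It assumes the client on computer B is closed, that the journal name is identical on both machines (as noted above), and that the paths are examples only:

```python
# Sketch of steps 5-10: back up computer B's sync journal and drop in computer A's.
import glob
import shutil
from pathlib import Path

local_folder = Path.home() / "Nextcloud"                   # sync folder on computer B (example)
journal_from_a = Path("/mnt/usb/.sync_2147848929c8.db")    # journal copied from computer A (example)

# Step 9: move B's own journal (and any -wal/-shm companions) aside for safekeeping.
backup_dir = Path.home() / "nc-journal-backup"
backup_dir.mkdir(exist_ok=True)
for db in glob.glob(str(local_folder / ".sync_*.db*")):
    shutil.move(db, str(backup_dir / Path(db).name))

# Step 10: place A's journal in B's folder under the same name.
shutil.copy2(journal_from_a, local_folder / journal_from_a.name)
```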

@PVince81
Member

There might be an easier way; this is what worked for me a few months ago: #1383 (comment)

Basically, on the target machine start with an empty config, then add an existing folder that points at the remote NC instance and at the local sync folder that already has the data in it.
Make sure the local data was copied with "rsync -rav" so that the timestamps are the same.

I believe there is some special logic when initially adding a sync folder that makes the effort to recognize existing data. That logic does not seem to be active when doing a regular sync with an existing config, so the key is to re-add the sync folder to the config.
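
One way to double-check that the copy really preserved sizes and timestamps before pointing the client at it is to diff the two trees first. A minimal sketch with example paths:

```python
# Compare size and mtime of every file in two directory trees (example paths).
import os

def tree_stats(root):
    """Map relative path -> (size, integer mtime) for every file under root."""
    stats = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            stats[os.path.relpath(full, root)] = (st.st_size, int(st.st_mtime))
    return stats

old = tree_stats("/backup/Nextcloud")                  # source of the copy
new = tree_stats(os.path.expanduser("~/Nextcloud"))    # freshly copied sync folder

for rel in sorted(old.keys() & new.keys()):
    if old[rel] != new[rel]:
        print("differs:", rel, old[rel], new[rel])
for rel in sorted(old.keys() ^ new.keys()):
    print("only on one side:", rel)
```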

@mrmatteastwood

There might be an easier way; this is what worked for me a few months ago: #1383 (comment)

Basically, on the target machine start with an empty config, then add an existing folder that points at the remote NC instance and at the local sync folder that already has the data in it.
Make sure the local data was copied with "rsync -rav" so that the timestamps are the same.

I believe there is some special logic when initially adding a sync folder that makes the effort to recognize existing data. That logic does not seem to be active when doing a regular sync with an existing config, so the key is to re-add the sync folder to the config.

Interesting! So what you're proposing is this?

  1. On computer B, do a first-time install of the NC client, skip folder config
  2. Manually add folder connection through settings, pick local folder that already has all the same data as the target folder on the server
  3. When it starts syncing, pause sync and close NC client
  4. Delete these files: https://paste.pics/DL2KU
  5. Re-open NC client

Am I understanding that correctly?

@PVince81
Member

No, as far as I remember it was even easier:

  1. On computer B, do a first-time install of the NC client, skip folder config
  2. Manually add folder connection through settings, pick local folder that already has all the same data as the target folder on the server (rsynced so they have the same timestamps)
  3. Tell it to keep the data and wait
  4. After a while it will have recognized that all the files are the same and will end syncing

In my experience back then, with about 1 TB of files, it still synced about 20 GB of data for whatever reason, but left everything else alone.

I don't think you need to delete the dot files, because you did not have any config anyway.
From my understanding it's the "keep the data" option that will do the expected magic.

@mrmatteastwood

mrmatteastwood commented Aug 19, 2021

Ahhh, gotcha. I think step 3 is only possible, though, when setting up the folders during the initial setup, not after the client has been configured once. If I open Settings and add a new folder connection, it doesn't ask me whether I want to keep local data (I'm on Linux, might be different on other OS clients).

My setup is kinda specific. I only want to sync three folders in the root of my ~/home folder. I make 3 individual folder sync connections for them because I've burned my fingers before by selecting the whole ~/home folder to sync and de-selecting the folders I didn't want. One wrong checkmark and you might end up deleting local stuff.

So it's a catch-22. If I do the folder connections during initial setup, I can tell it to keep local data. But I can only set up one folder sync. If I want to add more folder syncs, I need to do it later, where I don't get the option to keep data...

@mgallien
Collaborator

No, as far as I remember it was even easier:

1. On computer B, do a first-time install of the NC client, skip folder config

2. Manually add folder connection through settings, pick local folder that already has all the same data as the target folder on the server (rsynced so they have the same timestamps)

3. Tell it to keep the data and wait

4. After a while it will have recognized that all the files are the same and will end syncing

In my experience back then, with about 1 TB of files, it still synced about 20 GB of data for whatever reason, but left everything else alone.

I don't think you need to delete the dot files, because you did not have any config anyway.
From my understanding it's the "keep the data" option that will do the expected magic.

So if you select "Erase local folder and start a clean sync", the local sync folder will be emptied before trying to sync; otherwise the client will try to sync to the best of its abilities (meaning that, if I understand correctly, files that are identical will not be downloaded again).
This might require that the files have been uploaded by a client that also pushes a content hash for the uploaded files.
[screenshot of the folder setup dialog]

@PVince81
Member

pushing a content hash for the uploaded files

How to verify this? I guess by checking in oc_filecache to see if there's a hash in the checksum column?
I suppose this could explain why people might be seeing different results, but I'm not sure everyone tried the procedure I mentioned above.
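
Something along these lines could answer that question. This is only a sketch, assuming a SQLite-backed instance at an example path and the default "oc_" table prefix; on MySQL/PostgreSQL the same query would need the corresponding driver:

```python
# Peek at the checksum column in oc_filecache (example path, SQLite backend assumed).
import sqlite3

conn = sqlite3.connect("/var/www/nextcloud/data/owncloud.db")

rows = conn.execute(
    "SELECT path, checksum FROM oc_filecache "
    "WHERE checksum IS NOT NULL AND checksum != '' LIMIT 10"
).fetchall()
for path, checksum in rows:
    print(path, "->", checksum)   # non-empty when the uploading client pushed a content hash

total = conn.execute(
    "SELECT COUNT(*) FROM oc_filecache WHERE checksum IS NOT NULL AND checksum != ''"
).fetchone()[0]
print("files with a stored checksum:", total)
```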

@howie-j

howie-j commented Aug 20, 2021

pushing a content hash for the uploaded files

How to verify this? I guess by checking in oc_filecache to see if there's a hash in the checksum column?
I suppose this could explain why people might be seeing different results, but I'm not sure everyone tried the procedure I mentioned above.

Which OS are you using? I do not get this option when adding folder sync on Fedora.
[screenshot of the add folder sync dialog]

@PVince81
Member

pushing a content hash for the uploaded files

How to verify this? I guess by checking in oc_filecache to see if there's a hash in the checksum column?
I suppose this could explain why people might be seeing different results, but I'm not sure everyone tried the procedure I mentioned above.

Which OS are you using? I do not get this option when adding folder sync on Fedora.
[screenshot of the add folder sync dialog]

in my case it was openSUSE Leap 15.2 back then.

I think you'll get the option after you have selected the folder, because the client will detect that the dotfile databases are present there. (Don't delete them!)

@ArtificialImagination

ArtificialImagination commented Apr 2, 2022

We have a slightly different situation but the exact same problem. We are using GitHub Desktop + LFS to sync our Unreal project across multiple locations. To save on bandwidth we copy the project over to a new machine and then link the repo on that machine. However, when we do this, it seems to download every single file again, as described. Our project is usually 50-100 GB and sometimes needs to be shared across 5-10 machines, so the bandwidth stacks up pretty fast.

It still seems like there isn't actually a solution to this? Any help would be greatly appreciated.

@mrmatteastwood

We have a slightly different situation but the exact same problem. We are using GitHub Desktop + LFS to sync our Unreal project across multiple locations. To save on bandwidth we copy the project over to a new machine and then link the repo on that machine. However, when we do this, it seems to download every single file again, as described. Our project is usually 50-100 GB and sometimes needs to be shared across 5-10 machines, so the bandwidth stacks up pretty fast.

It still seems like there isn't actually a solution to this? Any help would be greatly appreciated.

My solution from last August has been working for me since then:
#1383 (comment)

@PVince81
Member

I had an OS partition crash last week, so I had to reinstall the OS and also lost the Nextcloud client config along the way.
The local data was still present.

When I reconfigured the Nextcloud desktop client from scratch and pointed it at the existing data (1 TB+), it did not redownload everything. The way I did it is to skip the wizard and then configure the sync folders manually later on.

I did see some downloads, but I believe that's because I forgot to exclude some folders from selective sync that I had excluded before the crash.

@BurningTheSky

My solution from last August has been working for me since then: #1383 (comment)

This didn't work for me; it scans for changes, then says it needs to download all the files again.

@Torashin

Torashin commented May 6, 2022

I'm also having this issue, except it appeared out of absolutely nowhere: 3.4 TB all synced up, then the NC client randomly decides it needs to re-download it all over again.
That obviously isn't happening over the internet, and this is far from the first time I've had issues with the NC client, so switching to an alternative sync utility it is. Since a number of you have had success with Syncthing, I'll give that a go!

@Torashin

Torashin commented May 9, 2022

A quick follow-up... I wasn't a fan of Syncthing; it also tried to re-download hundreds of GB of identical files. Resilio Sync, however, is working very well so far. My only gripe is that it's quite limited in configuration options, but hopefully the default settings you're stuck with continue to work OK.

@mrmatteastwood

mrmatteastwood commented Nov 9, 2022

Still working workaround, verified today with NC 3.6.1

Hey folks, I needed to do this again today as I moved to a different managed NC provider (Hetzner, from IONOS). So I used the NC client on my desktop PC (computer A) to upload all my stuff, about 500 GB, to the new server. Now I'm hooking up the NC client on my laptop PC (computer B), where all the data is already present.

With this point of departure, the following steps still work reliably for me:

  1. Get .sync_############.db (e.g., .sync_2147848929c8.db) from synced local folder on computer A
  2. Start NC client on computer B, skip folder sync setup during initial config
  3. Open Settings screen and click "Add Folder Sync Connection" button
  4. Hook up your local folder to the remote folder
  5. Wait until NC starts downloading files (not checking for changes, actually downloading), then immediately close the client
  6. In your local folder, delete the dot files that the client just created (e.g.: https://paste.pics/JPK8B)
  7. Put .sync_############.db from computer A into that folder (in my experience, the file name is identical)
  8. Restart NC client
  9. Rejoice

It'll re-sync the few megabytes it downloaded before you closed it in step 5, then you'll be fine.

EDIT: video: https://youtu.be/nS8XpbTS928

@codeling

codeling commented Sep 3, 2024

Since comparing checksums is now implemented for specific cases, couldn't this be extended to cover this use case too? When two files have the exact same size, even if their mtime differs, do a hash comparison and only re-sync if the hashes differ. This would effectively mean that inode and mtime data are only used for change detection, not for difference detection.
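
To make the proposed rule concrete, here is a small local-only sketch of the decision; the remote_* parameters stand in for whatever metadata and checksum the server would provide, so they are assumptions rather than the client's actual API:

```python
# Proposed rule: same size but different mtime -> compare content hashes instead of re-downloading.
import hashlib
import os

def sha1_of(path, chunk_size=1 << 20):
    """SHA-1 of a file, read in chunks to keep memory use flat."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def needs_download(local_path, remote_size, remote_mtime, remote_sha1):
    st = os.stat(local_path)
    if st.st_size != remote_size:
        return True                                  # sizes differ: content definitely changed
    if int(st.st_mtime) == remote_mtime:
        return False                                 # size and mtime match: treat as identical
    return sha1_of(local_path) != remote_sha1        # same size, different mtime: let the hash decide
```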

@github3845

This is especially important with external storage, because the client downloads the entire external storage again after I uncheck and re-check it. With many S3 providers, egress is not free. Then again, I guess it would be financial suicide to enable external storage with Nextcloud without unlimited egress, since who knows how many "transactions" it will produce that may be billed, despite no real data transfer occurring.
