Allow remote-delta on sync files #16162
Hi,
I want to get this improvement fixed as soon as possible:
owncloud/client#179
It relates to the remote-delta implementation on the client, but I suppose it must be supported on both sides. I saw that the latest version of csync includes a module for ownCloud, but it differs from the current implementation of the client on GitHub.
In fact, I saw that it uses a special module called httpbf that is not in the normal csync library. It seems to me that it sends the whole file as an HTTP PUT?
So my question is: if you were going to implement rsync, how would you implement it?
I was thinking about cloning a module in csync and making it transfer a remote delta (presumably using librsync) instead of sending the whole file in chunks.
What do you think? Any suggestions?
Using httpbf I could detect what has changed and send only the changed blocks, but that would mean extra work for nothing: librsync already implements this kind of processing, and we are not going to do it twice.
Best regards,

Comments
Hi @gadLinux, thank you for trying to help on this issue. The client is no longer using httpbf, and is using only very few parts of csync. You will need to contribute to the client repository (and to the core). You will find the code that uploads files in … Good luck :-)
Some brainstorming: there needs to be a list of blocks with their corresponding checksums stored on the server for every file, accessible through a special route in the server API. The client will need to fetch the list once it detects that an existing file has changed and is to be uploaded. Once the blocklist has arrived, the client recalculates the blocklist on the new, changed file. After that, the client can compare the lists and identify the blocks whose checksum has changed; these are the blocks that need to be uploaded. To upload only parts of files, maybe the HTTP PATCH command (RFC 5789) helps. The server needs to be able to handle this command and reassemble the whole file. For files that appear new on the client, the client will have to calculate the blocklist and send it along with the initial upload, to avoid having the server calculate the list. Also, for each uploaded block, the client will send the new checksum along. The server will either recalculate or invalidate the blocklist for files that were changed on third-party storage. The client needs to be able to deal with the fact that the server cannot provide a blocklist for a file, and will transparently fall back to uploading the entire file. The kind of checksum is configurable, and will be adjustable via a server configuration option later.
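To make the comparison step concrete, here is a minimal C++ sketch of how a client could diff the server's blocklist against a freshly computed local one. The `BlockInfo` layout and the fixed-block-size assumption are illustrative, not the actual protocol:

```cpp
// Sketch only: assumes a fixed block size, so the two lists align by index.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct BlockInfo {
    uint64_t offset;      // byte offset of the block in the file
    std::string checksum; // checksum as delivered by the server API
};

// Returns the offsets of blocks whose checksum differs locally,
// i.e. the blocks that would have to be uploaded.
std::vector<uint64_t> changedBlocks(const std::vector<BlockInfo> &serverList,
                                    const std::vector<BlockInfo> &localList)
{
    std::vector<uint64_t> changed;
    const size_t n = std::min(serverList.size(), localList.size());
    for (size_t i = 0; i < n; ++i) {
        if (serverList[i].checksum != localList[i].checksum)
            changed.push_back(localList[i].offset);
    }
    // Blocks past the end of the server's list are new and must be sent too.
    for (size_t i = n; i < localList.size(); ++i)
        changed.push_back(localList[i].offset);
    return changed;
}
```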
A good starting point for an rsync-like approach over HTTP is zsync, aka client-side rsync. I don't think using librsync is feasible, as we want to stick to HTTP(S) as the transport protocol.
Hi Olivier, thank you for the feedback. I will see which parts of csync are in use. I suppose that the first way is easier and better for now. Let me check the code. What's the most recent branch of development in both projects? I found … Best regards.
On 22/05/15 at 12:20, Olivier Goffart wrote:
@gadLinux Please stick with HTTP. Other protocols are not usable in certain scenarios (ports 80 and 443 are always open, others often are not). On top of that, another advantage of the HTTP-based zsync approach is that the server can remain mostly passive, which shifts a lot of load to the client. This improves scalability.
Hi, so it's important to implement the solution over HTTP. Okay, I hope I can … Did you think about using Thrift? Best regards.
On 25/05/15 at 11:40, Daniel Molkentin wrote:
It does not have to be HTTP, but it would be better (the configuration of an ownCloud server has to remain simple). Anyway, if it doesn't work we can still fall back to the normal method. And as I said, we are slowly moving away from csync, and new code on the client should be written in C++ in the libsync directory. A BitTorrent-like protocol would be something different, and opens its own can of worms (how do you do authentication or security if you are peer-to-peer?)
Do not try to solve two problems at once. The basic idea behind ownCloud is that the server stays in control of the data, so p2p would involve getting a server authorization first. This requires OAuth, which is scheduled for one of the next ownCloud server versions. The p2p implementation would then be purely client-side and not involve the server at all, except for the authorization request.
> and the second one is to use something like BitTorrent, so it can transfer …
You are right, I agree. Let me take a look at the partial file sync problem.
@gadLinux perhaps it makes sense to have a Skype call about this, to avoid a lot of work going in a direction that won't work ;-)
OK, so I have been looking into this and reading up on zsync (and by extension rsync), and I think this is something that can be done. Luckily the zsync library is available, so I'm modifying that to get a POC working, since at least the upload use case is not what it was designed for. I'll hopefully soon put that work on GitHub, so stay tuned. But let me (quickly) write down here how I think this should work.
Assumptions: …
Setup
Server: The server has only two tasks: store the .zsync files, and generate the new file. This is to keep things scalable and not have the server do all the checksumming. We need to come up with a protocol to tell the server how to assemble the new file; I'll be thinking about that. We do require extra space for this to work, since you need to copy the original file and only copy it back once everything is done. Locking is probably also a good idea.
Client: In the normal use case of zsync, the client figures out which parts it needs from the server. Now we need to find the parts in common with the server, which parts need to go, and which parts need to be added. This requires some new code, but the info should all be available. Since the server is not doing much here, all the computational load is shifted to the client (the core trick that makes this cheap is sketched below). Also, in the current setup we trust the clients to generate the zsync files (doing this in PHP would be hell) and send them. From my point of view this is fine, since any client can just do a PUT request now anyway. I hope to have a POC (in C, so easily portable to the client) soonish. CC: @danimo
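For reference, the core of the rsync/zsync matching is a weak rolling checksum that can be slid over the changed file one byte at a time in O(1) per shift. A minimal Adler-32-flavoured sketch follows; the struct is illustrative, not zsync's actual API, and real zsync confirms weak hits with a strong checksum (MD4) before treating a block as a match:

```cpp
#include <cstddef>
#include <cstdint>

struct RollingSum {
    uint32_t a = 0, b = 0;
    size_t len = 0;

    // Checksum an initial window of n bytes.
    void init(const unsigned char *p, size_t n) {
        a = b = 0; len = n;
        for (size_t i = 0; i < n; ++i) {
            a += p[i];
            b += a;
        }
    }
    // Slide the window one byte: drop `out`, append `in`.
    void roll(unsigned char out, unsigned char in) {
        a += in - out;
        b += a - uint32_t(len) * out;
    }
    uint32_t digest() const { return (b << 16) | (a & 0xffff); }
};

int main() {
    unsigned char data[] = "hello rolling checksum";
    RollingSum s;
    s.init(data, 8);          // checksum of bytes [0, 8)
    s.roll(data[0], data[8]); // now equals the checksum of bytes [1, 9)
    return s.digest() == 0;   // placeholder use of the result
}
```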
I think it should work this way both for the upload and the download, because we already know which file is the "newer" one. The client calculates the parts which have to be transferred, and the direction then depends on the sync status. For the upload there should be something like an HTTP PATCH request.
Hi, looks great! In the end I didn't have time to dig in deep, but if you fail … Best regards.
On 30/06/15 at 09:37, Roeland Douma wrote:
Just curious: does this play into this, and might it be interesting to make sure it can work with it? https://dragotin.wordpress.com/2015/06/22/owncloud-chunking-ng/
@dragotin and I have been talking about delta-sync as well; there are two more blog posts to be expected ...
@powerpaul17 Sure, it should work like this both ways, but the download case is easier: the zsync file tells you what has to be done, you can download using HTTP range requests, and all is well. Some form of HTTP PATCH would indeed be best for the upload case. @jospoortvliet This should be independent of that, although it could (and should) still make use of chunking if there are a lot of changes. @DeepDiver1975 Awesome, looking forward to that. So my code now "works"... at least on the limited set of tests I've thrown at it. It is inefficient, but works. And hopefully it will be documented well enough soonish to be available for discussion.
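The download side needs nothing exotic, since byte-range requests are standard HTTP (RFC 7233). A minimal libcurl sketch, with a placeholder URL, range, and credentials (the PATCH-style upload would still need a server-defined body format, so it is not shown):

```cpp
// Sketch: fetch only bytes 4096..8191 of a file via a standard HTTP
// range request. URL and credentials are placeholders.
#include <curl/curl.h>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;
    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://example.org/remote.php/dav/files/admin/big.bin");
    curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:secret");
    curl_easy_setopt(curl, CURLOPT_RANGE, "4096-8191"); // only this block
    CURLcode rc = curl_easy_perform(curl); // body is written to stdout
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```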
Hello, I'm really interested in testing your code @rullzer
Sadly no reply yet from the zsync devs (regarding the licence). But here is some code, a very simple POC ownCloud app: https://github.com/rullzer/deltasync If I have some time soonish I'll update it with steps on how to test.
Hi, how do you avoid the deltasync file itself being synchronized? Best regards.
On 30/07/15 at 10:27, Roeland Douma wrote:
Oh, you don't, currently.
@rullzer Very nice. @dragotin @DeepDiver1975 FYI
Wondering if RFC 3253 (the WebDAV versioning extensions, aka DeltaV) is related to this problem.
For large amounts of files I believe that something along the lines of https://en.wikipedia.org/wiki/Merkle_tree might be a good idea. If such a tree structure were used, it would suffice to compare two hashes to know that nothing changed. If a single chunk changed, it could be found in logarithmic time. This kind of structure would increase the number of round-trips, so it might need a bit of tweaking; maybe sending 1k nodes instead of just one would improve it.
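A minimal sketch of that idea, assuming both sides chunk the data identically (so the trees have the same shape) and with `std::hash` standing in for a real cryptographic hash:

```cpp
#include <functional>
#include <string>
#include <vector>

using Hash = size_t;

// Placeholder parent hash; a real implementation would use e.g. SHA-256.
static Hash combine(Hash l, Hash r) {
    return std::hash<std::string>{}(std::to_string(l) + ":" + std::to_string(r));
}

// Build the tree bottom-up: levels[0] = chunk hashes, levels.back() = { root }.
// Assumes a non-empty leaf list; an odd tail node is paired with itself.
std::vector<std::vector<Hash>> buildTree(const std::vector<Hash> &leaves) {
    std::vector<std::vector<Hash>> levels{leaves};
    while (levels.back().size() > 1) {
        const auto &cur = levels.back();
        std::vector<Hash> next;
        for (size_t i = 0; i < cur.size(); i += 2) {
            Hash r = (i + 1 < cur.size()) ? cur[i + 1] : cur[i];
            next.push_back(combine(cur[i], r));
        }
        levels.push_back(std::move(next));
    }
    return levels;
}

// Compare two same-shaped trees: O(1) answers "anything changed?" at the
// root, then descending into the mismatching child finds one changed
// chunk in O(log n). Returns the leaf index, or -1 if nothing changed.
long firstChangedLeaf(const std::vector<std::vector<Hash>> &a,
                      const std::vector<std::vector<Hash>> &b) {
    if (a.back()[0] == b.back()[0]) return -1;
    size_t idx = 0;
    for (size_t lvl = a.size() - 1; lvl > 0; --lvl) {
        size_t left = idx * 2, right = left + 1;
        const auto &la = a[lvl - 1], &lb = b[lvl - 1];
        if (la[left] != lb[left]) idx = left;
        else if (right < la.size()) idx = right;
        else idx = left;
    }
    return long(idx);
}
```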
Hello again. @jospoortvliet I left this because it seemed that @rullzer was already implementing something. I'm still interested in it. What's the status of this issue? @lavrentivs About the Merkle tree: we are using this in one of our platforms, and maybe it can apply here. I don't know if this would work for a large amount of files or just for big files; I will ask the person who implemented it here. Anyway, I saw a lot of improvements in the server and client since my last post, but this seems to be still stalled. Should I go ahead, or do I wait for @rullzer's final solution?
@gadLinux My approach has a licensing issue which I'm trying to figure out. Also, implementing this as-is is not so trivial, since we need proper server-side support to do delta uploads/downloads, which is currently not there. And I prefer to eventually have that in a nice way instead of hacking it in. That being said, if you think you have a nice approach, please go ahead. :)
@rullzer I'm a total newbie on ownCloud-specific stuff, so no, I don't have a nice approach. But when I have something cool to implement there are no barriers... :-) I took a look at your implementation and got this:
./uploadclient README.md.zsync README.md http://owncloud-dev /files/readme admin admin
Started delta sync
Maybe I'm doing something wrong... The licensing issue could be a problem, but I don't understand much about this kind of stuff. Can I ask why zsync? Why not rsync directly, or Unison? I'm still reviewing the architecture, and honestly I thought you already had the solution for it, since it looks nice. But I will tell you if I work on something.
Question: why enable delta-sync for larger files only? Meaning: what is the reason behind this configuration parameter?
@hodyroff It's client-configurable because there is inherent CPU overhead to doing a zsync upload/download, so it might not make sense for small files. Also, currently …
The basic approach is to store zsync metadata files in a folder called `files_zsync/`, which stores them based on fileid. These metadata files can be requested by the client via a new route `dav/files/$user/$path?zsync`; they can also be deleted using the same route. This is implemented using a new `ServerPlugin` called `ZsyncPlugin`. Filesystem hooks are used to mirror any `copy/delete` operation on the base file or containing folders onto the metadata files, to ensure that server-side changes will not produce out-of-sync metadata.
The upload path is implemented by creating a new plugin, `ChunkingPluginZsync`. The chunk file ids are now assumed to be named as the offsets into the original file. Special handling is done when a chunk named `.zsync` is found, which is the generated client-side metadata; its contents are copied to the `files_zsync/` folder. The core reason behind this is to ensure that both the metadata and the file are put in place atomically, as part of the final `MOVE` request. The implementation adds a new class, `AssemblyStreamZsync`, which extends `AssemblyStream` with additional support to fill in the data between chunk offsets from a `backingFile`.
A new `zsync` capability is added to the dav app, which can be checked by the client to know whether delta-sync is supported. A zsync dav property is also returned for files which have metadata on the server. This commit closes owncloud#16162.
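As an illustration of the route described above, a client could fetch or drop a file's metadata roughly like this (the host, `remote.php` prefix, path, and credentials are placeholder assumptions; only the `?zsync` query string comes from the description):

```cpp
#include <curl/curl.h>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;
    curl_easy_setopt(curl, CURLOPT_URL,
        "https://example.org/remote.php/dav/files/admin/docs/big.bin?zsync");
    curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:secret");
    curl_easy_perform(curl); // GET: the zsync metadata body goes to stdout

    // The same route accepts DELETE to drop the stored metadata.
    curl_easy_setopt(curl, CURLOPT_CUSTOMREQUEST, "DELETE");
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```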
Is this finally merged? And the bounty claimed?
The server-side code was merged, at least.
This feature will be in the upcoming 2.6 alpha release.