
changes in the generation of etags #5305

Closed
karlitschek opened this issue Oct 11, 2013 · 14 comments

@karlitschek
Contributor

Currently the ETags are random numbers generated when a new file is created or an existing file changes.

I propose to change it as follows:
The ETag is a hash over the metadata of a file or folder.

For a file, we concatenate the file path, the filename, the mtime, the size, and the file permissions. We then build an MD5 hash of that and use it as the ETag.

For folders, we concatenate all the ETags of the entries (files/folders) in the folder and build an MD5 hash of that as the ETag.

The benefit of that is:

  1. The ETag is reproducible. So if the filesystem cache is lost, it can be rebuilt and you get the exact same ETags. This is very useful for backup and maintenance purposes.
  2. The ETags can also be calculated on the client, which makes it possible to check whether a file has changed on the server without downloading it.

References:
#5264
#523

@dragotin @danimo @icewind1991 @bartv2 @DeepDiver1975

@DeepDiver1975
Member

File permissions can be user-specific, especially when it comes to shared files/folders.

For folders we need a different mechanism: the ETag has to change once its content changes.

@moscicki

If I understand correctly, ACLs are not propagated from the client to the server in the current implementation; I just tested the Unix permission bits between two clients.

In the ETAG context hashing on permissions would also require mapping across ACLs of different client filesystems (extended attributes of various flavours) and the server.

@karlitschek
Contributor Author

The file ACLs are not propagated. But @DeepDiver1975 is right that the share permissions could be different. This is indeed a problem. Another problem is that the path of the file depends on the user (with or without the /Shared folder), and in the future could be totally different. Perhaps it's enough to use only the file name?

@moscicki

@karlitschek: I am not sure I understand why we need to include the path and filename in the hash. I don't have a full overview of the framework, of course, but I have a feeling that normally you use/create ETags in the context of a known path and filename. So this does not change anything for a file anyway, but it may catch you if you move a file to another folder on the server and at some point the client is smart enough to do the move locally too. Do you require ETags to be universally unique within the ownCloud framework for some reason?

I suppose that all clients should be able to calculate the ETag locally to the same value as the one calculated by the server IF the file was in sync but the databases were lost? That means that on every client filesystem there should be enough common metadata stored about the file permissions that the ETag value comes out the same. I wonder what this common subset is.

@etiess

etiess commented Oct 14, 2013

There have already been a lot of exchanges about this topic, starting here: #523 (comment)

In my opinion, they are really interesting to read and understand. I'll just copy/paste a few of them here:

@moscicki:

Would you consider hashing of file metadata (mtime, size) instead of the content?

@zatricky:

I don't see how any performance issues/bandwidth waste would be mitigated by this.

@RandolfCarter

To put it very clearly: The only safe way to tell whether a file has changed (or two files are different) is to consider the file content - e.g. by comparing hash sums.

@tenacioustechie

Identifying changes via hashing of the contents of the file is a proven way to sync, used by numerous other systems (git, hg, etc., as well as other syncing systems I'm not aware of).

@karlitschek

Let me explain why we don't use a hash at the moment.
The ETag is a unique id of a file. From a client or sync-algorithm perspective this is exactly the same as a hash. If the ETag is the same, then it is the same file; if it's different, then it is a different file. Just like a hash. So if this assumption is true, syncing should work in exactly the same way as with a hash.

In the current implementation the ETag is calculated from the metadata of a file, like mtime, name, etc. The reason is performance. ownCloud can be used with petabytes of storage, some of which can be accessed and changed independently of ownCloud. Just look at the external filesystem features as an example: you can mount your huge S3, FTP, CIFS, Dropbox, ... storage into ownCloud. If we wanted to calculate hashes for every file, we would have to download every single file at every sync run to check whether the hash/content of the file is the same. This is obviously not possible. Because of that we only look at the metadata for the ETag.
I really think that this should be enough. I'm still waiting for a real-life example where a file is changed in a directory but still has the same name, size, mtime, etc. This really shouldn't happen.
If there are sync problems, there must be a different reason for them that we have to debug and fix.

@zatricky

Etags are important but cannot resolve the problem of needlessly re-uploading/downloading content which has already been manually/externally synchronised.

@moscicki

I just did an experiment which shows that mtimes ARE propagated between the clients (Linux), albeit in a way which cannot be used for reliable hashing.

@dragotin

I haven't yet thought through the whole conversation but here are some facts:

  • There is no full re-download of data required if the client database is lost. If the client db is missing, the file names and mtimes are compared, and if they are equal, the files are NOT downloaded again.
  • mtimes are propagated as epoch values with one-second precision if the system provides that; there are rumors that Windows file systems only provide two-second precision.
  • Note that even if the system clocks of the involved systems are not equal, the mtime of an individual file is not affected by that, because it is basically just a number.
  • The idea of calculating the checksum of a file's content on the client side, to avoid re-upload if the file was not really changed but just touched, is something to consider. However, I wonder if this is not more an academic than a practical problem; most users probably don't use touch that regularly on files.
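One consequence of the precision point above: comparing epoch mtimes across filesystems has to tolerate the coarsest granularity involved (FAT, for instance, stores timestamps in 2-second steps). A sketch with a hypothetical helper name, not the actual client code:

```python
def mtimes_equal(mtime_a, mtime_b, granularity=2):
    """Treat two epoch mtimes as equal when they differ by less than the
    coarsest filesystem timestamp granularity (assumed 2 s here)."""
    return abs(int(mtime_a) - int(mtime_b)) < granularity

# An mtime of 1376351999 may come back as 1376352000 from a 2-second
# filesystem, so the two should still count as the same file:
assert mtimes_equal(1376351999, 1376352000)      # within tolerance
assert not mtimes_equal(1376351999, 1376352005)  # a real change
```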

@zatricky

My using 'touch' was merely to demonstrate how simply the problem can be reproduced. I've reproduced changing the mtime on the command line without using touch, simply by copying/overwriting files. This covers #5231 as well as this issue's original report: http://sprunge.us/bbGB

@dragotin

calculating the MD5 on the client if the contents really changed is something we can discuss. (Hint: A specific feature request describing exactly that would help).

@zatricky

@etiess: I also see hash calculation as being much less of a burden than re-up/downloading. The client cpu is faster than the network in 99.9% of use-cases (disk shouldn't be compared as it is used for local hashing as well as re-up/download anyway). Network usage is also costly for some. Assuming support from the devs, my proposal going forward is as follows:
1. Implement server-side db/api support for content-hashing
2. Implement optional client-side support for content-hashing and client-side support for hash-based decision-making
3. Implement optional server-side content-hashing

Figuring out #5305 is a good step (and perhaps more urgent) in solving some of the issues mentioned here but won't cover everything unfortunately. This isn't a small problem/fix and so deserves some forethought.

I proposed summarizing our recent discussions and your explanation of the sync algorithm somewhere. Is there already a wiki on it? Or the beginning of an explanation? Should I start one? (In that case, which template could I use?)

@tenacioustechie

Sounds like a great idea to clarify the sync algorithm (maybe an explanation already exists, I'm not sure). I'd certainly appreciate it when trying to establish whether a symptom I'm seeing is actually an issue or not, and what could be causing it.

Whilst I think hashing the content would be the best approach, I understand @karlitschek's point that some existing back ends do not lend themselves to this architecture.

I'm starting to think that hashing the metadata (filename, mtime, and size) on both the client side and the server side might be a good solution, as it predictably produces the same number on a given client. That should minimise the risk of the file re-downloads that some of us seem to be seeing.

I think it could also go some significant way toward helping with owncloud/client#110, if you are careful to copy the files in a way that maintains the mtime (is that possible? I don't know enough about mtime to know).

@zatricky

@tenacioustechie commented:

... it is predictably producing the same number on a given client ...

I'm a little worried about the opposite problem. Can we accidentally (or purposefully) make a hash collision by virtue of the fact that it is easy to predict and reproduce the parameters?

Related: If we move or rename a file, the etag changes. Depending on how we currently handle moving files I'm not sure if that's anything to worry about.

@dragotin
Contributor

We get a bit carried away here.
Let me restate what is really important about our idea of the ETags:

  • An ETag changes if the file changes. That happens if the content of the file is changed. In my opinion, and this has been argued often, it also should change if the metadata of the file, i.e. the modtime, changes (the touch case). It should change if the file gets moved, which in this consideration is the same as if the file were deleted and a new one appeared.
  • The ETag must only be unique for exactly the same file. It does not matter if files a.txt and b.txt have the same ETag. But it does matter if file a.txt in revision 1 has the ETag X, then in revision 2 the ETag Y, and then, in revision 3, again X. That must not happen.

What ETags are not:

  • ETags do not characterize the content. If the content of a file changes, the ETag changes, as it would for a content-based hash. But if the content of the file changes back to exactly what it was before, the ETag would not go back to what it was before, as it would for a content-hash-based ETag.
  • ETags do not help when it comes to renaming. For that, another identifier that does not change on a file rename is additionally needed.

And now we have a new requirement added, that is

  • It should be possible to quickly recalculate the ETag of a given file. The algorithm should give the same ETag for a file on another client or server.

The key thing here seems to be what it means for two files to be the same. The easy answer is: the file is byte-wise equal and has the same metadata. But we also consider a file equal on A and B if we know that it was transmitted from A to B (i.e. from client to server) and still has the same metadata as it had at the time it was transmitted. In that case both have the same ETag. Exactly that shows the limitation of the idea of recalculating the ETag in general: if we lose the confidence that the file was transmitted from A to B or vice versa (by losing the database, which kind of proves it), we cannot really recalculate the ETag, because any file content with the same metadata would result in the same ETag (remember we do not take the file's content into account).

But maybe we can hold the claim that we actually can recalculate on the server side, because it is not possible to change the file without the server getting notice of the change and thus controlling the ETag.

This might sound a bit academic but I think we should be aware of the pros and cons.

@tenacioustechie

@zatricky I agree, the renaming bit is not important. And as @dragotin suggested, collisions are not important, as the ETag will be associated with a file name as well, as long as you can't have a collision based on a time change (which I think is exceptionally unlikely if SHA-1 is used, based on Git's experience over many years). But the choice of algorithm, I imagine, is fairly important here.

@dragotin I agree, I think we can claim the ETag can be regenerated for the same file based on the current assumptions. I think it may need to be tested a little more before it can be relied on. Whilst possibly academic, I also think it's important that we know the limitations and possible issues of the ETag calculation.

Excuse my lack of knowledge here, just trying to help:

  • Is the resolution of mtime on Windows, Linux, and Mac the same under all circumstances? (If it is not, then we can't recalculate, as the time aspect may have changed by transferring from A to B.)
  • I think that's really what it comes down to: can any aspect of the file metadata used in generating the ETag fail to be set exactly on all platforms through the sync process?
  • The hashing algorithm needs to be one that produces the same hash for the same content. It is also probably desirable for the hash to be drastically different after a small content change. SHA-1 is used by other tools for exactly this purpose, but there may be other options.
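To illustrate the avalanche property mentioned in the last bullet: hashing two metadata strings that differ only by one second of mtime yields digests that disagree in most positions. A small illustration using Python's hashlib (the metadata string format is made up, not ownCloud's):

```python
import hashlib

# Two metadata strings differing only in the mtime field:
a = hashlib.sha1(b"report.txt|1376352000|4096").hexdigest()
b = hashlib.sha1(b"report.txt|1376352001|4096").hexdigest()

# Count how many of the 40 hex digits differ between the two digests;
# for a good hash this is most of them (the avalanche effect).
diff = sum(1 for x, y in zip(a, b) if x != y)
print(a, b, diff)
```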

@zatricky

@dragotin commented:

The ETag must only be unique for exactly the same file.

Two clients can create two files of exactly the same size at the same path at the same time, but with different content, resulting in them both having the same ETag.

The chances of that ever happening in the wild are slim, though I can imagine it happening on a regular basis in a poorly engineered clustered application.

Weighing that up against the current practice of using a random etag, I feel the risk is reasonable.

@karlitschek
Contributor Author

no longer current. closing.

@RandolfCarter
Contributor

No longer current because...? Has it been replaced by another issue?

@martin-rueegg
Contributor

The creation/updating of ETags is still a matter of discussion (e.g. #20474).

I just had an idea that might be a solution in the long run, or complete rubbish right from the beginning ;-)

What about a structured ETag? There is no specification regarding the format ...

 opaque-tag = DQUOTE *etagc DQUOTE
 etagc      = %x21 / %x23-7E / obs-text
            ; VCHAR except double quotes, plus obs-text

  Note: Previously, opaque-tag was defined to be a quoted-string
  ([RFC2616], Section 3.11); thus, some recipients might perform
  backslash unescaping.  Servers therefore ought to avoid backslash
  characters in entity tags.

... nor does it limit its length (which should still be reasonably short, of course).

Just to bring in an idea of how a structured ETag might merit further consideration in order to solve certain issues (e.g. regarding files on remote storage).

So what if the ETag were generated according to the following rule:

ETag              = {version} ; {component} [ ; {component} [ , ... ] ]

version           = "ocs3"
                  ; 3 could be relating to the next version of the 
                  ; "Open Cloud Service" Specification (#21172)

component         = {content} [ , {content} ] = {hash_algorithm} [ , {content_divider} , ] {hash}

content           = [A-Z0-9]
                  ; (case insensitive)
                  ; a numeric, alphanumeric or hexadecimal identifier code for what content is
                  ; used to build up the {hash}
                  ; multiple contents are listed in the order they are processed when hashing
                  ; codes need to be specified elsewhere. each code also needs to specify
                  ; - how specific content (e.g. dates) is formatted
                  ; - if content is altered in any way (e.g. encoding) before hashing

hash_algorithm    = sha1 | md5 | ...
                  ; hash algorithm used for {hash} generation

content_divider   = url-encoded string
                  ; added between every two contents during {hash} generation.
                  ; optional. if missing or of zero length, nothing is added between two contents

hash              = etagc without ";" or "\"
                  ; terminated by ";" to be sure it is complete

So there could be one component for the metadata and another for the file content, if available. Furthermore, the generation rules for ETags could differ depending on the source of the file on the server, without breaking compatibility and while still producing a valid (and cacheable) ETag.

The content codes could be part of the OCS v2 draft 1 specification (see also #21172), e.g. code 0 could indicate a random or "unspecified" ETag ...

The client could interpret the information accordingly, if required.
Also, the server might reply with e.g. 204 No Content if the content component of the If-Match ETag indicates that the content itself did not change but the metadata did. A 206 Partial Content would also be cool, with Content-Range: bytes 0/*, but I'm not sure if that is allowed without the client having sent a Range header. In either case, the client might then do a PROPFIND to update the metadata, or resend the request without If-Match in order to get the content anyway.

Furthermore, it might be possible to generate the ETag on the fly on the server side, based on the ETag received from the client, or at least to respect only those {component(s)} included in the client's ETag request. This way, the client could specifically send a HEAD request in order to find out whether the metadata or the file content has changed.
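A client-side parser for such a structured ETag could be as simple as the sketch below (Python; it assumes a simplified component layout of `content,hash_algorithm,hash` and is only an illustration of the idea, not a spec-conformant parser):

```python
def parse_structured_etag(etag):
    """Parse a hypothetical structured ETag such as
    '"ocs3;M,sha1,abc123;C,md5,def456"' into a version string and a
    list of components (content code, hash algorithm, digest)."""
    parts = etag.strip('"').split(";")
    version, raw_components = parts[0], parts[1:]
    components = []
    for raw in raw_components:
        fields = raw.split(",")
        components.append({
            "content": fields[0],         # what was hashed (metadata, body, ...)
            "hash_algorithm": fields[1],  # e.g. sha1 or md5
            "hash": fields[-1],           # the digest itself
        })
    return {"version": version, "components": components}
```

With a shape like this, a server could choose to compare only the components a client actually sent in If-Match, as suggested above.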


PS: I don't know how much sense it makes to comment on an old issue, and whether or not this comment will still be regarded. On the other hand, this issue was closed with no obvious published reason. Of course I can open a new ticket if required, but I'd like to avoid creating more zombie issues, as there are already many of them with a discussion, no conclusion, no update, and no closure ... ;-)

@lock

lock bot commented Aug 7, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 7, 2019