changes in the generation of etags #5305
Comments
File permissions can be user-specific, especially when it comes to shared files/folders. For folders we need a different mechanism: the etag has to change once the folder's content has changed. |
If I understand correctly, ACLs are not propagated from the client to the server in the current implementation; I just tested the Unix permission bits between two clients. In the ETag context, hashing on permissions would also require mapping across the ACLs of different client filesystems (extended attributes of various flavours) and the server. |
The file ACLs are not propagated. But @DeepDiver1975 is right that the share permissions could be different. This is indeed a problem. Another problem is that the path of the file depends on the user (with or without the /Shared folder), and in the future it may even be totally different. Perhaps it's enough to use only the file name? |
@karlitschek: I am not sure I understand why we need to include the path and filename in the hash. I don't have a full overview of the framework, of course, but I have a feeling that you normally use/create ETags in the context of a known path and filename. So this does not change for a file anyway, but it may catch you if you move a file to another folder on the server, and at some point the client would be smart enough to do the move locally too. Do you require ETags to be universally unique for some reason within the owncloud framework? I suppose that all clients should be able to calculate the ETag locally to the same value as the one calculated by the server IF the file was in sync but the databases were lost? That means that on every client filesystem there should be enough common metadata stored about the file permissions so that the ETag value is the same. I wonder what this common subset is. |
There have already been a lot of exchanges about this topic, starting here: #523 (comment) In my opinion, they are really interesting to read/understand. I just copy/paste a few of them here:
I proposed to summarize our last discussions and your explanation of the sync algorithm somewhere. Is there already a wiki page on it? Or the beginning of an explanation? Should I start one? (In that case, which template could I use?) |
Sounds like a great idea to clarify the sync algorithm (and maybe an explanation already exists, I'm not sure). I'd certainly appreciate it when trying to establish whether a symptom I'm seeing is actually an issue or not, and what could be causing it. Whilst I think hashing the content would be the best approach, I understand @karlitschek's point that some existing back ends do not lend themselves to this architecture. I'm starting to think that hashing the metadata (filename, mtime, and size) on both the client side and the server side might be a good solution, as it predictably produces the same value on a given client. That should minimise the risk of the re-download of files that seems to be occurring for some of us. I think it could also go a significant way towards helping with owncloud/client#110 if you are careful to copy the files in a way that maintains the mtime (Is that possible? I don't know enough about mtime to know). |
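On the question of whether a copy can keep the mtime: it can, depending on the tool used. The sketch below is only an illustration, assuming a hypothetical `metadata_hash` helper that hashes filename, mtime (in whole seconds) and size with MD5; neither the helper nor the field layout comes from ownCloud. It compares Python's `shutil.copy2`, which copies file metadata along with the data, against `shutil.copy`, which does not carry over the timestamps.

```python
import hashlib
import os
import shutil
import tempfile


def metadata_hash(path: str) -> str:
    """Hash filename, mtime (whole seconds) and size.

    Hypothetical helper: the field order, the NUL separator and the use
    of MD5 are assumptions made for this illustration.
    """
    st = os.stat(path)
    key = "\0".join([os.path.basename(path), str(int(st.st_mtime)), str(st.st_size)])
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def compare_copies() -> tuple:
    """Return (copy2 keeps the hash, plain copy keeps the hash)."""
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "report.txt")
        with open(src, "w") as fh:
            fh.write("hello")
        # Pin the source mtime to a fixed moment in the past.
        os.utime(src, (1380000000, 1380000000))

        kept_dir = os.path.join(root, "kept")
        plain_dir = os.path.join(root, "plain")
        os.makedirs(kept_dir)
        os.makedirs(plain_dir)

        kept = shutil.copy2(src, kept_dir)   # data plus metadata, including mtime
        plain = shutil.copy(src, plain_dir)  # data and mode only, mtime becomes "now"

        original = metadata_hash(src)
        return (metadata_hash(kept) == original, metadata_hash(plain) == original)
```

Calling `compare_copies()` shows that the metadata hash survives a metadata-preserving copy and changes after a plain one, which is the property a metadata-based ETag would rely on.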
@tenacioustechie commented:
I'm a little worried about the opposite problem. Can we accidentally (or purposefully) make a hash collision by virtue of the fact that it is easy to predict and reproduce the parameters? Related: If we move or rename a file, the etag changes. Depending on how we currently handle moving files I'm not sure if that's anything to worry about. |
We get a bit carried away here.
What ETags are not:
And now we have a new requirement added, that is
The key thing here seems to be what it means for two files to be the same. The easy answer is: the file is byte-wise equal and has the same metadata. But we also consider a file equal on A and B if we know that it was transmitted from A to B (i.e. from client to server) and still has the same metadata as it had at the time it was transmitted. In that case both have the same ETag. Exactly that shows the limitation of the idea of recalculating the ETag in general: if we lose the confidence that the file was transmitted from A to B or vice versa (by losing the database, which more or less proves that), we cannot really recalculate the ETag, because any file content with the same metadata would result in the same ETag (remember that we do not take the file's content into account). But maybe we can hold the claim that we actually can recalculate on the server side, because it is not possible to change the file without the server noticing the change and thus controlling the ETag. This might sound a bit academic, but I think we should be aware of the pros and cons. |
@zatricky I agree, the renaming bit is not important. And as @dragotin suggested, collisions are not important, as the ETag will be associated with a file name as well, as long as you can't have a collision based on a time change (which I think is exceptionally unlikely if SHA1 is used, based on Git's experience over many years). But I imagine the choice of algorithm is fairly important in this. @dragotin I agree, I think we can claim the ETag can be regenerated for the same file based on the current assumptions. I think it may need to be tested a little more before it can be relied on. Whilst possibly academic, I also think it's important that we know the limitations and possible issues with the ETag calculation. Excuse my lack of knowledge here, just trying to help:
|
@dragotin commented:
Two clients can create two files of exactly the same size at the same path at the same time, but with different content, resulting in both having the same etag. The chances of that ever happening in the wild are slim, though I can imagine it happening on a regular basis in a poorly engineered clustered application. Weighing that up against the current practice of using a random etag, I feel the risk is reasonable. |
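The collision scenario in the comment above can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical `metadata_etag` function that hashes only path, mtime and size; the path, timestamp and file contents are made up for illustration.

```python
import hashlib


def metadata_etag(path: str, mtime: int, size: int) -> str:
    """Hypothetical metadata-only ETag: the content is never looked at."""
    key = "\0".join([path, str(mtime), str(size)])
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def content_hash(data: bytes) -> str:
    """Content hash, shown only to prove the two files really differ."""
    return hashlib.sha1(data).hexdigest()


# Two clients write different data of equal length to the same path
# within the same second (all values are illustrative).
path, mtime = "/shared/state.txt", 1380000000
from_client_a = b"balance=100"
from_client_b = b"balance=900"

same_etag = (
    metadata_etag(path, mtime, len(from_client_a))
    == metadata_etag(path, mtime, len(from_client_b))
)
same_content = content_hash(from_client_a) == content_hash(from_client_b)
```

Here `same_etag` is true while `same_content` is false: a metadata-only ETag cannot tell the two writes apart, which is the trade-off being weighed against a random etag.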
no longer current. closing. |
No longer current because...? Has it been replaced by another issue? |
The creation/updating of the ETag is still an issue under discussion (e.g. #20474). I just had an idea that might be a solution for the long run or completely rubbish right from the beginning ;-) What about a structured ETag? There is no specification regarding the format ...
... nor does it limit its length (which should still be reasonably short, of course). This is just to bring in an idea of how a structured ETag might merit further consideration in order to solve certain issues (e.g. regarding files on remote storage). So what if the ETag were generated according to the following rule
So there could be one component for metadata and another for the file content, if available. Furthermore, the generation rules for ETags could differ depending on the source of the file on the server, without breaking compatibility and while still being a valid (and cacheable) ETag. The content codes could be part of the OCS v2 draft 1 specification (see also #21172). e.g. code the client could interpret the information accordingly, if required. Furthermore, it might be possible to generate the ETag on the fly on the server side, based on the ETag received by the client, or at least to respect only those
PS: I don't know how much sense it makes to comment on an old issue, or whether this comment will still be regarded. On the other hand, this issue has been closed with no obvious published reason. Of course I can open a new ticket, if required, but I'd like to avoid creating more zombie issues, as there are already many of them with a discussion, no conclusion, no update, and not closed ... ;-) |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Currently the etag is a random number that is generated when a new file is created or an existing file changes.
I propose changing it as follows:
The etag is a hash over the metadata of a file or folder.
For a file, you concatenate the file path, the filename, the mtime, the size and the file permissions. We then build an MD5 hash out of that and use it as the ETag.
For folders, we concatenate the etags of all entries (files/folders) in the folder and build an MD5 hash out of that as the ETag.
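The proposal can be sketched roughly as follows. This is not ownCloud code: the function names, the field order, the NUL separator and the sorting of child etags are all assumptions made for the sake of a runnable illustration, and the full path is taken to include the filename.

```python
import hashlib
from typing import Iterable


def file_etag(path: str, mtime: int, size: int, permissions: int) -> str:
    """MD5 over the concatenated file metadata.

    Assumptions: `path` already contains the filename, and the fields
    are joined with a NUL byte so adjacent values cannot run together.
    """
    fields = [path, str(mtime), str(size), str(permissions)]
    return hashlib.md5("\0".join(fields).encode("utf-8")).hexdigest()


def folder_etag(child_etags: Iterable[str]) -> str:
    """MD5 over the concatenated etags of all entries in a folder.

    Assumption: the etags are sorted first, so the result does not
    depend on the order in which the entries are listed.
    """
    return hashlib.md5("".join(sorted(child_etags)).encode("utf-8")).hexdigest()


# Example with made-up metadata: changing one file changes the folder etag.
a = file_etag("/docs/a.txt", 1380000000, 120, 27)
b = file_etag("/docs/b.txt", 1380000500, 4096, 27)
before = folder_etag([a, b])
after = folder_etag([file_etag("/docs/a.txt", 1380000900, 121, 27), b])
```

Because a folder's etag is derived from its entries, a change to any file changes the etag of every parent folder up to the root, so a client could detect changes by comparing folder etags before descending.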
The benefit of that is:
References:
#5264
#523
@dragotin @danimo @icewind1991 @bartv2 @DeepDiver1975