Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare hashes for files where mtime and size match #5838

Closed
wants to merge 5 commits into from

Conversation

ckamm
Copy link
Contributor

@ckamm ckamm commented Jun 14, 2017

Addresses #5589

@moscicki

ckamm added 2 commits June 14, 2017 12:01
The client is very picky about date strings it accepts. If dates are
formatted with a non-C locale (such as localized weekday names), it
fails to parse it and tests fail in subtle ways.
@ckamm ckamm added this to the 2.4.0 milestone Jun 14, 2017
@ckamm ckamm self-assigned this Jun 14, 2017
@ckamm ckamm requested a review from ogoffart June 14, 2017 11:36
Copy link
Contributor

@ogoffart ogoffart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good work.
I made a few comment, but overall looks good

@@ -706,17 +706,26 @@ class FakeErrorReply : public QNetworkReply

class FakeQNAM : public QNetworkAccessManager
{
public:
using Observer = std::function<void(QByteArray verb, QString fileName)>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be not more generic to have the whole QNetworkRequest as a parameter?
So we could also test "Does the call contains the header FooBar"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in fact, it could return a QNetworkReply *, which would be returned if it is not nullptr

So some tests could throw custom error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, will do.

@@ -120,6 +132,12 @@ QByteArray contentChecksumType()
return type;
}

QList<QByteArray> collisionSafeHashes()
{
static QList<QByteArray> list = { "SHA1" };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well well well.... haven't you heard the news about sha1?
But yeah, since we don't support anything better anyway...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHA1 may not be safe against attacks, but collisions are still overwhelmingly unlikely. Even MD5 should be okay.

// It could be a conflict even if size and mtime match!
// When we have the remote checksum available we can detect
// it without downloading the file.
is_conflict |= (ctx->current == REMOTE_REPLICA ? cur->checksumHeader : other->checksumHeader) != 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is duplicated (and different, in particular the check for collisionSafeHashes) from the logic in PropagateDownloadFile::start
Have you considered always creating the Download job (in any cases) and have the logic only there.

(But it's true that if it's is_conflict is strictly more "true" than the logic in PropagateDownloadFile::start, that's fine if they are different. as long as PropagateDownloadFile does not actually fetch the file)

Copy link
Contributor Author

@ckamm ckamm Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is an "early out" optimization to avoid creating unnecessary jobs. However, it won't gain us much in practice when servers generally have hashes for all files. It may be worth it to remove this logic and just keep the is_conflict=true case to simplify the code. Do you agree?

This will lead to a strong behavior change though: Previously, in the absence of hashes, we'd treat NEW/NEW with same mtime/size as no conflict - now we would.

So I think it's a good idea to switch on the discerning behavior only if we have a hash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that's what i had in mind.

This will lead to a strong behavior change though: Previously, in the absence of hashes, we'd treat NEW/NEW with same mtime/size as no conflict - now we would.

I was thinking the Download job logic would be changed not to download the file if mtime and size are the same, even if there is no hash.

if (_item->_instruction == CSYNC_INSTRUCTION_CONFLICT
&& _item->_size == _item->log._other_size
&& _item->_modtime == _item->log._other_modtime
&& collisionSafeHashes().contains(checksumType)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about doing it even if it's not a collisionsafe hash, it would still be better than before where we were not comparing anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not allow it for Adler32 - it's not designed as a hash function, so collisions are even more likely than its short 32 bit output would suggest, in particular for short strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you accept collision without a hash (from the reconsile phase), but not if adler32 is used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adler32 is not good enough

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing the point, it goes the other way.
We were considering the file to be equal if they have the same size and mtime.
When there is a checksum, we now add a security check by comparing the checksum.

ckamm added 3 commits June 15, 2017 10:30
This is useful for monitoring what kind of network requests are
sent to the fake server. Such as "did this sync cause an upload?"
and "was there a propfind for this path?". It can also inject
custom replies.
* For conflicts where mtime and size are identical:

  a) If there's no remote checksum, skip (unchanged)
  b) If there's a remote checksum, create a PropagateDownload
     and compute the local checksum. If the checksums are identical,
     don't download the file and just update metadata.

* Avoid exposing the existence of checksumTypeId beyond the database
  layer. This makes handling checksums easier in general because they
  can usually be treated as a single blob.

  This change was prompted by the difficulty of producing file_stat_t
  entries uniformly from PROPFINDs and the database.
@ckamm ckamm force-pushed the downloadchecksum branch from 97b0ece to 716b99d Compare June 15, 2017 08:58
@ckamm
Copy link
Contributor Author

ckamm commented Jun 15, 2017

@ogoffart I've fixed up the network observer addition, but left my other changes in a separate commit so they're easier to review. Let me merge when you're done, so I can wipe away that fixup commit.

@ogoffart
Copy link
Contributor

👍

@ckamm
Copy link
Contributor Author

ckamm commented Jun 15, 2017

merged manually

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants