Diffing_Engine: Implement hashing #1100

Closed
alelom opened this issue Jul 17, 2019 · 3 comments · Fixed by #1150
Labels
size:L Measured in days type:feature New capability or enhancement

Comments

@alelom
Member

alelom commented Jul 17, 2019

After some thought and discussion, we concluded that the diffing (at least its first step, the "collection-level" diffing) should work with hashes rather than with custom comparers.

This is for several reasons:

  • Comparing two objects is not enough to keep track of which objects have changed from one iteration to the next. Once the objects are modified, you can't tell which one was originally different. The old version of the hash can be saved in the object itself, making it possible to discern which objects have been modified in different iterations.
  • The base scenario of comparing two sets of objects expects a deep (complete) comparison, so by default the diffing should look at all their properties. In some cases, though, some properties may need to be ignored; this is where the hash will have to be computed with some exceptions.
  • Hashing has the potential to make the "collection-level" diffing fast and data transmission lighter.

Downside:

  • With hashing, you lose the ability to have two exactly identical objects. For example, you can't have two identical beams in the same position.

I don't think the downside should be seen as a problem. I can't find any good reason why we should support duplicates. If a case like that exists, I think we should rather rethink why/how that case is allowed to exist.
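The first reason above (a hash stored on the object lets you tell modified objects from new ones) can be sketched as follows. This is only an illustration under assumptions: the `Tracked` class, its fields, and `Classify` are hypothetical names and not part of the Diffing_Engine.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an object that carries its own fingerprints.
public sealed class Tracked
{
    public string Name;
    public string Hash;          // fingerprint computed in the current iteration
    public string PreviousHash;  // fingerprint saved on the object at the last iteration, or null
}

public static class HashDiffSketch
{
    // Classifies one current object against the set of hashes from the previous iteration.
    public static string Classify(Tracked obj, ISet<string> previousHashes)
    {
        // Same fingerprint as before: nothing changed.
        if (previousHashes.Contains(obj.Hash))
            return "unchanged";

        // Fingerprint differs, but the saved old fingerprint is known: the object was modified.
        if (obj.PreviousHash != null && previousHashes.Contains(obj.PreviousHash))
            return "modified";

        // No trace of the object in the previous iteration.
        return "new";
    }
}
```

Note that two exactly identical objects would produce the same fingerprint and collapse into one entry of `previousHashes`, which is the downside described above.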

1. Hashing requirements

  1. Generate a hash that acts as a "Fingerprint" for our objects.
  2. The fingerprint must exclude a certain set of properties:
    • All properties that change every time the user opens/closes the solution, notably BH.oM.Base.BHoMObject GUID
    • If the hash of the object is saved in the objects (e.g. in the Fragments), then that must be excluded as well.
    • Any other property that we decide is appropriate to exclude (BH.oM.Base.BHoMObject CustomData?)
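A minimal sketch of a fingerprint that skips a given set of properties, assuming a shallow, reflection-based approach. The `Beam` class, its `Id` property, and `FingerprintSketch.Fingerprint` are hypothetical and only stand in for the real objects; nested objects are simply converted with `ToString`, so this is not a deep comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Reflection;
using System.Security.Cryptography;
using System.Text;

// Hypothetical object: Id plays the role of a GUID that changes between sessions.
public sealed class Beam
{
    public Guid Id { get; set; } = Guid.NewGuid();
    public string Name { get; set; }
    public double Length { get; set; }
}

public static class FingerprintSketch
{
    // Hashes "Name=Value" pairs of all public readable properties except the excluded ones.
    public static string Fingerprint(object obj, ISet<string> excludedProperties)
    {
        var parts = obj.GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.CanRead
                        && p.GetIndexParameters().Length == 0
                        && !excludedProperties.Contains(p.Name))
            .OrderBy(p => p.Name, StringComparer.Ordinal) // fixed order, so the result is stable
            .Select(p => p.Name + "=" + Convert.ToString(p.GetValue(obj), CultureInfo.InvariantCulture));

        byte[] bytes = Encoding.UTF8.GetBytes(string.Join(";", parts));
        using (var sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(bytes)).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

With `Id` in the excluded set, two beams that share `Name` and `Length` get the same fingerprint even though their identifiers differ.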

2. Algorithm to generate the hash

The standard .NET GetHashCode() can return different results depending on the environment. For example, for the same identical string, two different users could get two different hash codes.

So we can't just rely on the standard .NET GetHashCode().

Since we need a platform-independent algorithm to generate hashes, we could use:

  • MD5, although it is generally regarded as not safe against collisions.
  • SHA or its variants, which are safer against collisions.
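As an illustration of the platform-independent option, here is a minimal sketch that hashes a string with SHA-256 from `System.Security.Cryptography`. The class and method names (`StableHashSketch`, `Sha256Hex`) are made up for this example.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StableHashSketch
{
    // Returns the SHA-256 digest of the UTF-8 bytes of the text, as lowercase hex.
    public static string Sha256Hex(string text)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

`StableHashSketch.Sha256Hex("abc")` returns `ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad` on any machine, because the digest depends only on the input bytes.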

3. Serialization to generate hashing

Hashing algorithms operate on bytes, so the object first has to be serialised to a byte[].

The standard .NET binary serialisation requires all classes to be marked as [Serializable], which we don't want.
We cannot add the Serializable attribute at runtime either.
So we need another serializer.

3.1 Protobuf-net (deprecated proposal)

A workaround could be to use Protocol Buffers to serialise to byte[], selecting the properties we want to serialise. This would also solve point (2): it seems easy to exclude the properties we don't want to be part of the serialization (and therefore of the fingerprint).

3.2 Use of MongoDB

As Eduardo suggested, we already have a good serializer in Mongo_Toolkit and we can leverage it.
We just need to find a way to avoid serializing some types. These exceptions will work as a custom comparer for objects.

@epignatelli
Member

epignatelli commented Jul 18, 2019

We do serialise to bson before we get to the json string.

public static string ToJson(this object obj)
{
    if (obj is string)
        return "{ \"_t\": \"System.String\", \"_v\": " + BsonExtensionMethods.ToJson<string>(obj as string) + "}";
    var jsonWriterSettings = new JsonWriterSettings { OutputMode = JsonOutputMode.Strict };
    return Convert.ToBson(obj).ToJson<BsonDocument>(jsonWriterSettings);
}

public static BsonDocument ToBson(this object obj)
{
    if (!m_TypesRegistered)
        RegisterTypes();
    if (obj is string)
    {
        BsonDocument document;
        BsonDocument.TryParse(obj as string, out document);
        return document;
    }
    else
        return obj.ToBsonDocument();
}

Probably obj.ToBson().AsByteArray can help?

alelom added a commit that referenced this issue Jul 18, 2019
@alelom
Member Author

alelom commented Jul 18, 2019

> Probably obj.ToBson().AsByteArray can help?

As discussed during our chat, that's definitely an excellent suggestion, thanks!
I will look into it. As we said, the problem is that you still need to tell the Mongo serialization how to exclude certain properties, as per my requirement (2).

@alelom
Member Author

alelom commented Jul 26, 2019

I've now incorporated Eduardo's suggestion in the diffing engine: objects are serialised through MongoDB's BSON, and the resulting bytes are used to generate a SHA-256 hash.
