Sets and equivalences #54

Open
tonyg opened this issue May 8, 2018 · 4 comments

@tonyg

tonyg commented May 8, 2018

TJSON looks really nice! Thank you for your work on the specification thus far. I have some questions relating to TJSON's equivalence relation.

A) Is this a valid set?

{"maybe-valid-set:S<s>": ["päron", "päron"]}

Its UTF-8 encoding is as follows:

00000000: 7b22 6d61 7962 652d 7661 6c69 642d 7365  {"maybe-valid-se
00000010: 743a 533c 733e 223a 205b 2270 c3a4 726f  t:S<s>": ["p..ro
00000020: 6e22 2c20 2270 61cc 8872 6f6e 225d 7d    n", "pa..ron"]}
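To make the byte-level difference in case A concrete, here is a minimal Python sketch (an illustration, not something the TJSON spec prescribes). It rebuilds the two strings from the bytes in the hexdump and compares them with and without NFC normalization, using only the standard `unicodedata` module.

```python
import unicodedata

# The two set members from the hexdump, as raw UTF-8 bytes.
precomposed = bytes.fromhex("70c3a4726f6e").decode("utf-8")   # uses U+00E4
decomposed = bytes.fromhex("7061cc88726f6e").decode("utf-8")  # 'a' + U+0308

# Compared code point by code point, the strings differ.
raw_equal = precomposed == decomposed

# After NFC normalization they are the same string.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal, nfc_equal)  # False True
```

So whether A is a valid set depends entirely on whether the equivalence relation normalizes strings before comparing them.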

B) Is this a valid set?

{"maybe-valid-set:S<O>": [ {"a:A<>":[]}, {"a:A<s>":[]} ]}

C) Is this a valid set?

{"maybe-valid-set:S<O>": [ {"a:s":"m", "z:s":"n"}, {"z:s":"n", "a:s":"m"} ]}

D) Is this a valid set?

{"maybe-valid-set:S<O>": [
    {"hi:d16": "48656c6c6f2c20776f726c6421"},
    {"hi:d64": "SGVsbG8sIHdvcmxkIQ"} ]}
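For case D, a quick check shows that the two members carry the same bytes under different encodings. This sketch assumes `d16` is hexadecimal and `d64` is unpadded URL-safe base64; `decode_b64url_nopad` is a hypothetical helper name, not a TJSON API.

```python
import base64

hex_value = "48656c6c6f2c20776f726c6421"
b64_value = "SGVsbG8sIHdvcmxkIQ"

def decode_b64url_nopad(text: str) -> bytes:
    # Restore the padding that the unpadded form omits, then decode.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

from_hex = bytes.fromhex(hex_value)
from_b64 = decode_b64url_nopad(b64_value)

print(from_hex, from_b64, from_hex == from_b64)
```

Both values decode to `b'Hello, world!'`, so the members differ only in their type tags.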
@tarcieri
Contributor

These are fantastic questions, and they call into question whether including sets as a data structure is even a good idea (cc @benlaurie)

To break this down into concrete issues:

A) If I understand correctly, this is about Unicode canonicalization. In similar work (i.e. objecthash) this is a "knob": implementations may selectively enable Unicode canonicalization, and in TJSON I'd suggest pursuing something similar. What the default should be is debatable, but I'd be in favor of canonicalizing by default. The spec doesn't currently address this; it probably should, and it probably deserves its own issue.

B) In my opinion this should be rejected, as the two representations map to the same content even though their type signatures differ.

C) Should be rejected

D) Should be rejected

I think it might actually be interesting to lean on objecthash for solving this problem: if 2+ members of the set compute the same objecthash, the message should be invalid. However, I'm not sure it makes sense to make objecthash a mandatory entangling dependency of TJSON.
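A rough sketch of the hash-based duplicate check, in Python. Note the assumptions: `content_hash` is a hypothetical stand-in (SHA-256 over a sorted-key JSON rendering of already-decoded values), not the actual objecthash algorithm, and `is_valid_set` is likewise an illustrative name.

```python
import hashlib
import json

def content_hash(value) -> str:
    # Hypothetical stand-in for objecthash: hash a canonical rendering
    # (sorted keys, no insignificant whitespace) of a decoded value.
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_valid_set(members) -> bool:
    # Reject the set as soon as two members produce the same digest.
    seen = set()
    for member in members:
        digest = content_hash(member)
        if digest in seen:
            return False
        seen.add(digest)
    return True

# Case C after decoding: same keys and values, different key order.
case_c = [{"a": "m", "z": "n"}, {"z": "n", "a": "m"}]
print(is_valid_set(case_c))                    # False
print(is_valid_set([{"a": "m"}, {"a": "n"}]))  # True
```

The check only rejects B and D as well if the decoding step first strips type tags and decodes binary data, which is exactly the part an equivalence definition would need to pin down.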

@tonyg
Author

tonyg commented May 11, 2018

Leaning on objecthash is definitely interesting, since a strong hash function computes equivalence classes (with high probability). Alternatively, it could be within reach to define an equivalence relation for TJSON itself. This could be the foundation of lots of other stuff; JSON lacks such a relation and it's at the root of a lot of the headaches people have with it. (Maybe you could even define a total ordering over TJSON terms! That'd be even handier.)

I also am inclined to think B, C and D should be invalid sets.

Regarding A, though, and Unicode normalization: could it be that the right thing is to leave it to readers/writers to normalize or not? And that the equivalence for strings should be code point by code point (or as RFC 7159 sec 8.3 says, "code unit by code unit", ew)?

As an outside crazy idea: could tagging an expected normalization form make sense?? Consumers could then reject and/or renormalize if a contained, tagged string did not match its declared expected normalization. {"fruit:s:nfc": "päron", "name:s:nfkc": "tony"}
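A minimal sketch of how a consumer might check such a tag, in Python. The `name:s:form` key syntax is only the idea floated above, not part of TJSON, and `check_tagged_string` is a hypothetical helper.

```python
import unicodedata

# Map the proposed lowercase tags to the names unicodedata expects.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}

def check_tagged_string(key: str, value: str) -> bool:
    # Split the hypothetical "name:s:form" key into its three parts.
    name, type_tag, form_tag = key.rsplit(":", 2)
    if type_tag != "s" or form_tag not in FORMS:
        raise ValueError(f"unsupported tag in key {key!r}")
    # The value matches its tag if normalizing it changes nothing.
    return unicodedata.normalize(FORMS[form_tag], value) == value

print(check_tagged_string("fruit:s:nfc", "p\u00e4ron"))   # True
print(check_tagged_string("fruit:s:nfc", "pa\u0308ron"))  # False
print(check_tagged_string("name:s:nfkc", "tony"))         # True
```

A consumer that prefers to renormalize rather than reject could return the normalized string instead of a boolean.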

Finally, I want to propose a couple more cases for consideration:

E) Is this a valid set?

{"maybe-valid-set:S<O>": [
  {"meaning-of-life:i": "42"},
  {"meaning-of-life:u": "42"} ]}

F) Is this a valid set?

{"maybe-valid-set:S<O>": [
  {"meaning-of-life:i": "42"},
  {"meaning-of-life:f": 42.0} ]}

@tarcieri
Contributor

One thing that might massively simplify sets for now is to only allow sets of scalars. That would invalidate B-F.

Otherwise this seems like a deep rabbit hole...

@tonyg
Author

tonyg commented Jun 11, 2018

That could definitely help. The user would specify S<i>, S<f>, S<s> and so on, making terms like {"x:S<i>": [ "42", 42, 42.0 ]} ill-formed. For S<s>, picking codepoint-by-codepoint comparison still seems like the right choice to me.
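A sketch of what a well-formedness check for `S<i>` could look like under the scalars-only rule, in Python. It assumes `:i` values are carried as JSON strings holding a decimal integer (as in the `"42"` examples above); `check_int_set` and `INT_PATTERN` are illustrative names, not TJSON APIs.

```python
import re

# Assumed lexical form of an integer: optional sign, no leading zeros.
INT_PATTERN = re.compile(r"-?(0|[1-9][0-9]*)")

def check_int_set(elements) -> bool:
    seen = set()
    for element in elements:
        # Every member must be a string in integer form; bare JSON
        # numbers such as 42 or 42.0 make the set ill-formed.
        if not isinstance(element, str) or not INT_PATTERN.fullmatch(element):
            return False
        number = int(element)
        if number in seen:
            return False  # duplicate member
        seen.add(number)
    return True

print(check_int_set(["42", 42, 42.0]))  # False
print(check_int_set(["1", "2", "3"]))   # True
print(check_int_set(["42", "42"]))      # False
```

For `S<s>`, Python's built-in `str` equality already compares code point by code point, so a plain `set` of the decoded strings would implement the comparison suggested here.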


2 participants