This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

Grammar: Support strict UTF-8 strings and be strict regarding numbers #18

Merged
merged 8 commits into from
Aug 15, 2016

@Hywan Hywan commented Feb 5, 2016

Partially fixes #17.

Regarding numbers:

We use [0-9] instead of \d because \d might match characters other than the ASCII digits (for example, when Unicode matching is enabled).
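
To illustrate the difference, here is a minimal sketch in Python rather than in the PHP/PCRE setting of this patch. The two patterns follow the number shape from RFC 7159, section 6; they are assumptions made for the example and are not the tokens from the Hoa grammar. The names `STRICT_NUMBER`, `LOOSE_NUMBER`, `is_strict` and `is_loose` are hypothetical.

```python
import re

# Number shape from RFC 7159, section 6, written with an explicit ASCII class.
# Illustrative only: this is not the token used in the Hoa grammar.
STRICT_NUMBER = re.compile(
    r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"
)

# Same shape using \d. For str patterns, Python's \d is Unicode-aware,
# so it also accepts decimal digits outside ASCII.
LOOSE_NUMBER = re.compile(
    r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?"
)


def is_strict(text: str) -> bool:
    """True if the whole text matches the ASCII-only number pattern."""
    return STRICT_NUMBER.fullmatch(text) is not None


def is_loose(text: str) -> bool:
    """True if the whole text matches the \\d-based number pattern."""
    return LOOSE_NUMBER.fullmatch(text) is not None


# ASCII "1" followed by ARABIC-INDIC DIGIT TWO and THREE.
mixed_digits = "1\u0662\u0663"

print(is_strict("12.5e3"), is_loose("12.5e3"))
print(is_strict(mixed_digits), is_loose(mixed_digits))
```

With the `\d` variant, the mixed-digit input is accepted even though it is not a valid JSON number; the `[0-9]` variant rejects it.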

Regarding strings:

UTF-8 is an issue with JSON, because we must handle surrogate pairs
(RFC 7159, sections 7 and 8: [1], [2]). This patch implements UTF-8 not
only for validating a datum but also for generating one. This means that
we only generate valid UTF-8 strings and only validate/recognize valid
UTF-8 strings.
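
As a rough illustration of why surrogate pairs matter, the sketch below uses Python's standard `json` module, not the Hoa grammar from this patch. It assumes a hypothetical helper, `decodes_to_valid_utf8`, that decodes a JSON string literal and checks whether the result can be encoded as UTF-8. A complete `\uD83D\uDE00` pair maps to a single code point, while a lone `\uD83D` has no valid UTF-8 encoding.

```python
import json


def decodes_to_valid_utf8(json_text: str) -> bool:
    """Return True if the JSON string literal decodes to text
    that can be encoded as UTF-8 (hypothetical helper)."""
    value = json.loads(json_text)
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        # A lone surrogate cannot be represented in UTF-8.
        return False
    return True


# A complete surrogate pair, written as JSON escapes.
pair = '"\\ud83d\\ude00"'
# Only the high surrogate, with no low surrogate following it.
lone = '"\\ud83d"'

print(decodes_to_valid_utf8(pair))
print(decodes_to_valid_utf8(lone))
```

A strict UTF-8 grammar, as described above, would neither generate nor recognize the second literal.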

UTF-16 strings will follow in another patch.

Unfortunately, we decided to implement the whole string as a token, not
as rules. This is unfortunate because the grammar coverage algorithms in
Hoa\Compiler apply only to rules, not to tokens. Expressing strings as
rules is a potential future optimisation.

Now we have 159876 assertions!

Hywan commented Feb 5, 2016

Please @Jir4 or @jubianchi, can you review it?
This is hard.

[1]: http://tools.ietf.org/html/rfc7159#section-7
[2]: http://tools.ietf.org/html/rfc7159#section-8

Hywan commented Feb 5, 2016

Inspired by https://github.com/php/php-src/blob/71c19800258ee3a9548af9a5e64ab0a62d1b1d8e/ext/json/json_scanner.re (this patch also fixes a bug found there, which I still need to report).

Jir4 commented Feb 5, 2016

Umh ok, i'll review the first part and @jubianchi the second 😅

@Hywan Hywan changed the title Grammar: Suport UTF-8 strings and be strict regarding numbers Grammar: Suport strict UTF-8 strings and be strict regarding numbers Feb 8, 2016
@Hywan Hywan changed the title Grammar: Suport strict UTF-8 strings and be strict regarding numbers Grammar: Support strict UTF-8 strings and be strict regarding numbers Feb 8, 2016

Hywan commented Feb 8, 2016

Patches updated to use the `lexer.unicode` pragma instead of the `unicode` pragma.

The JSON grammar is $LL(k=0)$. Fixing $k$ to $0$ is good for performance:
it avoids potentially useless lookahead and makes the parser fail early.
@Bhoat Bhoat merged commit 368eb22 into hoaproject:master Aug 15, 2016
3 participants