Optimization: Remove unnecessary elements in character classes #59
Comments
Thanks for your feature request! This is already on my to-do list, but is tricky to get right. Another reason why we need this is to prevent the following: ![w !d] A negated character set matching neither |
The exact set of characters matched by each character set is defined in pomsky, right? Then couldn't you parse them into an interval set? These interval sets can be efficiently unioned, intersected, and compared (equal, subset, disjoint). |
Yes, except that we want to preserve |
Preserving character sets and Unicode properties is not mutually exclusive with using interval sets. It's of course true that interval sets do not preserve the elements that created them, but that's also not really a problem. I meant to suggest that the optimizer should have a way to get the interval set from character elements, not that character classes should be represented by interval sets. |
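To make the suggestion above concrete, here is a minimal sketch of such an interval set in Rust. The type `IntervalSet` and all of its methods are hypothetical names chosen for illustration and are not taken from Pomsky's code base. The sketch stores inclusive code point ranges in sorted, merged form, which is what makes union, equality, subset, and disjointness checks simple.

```rust
/// A set of code points stored as sorted, non-overlapping, non-adjacent
/// inclusive ranges. Hypothetical type for illustration, not Pomsky's own.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct IntervalSet {
    ranges: Vec<(u32, u32)>,
}

impl IntervalSet {
    /// Builds a normalized set from arbitrary inclusive ranges.
    pub fn from_ranges(mut input: Vec<(u32, u32)>) -> Self {
        input.retain(|&(lo, hi)| lo <= hi); // drop empty ranges
        input.sort_unstable();
        let mut ranges: Vec<(u32, u32)> = Vec::with_capacity(input.len());
        for (lo, hi) in input {
            if let Some(last) = ranges.last_mut() {
                // Merge when the new range overlaps or directly follows the last one.
                if lo <= last.1.saturating_add(1) {
                    last.1 = last.1.max(hi);
                    continue;
                }
            }
            ranges.push((lo, hi));
        }
        IntervalSet { ranges }
    }

    /// Convenience constructor from inclusive `char` ranges.
    pub fn from_chars(ranges: &[(char, char)]) -> Self {
        Self::from_ranges(ranges.iter().map(|&(lo, hi)| (lo as u32, hi as u32)).collect())
    }

    /// Union of two sets; the result is normalized again.
    pub fn union(&self, other: &Self) -> Self {
        let mut all = self.ranges.clone();
        all.extend_from_slice(&other.ranges);
        Self::from_ranges(all)
    }

    /// True if every code point in `self` is also in `other`.
    /// Because `other` is normalized, each range of `self` must fit
    /// inside a single range of `other`.
    pub fn is_subset(&self, other: &Self) -> bool {
        self.ranges.iter().all(|&(lo, hi)| {
            other.ranges.iter().any(|&(olo, ohi)| olo <= lo && hi <= ohi)
        })
    }

    /// True if the two sets share no code point.
    pub fn is_disjoint(&self, other: &Self) -> bool {
        self.ranges.iter().all(|&(lo, hi)| {
            other.ranges.iter().all(|&(olo, ohi)| hi < olo || ohi < lo)
        })
    }
}
```

Equality comes for free from the derived `PartialEq`, since two normalized sets are equal exactly when their range lists are equal. The subset and disjointness checks here are quadratic in the number of ranges; a merge-style linear scan would be the obvious refinement if that ever mattered.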
This has been implemented for the simple case (overlapping character ranges):
['a'-'f' 'b'-'g'] // --> [a-g]
['a'-'z' 'b'] // --> [a-z]
There's a special case to remove |
Is your feature request related to a problem? Please describe.
Pomsky currently does not remove unnecessary elements in character classes. E.g. [ w "abc" ] compiles to [\wabc] (Java). However, the abc is unnecessary because [\wabc] == \w.
Describe the solution you'd like
Remove unnecessary elements in character classes to optimize and simplify them.
Additional context
This requires knowing the precise set of characters accepted by each character class element. For an example implementation of this, check out the regexp/no-dupe-characters-character-class rule.
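As a rough sketch of the requested optimization, the following Rust code drops every character class element whose characters are already covered by the remaining elements. Everything here is an assumption made for illustration: `Element`, `element`, and `remove_redundant` are hypothetical names, and the code points for `w` are an ASCII-only approximation of what `\w` actually matches in a given regex flavor.

```rust
/// Inclusive range of code points.
type Range = (u32, u32);

/// Sorts and merges ranges so they are disjoint and non-adjacent.
fn normalize(mut ranges: Vec<Range>) -> Vec<Range> {
    ranges.sort_unstable();
    let mut out: Vec<Range> = Vec::new();
    for (lo, hi) in ranges {
        if let Some(last) = out.last_mut() {
            if lo <= last.1.saturating_add(1) {
                last.1 = last.1.max(hi);
                continue;
            }
        }
        out.push((lo, hi));
    }
    out
}

/// True if every range in `inner` lies within one range of the normalized `outer`.
fn covered(inner: &[Range], outer: &[Range]) -> bool {
    inner
        .iter()
        .all(|&(lo, hi)| outer.iter().any(|&(olo, ohi)| olo <= lo && hi <= ohi))
}

/// Hypothetical character class element: its source text plus the
/// code points it matches.
#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    pub source: String,
    pub ranges: Vec<Range>,
}

/// Hypothetical helper to build an element from inclusive `char` ranges.
pub fn element(source: &str, ranges: &[(char, char)]) -> Element {
    Element {
        source: source.to_string(),
        ranges: ranges.iter().map(|&(lo, hi)| (lo as u32, hi as u32)).collect(),
    }
}

/// Removes each element that is fully covered by the union of the other
/// elements that are still kept. Elements are visited in order, so of two
/// identical elements only the first is dropped and the second survives.
pub fn remove_redundant(elements: Vec<Element>) -> Vec<Element> {
    let mut keep = vec![true; elements.len()];
    for i in 0..elements.len() {
        let others: Vec<Range> = elements
            .iter()
            .enumerate()
            .filter(|&(j, _)| j != i && keep[j])
            .flat_map(|(_, e)| e.ranges.iter().copied())
            .collect();
        if covered(&elements[i].ranges, &normalize(others)) {
            keep[i] = false;
        }
    }
    elements
        .into_iter()
        .zip(keep)
        .filter(|(_, k)| *k)
        .map(|(e, _)| e)
        .collect()
}
```

With `w` approximated as `0-9`, `A-Z`, `_`, and `a-z`, calling `remove_redundant` on the elements of [ w "abc" ] keeps only `w`. The pass is greedy and order-dependent, so it is not guaranteed to find the smallest possible class, and it only works if the code point set of each element is known exactly for the target flavor.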