Optimization: Remove unnecessary elements in character classes #59

RunDevelopment · 2022-12-24T11:07:07Z

Is your feature request related to a problem? Please describe.

Pomsky currently does not remove unnecessary elements in character classes. E.g. [ w "abc" ] compiles to [\wabc] (Java). However, the abc is unnecessary because [\wabc] == \w.
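
To make the claim concrete, here is a minimal sketch of the check, not Pomsky's actual code. It assumes the Java flavor without `UNICODE_CHARACTER_CLASS`, where `\w` is `[a-zA-Z_0-9]`; the names `ASCII_WORD`, `is_covered`, and `prune_literals` are hypothetical.

```rust
/// Assumption: `\w` as ASCII ranges, i.e. [0-9A-Z_a-z].
const ASCII_WORD: [(char, char); 4] = [('0', '9'), ('A', 'Z'), ('_', '_'), ('a', 'z')];

/// Returns true if `c` falls inside any of the given inclusive ranges.
fn is_covered(c: char, ranges: &[(char, char)]) -> bool {
    ranges.iter().any(|&(lo, hi)| lo <= c && c <= hi)
}

/// Keeps only the literal characters that are NOT already matched by `ranges`.
fn prune_literals(literals: &str, ranges: &[(char, char)]) -> String {
    literals.chars().filter(|&c| !is_covered(c, ranges)).collect()
}
```

With these definitions, `prune_literals("abc", &ASCII_WORD)` yields an empty string, so nothing besides `\w` would need to be emitted, whereas a literal such as `-` would be kept.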

Describe the solution you'd like

Remove unnecessary elements in character classes to optimize and simplify them.

Additional context

This requires knowing the precise set of characters accepted by each character class element. For an example implementation of this, check out the regexp/no-dupe-characters-character-class rule.


Aloso · 2022-12-28T12:15:23Z

Thanks for your feature request! This is already on my to-do list, but is tricky to get right.

Another reason why we need this is to prevent the following:

![w !d]

A negated character set matching neither \w nor \D matches nothing, which is forbidden in Rust. So I'm working on a way to determine whether two character classes overlap, are disjunct, or one is a subset of the other.
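
One way to detect this case is to check whether the union of all elements covers every code point, in which case the negation is empty. The following is only a sketch under assumptions: `\w` and `\d` are taken to be ASCII-only, code points are plain `u32` values up to `0x10FFFF` (surrogates are not treated specially), and all function names are hypothetical.

```rust
const MAX_CODE_POINT: u32 = 0x10FFFF;

/// Assumption: ASCII `\w`, i.e. [0-9A-Z_a-z], as inclusive code point ranges.
fn ascii_word() -> Vec<(u32, u32)> {
    vec![(0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A)]
}

/// Assumption: `\D` as the complement of ASCII [0-9].
fn not_ascii_digit() -> Vec<(u32, u32)> {
    vec![(0x00, 0x2F), (0x3A, MAX_CODE_POINT)]
}

/// Returns true if the union of `ranges` contains every code point.
/// If so, a negated class built from these ranges matches nothing.
fn covers_all_code_points(mut ranges: Vec<(u32, u32)>) -> bool {
    ranges.sort();
    // Smallest code point not yet known to be covered.
    let mut next = 0u32;
    for (lo, hi) in ranges {
        if lo > next {
            return false; // gap found: `next` is not matched by any range
        }
        next = next.max(hi.saturating_add(1));
    }
    next > MAX_CODE_POINT
}
```

Concatenating `ascii_word()` and `not_ascii_digit()` and passing the result to `covers_all_code_points` returns `true`, which flags `![w !d]` as an empty class; `ascii_word()` alone returns `false`.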

RunDevelopment · 2022-12-28T12:25:18Z

> So I'm working on a way to determine whether two character classes overlap, are disjunct, or one is a subset of the other.

The exact set of characters matched by each character set is defined in pomsky, right? Then couldn't you parse them into an interval set? These interval sets can be efficiently unioned, intersected, and compared (equal, subset, disjoint).
That's what the regex crate also does under the hood. We also do this for eslint-plugin-regexp. Having this representation for characters, character sets, and character classes makes it pretty easy to implement some optimizations.
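
As an illustration of the idea, here is a small interval set sketch with union, intersection, and the three comparisons mentioned above. It is not the API of the regex crate or of eslint-plugin-regexp; the `IntervalSet` type and its methods are hypothetical, and code points are plain `u32` values.

```rust
/// A set of code points stored as sorted, non-overlapping, non-adjacent inclusive ranges.
#[derive(Debug, Clone, PartialEq, Eq)]
struct IntervalSet {
    ranges: Vec<(u32, u32)>,
}

impl IntervalSet {
    /// Builds the canonical form: sort, then merge overlapping or adjacent ranges.
    fn new(mut ranges: Vec<(u32, u32)>) -> Self {
        ranges.sort();
        let mut out: Vec<(u32, u32)> = Vec::new();
        for (lo, hi) in ranges {
            if let Some(last) = out.last_mut() {
                if lo <= last.1.saturating_add(1) {
                    last.1 = last.1.max(hi);
                    continue;
                }
            }
            out.push((lo, hi));
        }
        IntervalSet { ranges: out }
    }

    fn union(&self, other: &Self) -> Self {
        let mut all = self.ranges.clone();
        all.extend_from_slice(&other.ranges);
        Self::new(all)
    }

    /// Two-pointer sweep over both sorted range lists.
    fn intersect(&self, other: &Self) -> Self {
        let (mut i, mut j) = (0, 0);
        let mut out = Vec::new();
        while i < self.ranges.len() && j < other.ranges.len() {
            let (a, b) = (self.ranges[i], other.ranges[j]);
            let (lo, hi) = (a.0.max(b.0), a.1.min(b.1));
            if lo <= hi {
                out.push((lo, hi));
            }
            // Advance whichever range ends first.
            if a.1 < b.1 { i += 1 } else { j += 1 }
        }
        IntervalSet { ranges: out }
    }

    fn is_disjoint(&self, other: &Self) -> bool {
        self.intersect(other).ranges.is_empty()
    }

    /// `self` is a subset of `other` if intersecting with `other` removes nothing.
    fn is_subset(&self, other: &Self) -> bool {
        self.intersect(other) == *self
    }
}
```

Because every set is kept in canonical form, equality is a plain comparison of the range lists, and an element is redundant exactly when taking the union with it leaves the set unchanged.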

Aloso · 2022-12-28T12:33:51Z

Yes, except that we want to preserve \w, \d, \s, \p{Greek}, \p{Separator}, etc. rather than lowering them to a lot of ranges, so we can emit the smallest possible output.

RunDevelopment · 2022-12-28T13:11:23Z

Preserving character sets and Unicode properties is not mutually exclusive with using interval sets. It's of course true that interval sets do not preserve the elements that created them, but that's also not really a problem. I meant to suggest that the optimizer should have a way to get the interval set from character elements, not that character classes should be represented by interval sets.
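
A rough sketch of that separation: the class keeps its symbolic elements, and the optimizer only asks each element for its ranges when deciding what to drop. Everything here is hypothetical (the `Item` enum, `ranges`, `is_covered_by`, `prune`), `\w` and `\d` are assumed to be ASCII-only, and only coverage by a single other element is detected, not coverage by a union of several.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Word,              // `\w`, assumed ASCII: [0-9A-Z_a-z]
    Digit,             // `\d`, assumed ASCII: [0-9]
    Range(char, char), // literal range; a single char is Range(c, c)
}

impl Item {
    /// Lowers the element to inclusive ranges, for analysis only.
    fn ranges(&self) -> Vec<(char, char)> {
        match *self {
            Item::Word => vec![('0', '9'), ('A', 'Z'), ('_', '_'), ('a', 'z')],
            Item::Digit => vec![('0', '9')],
            Item::Range(lo, hi) => vec![(lo, hi)],
        }
    }

    /// True if every range of `self` lies inside a single range of `other`.
    fn is_covered_by(&self, other: &Item) -> bool {
        self.ranges().iter().all(|&(lo, hi)| {
            other.ranges().iter().any(|&(olo, ohi)| olo <= lo && hi <= ohi)
        })
    }
}

/// Drops elements covered by another element; the survivors stay symbolic.
fn prune(items: &[Item]) -> Vec<Item> {
    let mut kept = Vec::new();
    for (i, item) in items.iter().enumerate() {
        let redundant = items.iter().enumerate().any(|(j, other)| {
            j != i
                && item.is_covered_by(other)
                // For mutually covering items (duplicates), keep the first one.
                && !(other.is_covered_by(item) && j > i)
        });
        if !redundant {
            kept.push(item.clone());
        }
    }
    kept
}
```

For `[Item::Word, Item::Range('a', 'c'), Item::Digit]` this returns `[Item::Word]`, so the output can still be printed as `\w` rather than as a list of ranges.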

Aloso · 2024-11-26T08:55:01Z

This has been implemented for the simple case (overlapping character ranges):

['a'-'f' 'b'-'g']  // --> [a-g]
['a'-'z' 'b']      // --> [a-z]

There's a special case to remove digit when word is present. Otherwise, character classes and Unicode properties aren't handled yet.
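
For readers curious what the simple case can look like, here is a sketch of merging literal ranges and printing the result. It is not taken from Pomsky's source: `merge_ranges` and `render` are hypothetical names, merging adjacent ranges (not just overlapping ones) is an assumption of this sketch, and `render` performs no escaping of special characters.

```rust
/// Sorts the ranges and merges those that overlap or touch.
fn merge_ranges(mut ranges: Vec<(char, char)>) -> Vec<(char, char)> {
    ranges.sort();
    let mut merged: Vec<(char, char)> = Vec::new();
    for (lo, hi) in ranges {
        if let Some(last) = merged.last_mut() {
            // Overlapping or directly adjacent to the previous range.
            if lo as u32 <= last.1 as u32 + 1 {
                last.1 = last.1.max(hi);
                continue;
            }
        }
        merged.push((lo, hi));
    }
    merged
}

/// Prints ranges as a regex character class (no escaping, sketch only).
fn render(ranges: &[(char, char)]) -> String {
    let mut out = String::from("[");
    for &(lo, hi) in ranges {
        out.push(lo);
        if lo != hi {
            out.push('-');
            out.push(hi);
        }
    }
    out.push(']');
    out
}
```

Applied to the two inputs above, `render(&merge_ranges(vec![('a', 'f'), ('b', 'g')]))` gives `[a-g]` and `render(&merge_ranges(vec![('a', 'z'), ('b', 'b')]))` gives `[a-z]`.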

RunDevelopment added the enhancement New feature or request label Dec 24, 2022

Aloso added the C-optimize Issue or feature request for an optimization label Dec 28, 2022

Aloso added this to the v0.10 milestone Jan 14, 2023

Aloso mentioned this issue Oct 10, 2023 in Optimizations (tracking issue) #95 (open)

Aloso modified the milestones: v0.10, v0.11 Oct 25, 2023
