25,000 context limit is too low #1687

Closed
ryboe opened this issue Aug 13, 2018 · 13 comments


@ryboe
Contributor

ryboe commented Aug 13, 2018

I'm the author of the CSS3 package. Since the 25,000-context "sanity limit" was added to Sublime, my users have been hitting this error whenever they edit an HTML or PHP file. I've only recently had the time to investigate.
[screenshot: Sublime Text error reporting that the 25,000 context sanity limit was exceeded]
25,000 seems like an insanely high number of contexts on the stack. I believe the assumption was that the only way to hit this limit was with an import loop (infinite recursion). CSS3 can hit this limit without any recursion. It is a total beast! It contains individual regexes for all 400+ CSS properties. I made this design choice because I thought it would be cool if only valid values were highlighted for every property.

In CSS3, there's a properties context which contains contexts for every CSS property. This is the context that exceeds the 25,000 limit. As a mitigation, I've created a properties-minimal context which removes ~100 lesser-used CSS properties. These will not be highlighted when you're editing CSS inside a <style> tag in an HTML document, for example.
Even with properties-minimal, my users are still hitting the limit if they have lots of 3rd party packages installed.
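
For a rough idea of the structure (context and scope names below are illustrative, not copied verbatim from CSS3), the properties context is essentially one include per property, with each property pushing a context that matches only its valid values:

contexts:
  properties:
    - include: property-width
    - include: property-color
    # ... one include per property, 400+ in total

  property-width:
    - match: \bwidth\b
      scope: support.type.property-name.css
      push: width-values

  width-values:
    - match: \b(?:auto|min-content|max-content)\b
      scope: support.constant.property-value.css
    - match: ;
      scope: punctuation.terminator.rule.css
      pop: true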

I'm guessing the 25,000 number was chosen arbitrarily. Despite the admittedly crazy number of contexts that CSS3 loads, my users have never reported any performance problems. That's a testament to how fast Sublime is.

I'd like to ask about the possibility of raising the limit to 35,000 or higher. Based on the lack of performance issues with CSS3 before the limit was added, I suspect it could be raised without causing any problems. What do you think?

This would resolve #1665.

@wbond
Member

wbond commented Aug 13, 2018

The 25,000 context sanity limit has been present since .sublime-syntax was introduced. It isn't really new. The issue, I believe, is that the default syntaxes have been changing as they have been improving, and the number of contexts has gone up.

Even assuming we do increase the limit, the increase would go into the dev cycle first, and then into a stable build. This would likely mean a number of months before any significant number of users received the new, increased limit.

For now, it would seem you'll have to do some streamlining on the syntax to prevent the issue. It seems you've identified the pattern in the syntax causing the issue. Perhaps it would be possible to use something like YAML Macros by @Thom1729 to generate the final syntax from your preferred structure.

Also, to properly track this feature suggestion, it should be posted at https://github.com/SublimeTextIssues/Core/issues. There may already be an issue there for this, or a related suggestion.

@Thom1729
Collaborator

Why is with_prototype needed? The core HTML syntax should be using embed/escape, not with_prototype.

@deathaxe
Collaborator

@Thom1729: Where does the HTML syntax use with_prototype? If you mean the PHP syntax, which includes HTML, the with_prototype there is used to inject the php tags into each of HTML's contexts.
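
For reference, a minimal sketch of the two mechanisms (file paths, context names, and patterns are illustrative, not copied from the shipped syntaxes):

# embed/escape: hands off to the embedded syntax
# without copying its contexts
- match: <style[^>]*>
  embed: scope:source.css
  escape: </style>

# with_prototype: every context of the pushed syntax is copied,
# with these patterns placed in front
- match: ''
  push: Packages/HTML/HTML.sublime-syntax
  with_prototype:
    - match: <\?(?:php)?
      push: php-code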

@y0ssar1an: Even though the performance hit might not be too obvious on modern systems, the RAM usage is. I hate it when a text editor needs 1 GB of RAM just to edit a file that is a few KB in size. The more complex syntaxes cause ST's RAM usage to grow significantly, as all the contexts need to be stored somewhere.

I am with @wbond on the question of increasing that limit. I think hitting this limit shows that the complexity of the syntax definitions has reached a level where we need to rethink the whole strategy. While we are nominally talking about syntax highlighting, most syntaxes go far beyond that today, trying to be a kind of static compiler that catches every language detail rather than concentrating on highlighting. This enables all the goto features, and maybe more in the future, but it comes with the negative side effects we are seeing. I wonder how many meta scopes are not used by any plugin at all. Don't misunderstand me: I really like the abilities ST offers for creating syntax definitions - they were one major reason for me to give it a try.

Maybe I am wrong, but I think this level of complexity, and the way different syntaxes embed/include/extend each other, was not part of the initial design of the sublime-syntax tokenizer. That goes especially for with_prototype, which seems to create many more copies of existing contexts.

Maybe some

@rwols
Contributor

rwols commented Aug 13, 2018

1 GB of RAM just to edit a file that is a few KB in size.

which seems to create many more copies of existing contexts.

I would think both of these problems could be solved by using pointers to contexts instead of copying the contexts around. Though one would need to implement a proper cycle detector.

@deathaxe
Collaborator

Exactly what I had in mind.

@wbond
Member

wbond commented Aug 14, 2018

Contexts are copied when they are modified. Trying to use pointers with with_prototype would destroy performance because N regex sets would need to be run on the text instead of 1.

@Thom1729
Collaborator

The performance concern makes sense, but in practice, wouldn't N usually be 1, less often be 2, and only be more than that if there were multiple with_prototypes? With the introduction of embed/escape, that should hopefully be uncommon.

Also, how does escape work? I imagined that each escape would already be adding its own (singleton) regex set.

@deathaxe
Collaborator

Performance is something only the core devs can judge, as they know the internals. At least one additional layer would be needed to handle the pointer-based regex sets.

In the end, the number of regex patterns being evaluated shouldn't change. It's "just" a question of creating a copy of the whole context and adding the with_prototype patterns in front of that list vs. reusing the existing context and somehow linking in the additional with_prototype patterns to be processed first (or vice versa). But sure, this needs more management, might impact overall performance, and may make things even more complicated to handle.

with_prototype is often used to inject contexts into an existing language to extend its scope, and it is one of the most common causes of a drastically increased number of contexts. Think of languages like PHP, which build on HTML and use with_prototype to inject the <?php ... tags. This use case wouldn't work with embed, so it might not be too uncommon.

I wonder whether we could find a more efficient way to extend existing syntaxes, especially in cases where an injected context doesn't need to be part of every underlying context. Not an easy question, and not one I have an answer for either.

Another idea to reduce the number of contexts needed in existing syntaxes would be to at least make \s look ahead to the next line(s). Many syntaxes work around the single-line limitation of match patterns by splitting quite simple regexes into several contexts in order to support multiline statements correctly.

If the following pattern

  - match: (\w+)\s*(=)\s*(\w+)

could match

   key
   =
   value

things would become much easier and help reduce contexts, I think.

Today we need something like

- match: \w+
  push:
    - match: =
      set:
        - match: \w+
          pop: true
        - match: (?=\S)
          pop: true
    - match: (?=\S)
      pop: true

to do the same thing.

@Thom1729
Collaborator

In the end, the number of regex patterns being evaluated shouldn't change. It's "just" a question of creating a copy of the whole context and adding the with_prototype patterns in front of that list vs. reusing the existing context and somehow linking in the additional with_prototype patterns to be processed first (or vice versa).

As I understand it, all regexps for a context are pre-compiled into a single expression. If with_prototype rules were not compiled into the other context, then the engine would need to run two regexps instead of one.
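
To illustrate that assumption (the alternation shown is a guess at the general approach, not the actual compiled form), a context such as

  - match: \bif\b
  - match: \belse\b

would compile to roughly a single alternation, (?:\bif\b)|(?:\belse\b), so the engine runs one combined regex per context rather than one per rule. Prepending with_prototype rules adds branches to that same compiled expression, which is why they cannot simply be pointed to from elsewhere.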

  - match: (\w+)\s*(=)\s*(\w+)

The \s* is insufficient in most languages because of comments. Generally, you would want to match:

foo // This is a variable
=
/*
not a value
*/
value

Even notwithstanding the lookahead issue, I've found using several contexts to be more reliable and more flexible than relying upon capture groups.

@wbond
Member

wbond commented Aug 14, 2018

I just saw your previous comment @Thom1729:

The performance concern makes sense, but in practice, wouldn't N usually be 1, less often be 2, and only be more than that if there were multiple with_prototypes? With the introduction of embed/escape, that should hopefully be uncommon.

Well, I think the issue is that all regex sets would have to be run, and the results of each checked and compared to see which matches soonest. Our current implementation localizes everything in a context into a single contiguous chunk of memory, so with multiple regex sets we'd likely be spreading the necessary data for the next check out over different parts of memory.

It is true that in many cases N would be 1, but there are plenty of places where it wouldn't be. Ultimately, I think big factors in experimenting with something like this would be the complexity of the implementation and the time it would take.

Also, how does escape work? I imagined that each escape would already be adding its own (singleton) regex set.

Escape is run once per line, and effectively bounds the lexing to the escape point. So it has a different impact than splitting context lexing into multiple regex sets.

@Thom1729
Collaborator

Our current implementation localizes everything in a context into a single contiguous chunk of memory, so with multiple regex sets we'd likely be spreading the necessary data for the next check out over different parts of memory.

That hadn't occurred to me. I can see how this could have a real performance impact even when N=2.

Thanks for the insight!

@deathaxe
Collaborator

Even notwithstanding the lookahead issue, I've found using several contexts to be more reliable and more flexible than relying upon capture groups.

... and often faster.

Thanks @wbond for the clarification about what a set means. I had something like an array/list of rules in mind, processed one after another. Knowing that a set is a single compiled regex, I agree with you on both points - complexity and performance.

Some debugging feature to see which syntaxes or constructs are most responsible for the growing number of contexts might help. Even though many syntaxes have become more complex, it isn't obvious to me why and where they hit the 25,000-context limit. Most of them only have a few hundred contexts. Most of the impact seems to come from syntaxes including each other.

@ryboe
Contributor Author

ryboe commented Aug 19, 2018

Thank you for the enlightening discussion. It sounds like raising the limit would be a mistake. I'll look for ways to streamline CSS3 rather than contribute to increasing the memory usage. Thanks again!
