Request: Official Tree-Sitter grammar #564

Open
JonGretar opened this issue Oct 7, 2024 · 8 comments
Labels
enhancement New feature or request


@JonGretar

There are a few attempts at a Tree-sitter grammar for Quarto out there, but they tend to fall short or are barely usable.

Most major editors now support Tree-sitter grammars, some of them exclusively.

An official grammar would make supporting Quarto in Zed much easier.

@cscheid
Contributor

cscheid commented Oct 7, 2024

The fundamental complication is that Quarto allows code cells that should be parsed with a different grammar, and we don't control ahead of time which language will be allowed there. I'm not sure how to do that within tree-sitter outside of being able to dynamically generate tree-sitter files, or providing a (massive) tree-sitter definition that includes "most" languages someone might be interested in.

With that said, it's probably not too hard for us to provide a Quarto tree-sitter grammar for the Markdown parts of Quarto as a small set of extensions over CommonMark.

In practice, the syntax highlighting in VS Code, Emacs and RStudio is done with help from the editor. In VS Code, we use virtual docs. In Emacs, we use polymodes (I'm not sure about RStudio).

I think a good editor experience will necessarily require help from Zed here.

@cscheid cscheid added the enhancement New feature or request label Oct 7, 2024
@cscheid cscheid changed the title Official Tree-Sitter grammar Request: Official Tree-Sitter grammar Oct 7, 2024
@JonGretar
Author

Hi there.

I may be wrong, but I believe that tree-sitter and the editors built on it include the concept of injections. That is, the grammar simply says "this block here is Python" and the editor then uses its Python grammar. It then falls on the user to have installed the plugins for R, Python, and YAML. There is a bit more about it in this link

Tree-sitter also supports extending other grammars. That is, you can take the Markdown grammar and extend it, defining only what you add. How well that works, I do not know.
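
To make the injection idea concrete, here is a minimal sketch in plain Python rather than tree-sitter's own machinery. It assumes the grammar only reports a language tag for a block and that the editor looks that tag up among whatever parsers the user has installed. The `INSTALLED` registry and `highlight_block` function are hypothetical names used for illustration; they are not part of tree-sitter, Zed, or Quarto.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a parser/highlighter provided by an installed extension.
Highlighter = Callable[[str], str]

# Hypothetical registry: language name -> highlighter. In a real editor this
# would be populated by whichever language extensions the user has installed.
INSTALLED: Dict[str, Highlighter] = {
    "python": lambda src: f"<python>{src}</python>",
    "r": lambda src: f"<r>{src}</r>",
}


def highlight_block(language_tag: str, source: str) -> str:
    """Use the installed parser for language_tag, or leave the text untouched."""
    highlighter: Optional[Highlighter] = INSTALLED.get(language_tag.strip().lower())
    if highlighter is None:
        # No extension installed for this language: fall back to plain text.
        return source
    return highlighter(source)
```

The point of the sketch is that the host document's grammar never needs a list of supported languages; an unknown tag simply falls through to plain text.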

@cscheid
Contributor

cscheid commented Oct 8, 2024

That is, the grammar simply says "this block here is Python" and the editor then uses its Python grammar

Yes, the problem is that we don't know ahead of time which language it is, so a single .qmd tree-sitter grammar will have to decide, ahead of time, which languages it supports, and that's not how Quarto works.

@JonGretar
Author

JonGretar commented Oct 8, 2024

Yes, the problem is that we don't know ahead of time which language it is, so a single .qmd tree-sitter grammar will have to decide, ahead of time, which languages it supports, and that's not how Quarto works.

Apologies if that came off as terse. I am grateful for a great product, and I was a bit unclear.

I am 90% sure that the tree-sitter grammar for regular Markdown does not specify a list of languages it supports for the code blocks.
Since that information is included at the beginning of the code block, it just forwards whatever text is there as the language. In fact, that part already works: in Zed I have configured the editor to open qmd files simply as Markdown, and it works with any language I throw at it. If the language is REPL-supported, it also runs that code block.

And since tree-sitter is extensible, it should be possible to npm-require the Markdown grammar and mark it as the parent, then add only the features with which Quarto extends Markdown, such as the div syntax and the modifications to img tags. You would not need to worry about supporting code blocks for all languages, as that part is already done.

Again. Thank you for an amazing product. I especially loved seeing the Typst support added a few releases back.

@cscheid
Contributor

cscheid commented Oct 8, 2024

My responses are not meant to imply we won't do it! Rather, I'm attempting to scope the problem.

I am 90% sure that the tree-sitter grammar for regular Markdown does not specify a list of languages it supports for the code blocks.
Since that information is included at the beginning of the code block, it just forwards whatever text is there as the language. In fact, that part already works: in Zed I have configured the editor to open qmd files simply as Markdown, and it works with any language I throw at it. If the language is REPL-supported, it also runs that code block.

I think, but again am not sure, that in that case it's not the tree-sitter grammar itself that's doing the forwarding work, but something in Zed controlling the "grammar injection". This is what I meant by "I think a good editor experience will necessarily require help from Zed here." Otherwise, where's the knowledge that ```{ojs} code blocks refer to a specific dialect of JavaScript, etc.?

In any case, I think we'd start by finding a Commonmark grammar and extending it. We have internal reasons to want to do so, but it's not going to happen in the immediate future.

@JonGretar
Author

I think, but again am not sure, that in that case it's not the tree-sitter grammar itself that's doing the forwarding work, but something in Zed controlling the "grammar injection".

As I understand it, it's both. Injections are a tree-sitter instruction that tells the editor: "find a parser that handles this language and parse this block with it". The editor still needs to have said parser, i.e. the user would have had to install that extension. I guess it's a mode command, so in a way it is the editor. But there is nothing Zed-specific about it, and Neovim probably handles things in the same manner.

Otherwise, where's the knowledge that ```{ojs} code blocks refer to a specific dialect of JavaScript, etc.?

I had not thought of that. Yes, in the case of ojs, the JavaScript extension would have to declare that it also handles ojs. Or someone would have to make an ojs extension that is simply a dumb child of JavaScript. Yes, ojs specifically would be a headache.
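
One way to picture the aliasing problem is a small lookup that turns a fence's info string into the language name an editor would search for. The sketch below is plain Python, not tree-sitter or Zed code. The `ALIASES` table and the `injection_language` function are hypothetical, and mapping ojs to plain JavaScript is an assumption that only approximates the ojs dialect.

```python
import re
from typing import Dict, Optional

# Assumed alias table: treating ojs as plain JavaScript is only an approximation.
ALIASES: Dict[str, str] = {"ojs": "javascript"}

# Accept both Markdown-style ("python") and Quarto-style ("{python}") info
# strings; anything after the language name (cell options, etc.) is ignored.
_INFO = re.compile(r"^\{?\s*([A-Za-z0-9_+-]+)")


def injection_language(info_string: str) -> Optional[str]:
    """Return the language name to look up, or None if no name is present."""
    match = _INFO.match(info_string.strip())
    if match is None:
        return None
    name = match.group(1).lower()
    return ALIASES.get(name, name)
```

Whether such a table lives in the grammar's queries, in a separate ojs extension, or in the editor is exactly the open question in this thread; the sketch only shows that somebody has to own it.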

In any case, I think we'd start by finding a Commonmark grammar and extending it. We have internal reasons to want to do so, but it's not going to happen in the immediate future.

I understand. As it's still pretty early days for the REPL in Zed, there is no rush from my perspective. I mostly still work in Neovim when it comes to data science (VS Code does not like batteries) but every now and then I open.

@DavisVaughan
Collaborator

DavisVaughan commented Oct 22, 2024

@cscheid you can think of the injection as being a two-step process. The Quarto tree-sitter grammar would parse the file to the best of its abilities, recognizing fenced_code_block nodes and marking the language within them (just like the Markdown grammar, honestly) so that other people can target those nodes of the tree easily.

Zed would take the resulting tree and do a second pass over certain sections of it. It would use an injection "query" like this one to find all of the fenced code blocks in the document:
https://github.com/tree-sitter-grammars/tree-sitter-markdown/blob/5cdc549ab8f461aff876c5be9741027189299cec/tree-sitter-markdown/queries/injections.scm#L1-L4

It uses the @injection.language capture as a dynamic way to set the language on the fly, based on the code blocks it finds. It would reparse those sections according to the injection language, and the final result is something Zed could use for per-chunk syntax highlighting or "running" / REPL capabilities. Ultimately I don't think the Quarto tree-sitter grammar would be in charge of doing too much except writing the queries/injections.scm file (that's a standard file across tree-sitter grammars).


As a side note, tree-sitter supports both dynamic and static language detection:

; this one is dynamic: the language is read from the info string
(fenced_code_block
  (info_string
    (language) @injection.language)
  (code_fence_content) @injection.content)

; this one is static: the language is hard-coded with #set!
((html_block) @injection.content (#set! injection.language "html"))
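
The first pass of the two-step process described above can be imitated in a few lines of Python. This is only a sketch: a simplified regular expression stands in for the grammar, and it is not how tree-sitter or Zed implement injections. The `Injection` tuple and `find_injections` function are hypothetical names whose fields mirror the `@injection.language` and `@injection.content` captures of the dynamic query.

```python
import re
from typing import List, NamedTuple


class Injection(NamedTuple):
    language: str  # plays the role of @injection.language
    content: str   # plays the role of @injection.content


TICKS = "`" * 3  # the three-backtick fence marker

# Simplified fence matcher: opening fence, optional "{", a language name,
# the rest of the info line, the body, and a closing fence on its own line.
_FENCE = re.compile(
    "^" + TICKS + r"[ \t]*\{?([A-Za-z0-9_+-]+)[^\n]*\n(.*?)^" + TICKS + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def find_injections(document: str) -> List[Injection]:
    """First pass: locate fenced code blocks and record language and content."""
    return [
        Injection(m.group(1).lower(), m.group(2))
        for m in _FENCE.finditer(document)
    ]
```

A second pass would then hand each `content` to whichever parser is registered for its `language`, which is the part that stays with the editor.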

@cscheid
Contributor

cscheid commented Oct 22, 2024

@DavisVaughan Thanks, this is very helpful. (I'm going to follow up with you at some point in 2025 when we have the cycles to do this.)

3 participants