-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use a script instead of submodules for grammars #1560
Conversation
a8dd602
to
5f9a75e
Compare
You'll also need to change the documentation that mentions the submodule system, such as README.md and the adding languages docs. |
Note: You can use |
Bonus points:
This would make it easier to build specific grammars and avoid issues with We could also turn the build.rs into a shell script, that would allow users to compile grammars without needing a Rust installation (all we're really doing is invoking the C compiler) |
There seem to be a few people that are in favor of this PR, so could someone help me understand? It seems like this approach is also:
Is this actually faster than In general, I like the idea of moving away from submodules because they can be a huge pain, but if the purpose is to improve flexibility, then why pin versions? It seems like we're just reimplementing git submodules here in bash. I understand the reasons in general for pinning versions, as it provides a reproducible set of deps that were tested against, but they naturally come with maintenance overhead. And will this really improve that? It seems to me like between flexibility and reproducibility, this is like the worst of both worlds. |
Rather than improve performance, the goal is to have the set of grammars easier to configure. Using submodules isn't scaling for us anymore: CI tests now frequently fail on fetching these submodules due to network issues because we have so many. With a script the user can:
I think most users already ignore the git submodule warnings. We frequently get questions on why
We're pinning versions because grammars often have breaking changes that break existing highlights. If they add new grammar items we also need to add them to our queries. |
We could make this a part of |
Does this approach help? Don't we have to clone just as many repos? Or is it that right now, we must clone everything, but this way we don't?
How will this work? Will the user edit the
Can't they do this now? If I'm not mistaken, users are currently able to do this by just building the .so, and putting that and the queries in the runtime folder, right? Would this change that? If you mean that the user workflow for adding a new language becomes just adding a line to This might be going off topic a bit, but this is also why I think having languages be plugins would be the best long term solution. The tree-sitter dependency management, build process, and helix tree-sitter queries can live completely out of tree. This is how vim, VSCode, and probably others do it.
Does how we fetch and manage the grammars have a bearing on how we build them? To be honest, I'm not really sure why the Rust toolchain is involved in them at all. Is that a vestige of when the grammars were statically linked into the hx binary?
Yeah I understand users will be users, but this means that even for devs who do pay attention, it now becomes much more easy to not realize that your grammars are out of date. There is nothing to tell you so after doing a git pull of the latest master, for example. Just all of a sudden the highlighting and tree-sitter navigation is broken and you have no idea it's because the grammar and queries were updated. Now a dev has to always think about updating their grammars, whereas now, git tells you when you have to.
That certainly makes sense, but then what kind of flexibility are we after? |
Don't get me wrong here, I hate git submodules with a burning passion, and I would love to get rid of them. I would not be loathe to work with bash instead of submodules. But as much as I hate them, they do provide some measure of benefit, and just based on the albeit limited experience I have with the repo so far, this particular way of replacing them seems like an inequitable trade-off, and it seems like there are other desires at play that have varying relevance to dependency management. |
You can modify the list of repositories before running the script, or use a custom list. We can also avoid git altogether by fetching
Right, but nobody actually does that since you need to figure out the correct compiler incantation to do so. Having a script for it would make this a viable option.
Neovim does, but it requires you to have a properly set up compiler so it can compile the grammars, there's some downsides there (as opposed to fetching them from a release). VSCode doesn't actually use tree-sitter. What we really want in the long term is tree-sitter/tree-sitter#930 stretch goal where grammars are
We could still hook into this in |
Hm, if we're editing the file that's checked in, this will cause probably more accidental commits than submodules. Maybe the script could check for a
Is this different than the reality now? If your grammars don't all have prebuilt binaries (do any of them?), then your machine must be capable of compiling them.
This would be nice, and very similar in spirit to the plugin system imagined for helix, but even when/if that lands, are we going to be able to expect all tree-sitter grammars from the community that we depend on to compile and distribute prebuilt wasm files? There are dozens. There isn't going to be a tree-sitter distribution platform. The more likely scenario is that many will still have to be compiled from source, right? I imagine the same will be true of the plugin system. |
The script ends up being about twice as fast as submodules, even with the I think you're right @dead10ck that this change doesn't really significantly improve the grammars situation, plus it's not great that it introduces a bespoke CLI and there's no git helper to tell you you're out of date. But this is meant to be a stepping stone for future improvements, namely the one people are most interested in afaict: being able to have a per-user subset of grammars. For that, I think having the script look for
Thanks for bringing this up, this will be a problem. There needs to be a way to specify a version of the queries used with a particular version of the grammar or else query analysis will fail when the grammar makes a breaking change to the set of nodes. The most minor "fix" to this would be to improve the error message that happens when query analysis fails and have it say that you need to I think a more maintainable solution would be to do what nvim-treesitter does and have the queries in a repository with the revisions, so when you install/update a language it pulls in a version of the grammar and a version of the queries that work together. This approach might also work well in a WASM-based future: we could have a github actions automation that compiles each language's grammar into the necessary WASM and then the user simply downloads the WASM and the queries (by hand, with a script, or maybe a built-in command). |
Alternatively, it wouldn't be much work to use something more formal like a CSV rather than this SSV (space separated values) document. Since there isn't much information to track here, I thought it'd be wise to keep the format as simple as possible
macos does not have `readarray`, so we need to do this with a regular array construction it's a bit less readable but it seems to work the same
11657d0
to
80a49fa
Compare
https://github.com/tree-sitter/tree-sitter-agda ca69cdf485e9ce2b2ef0991a720aa88d87d30231 | ||
https://github.com/tree-sitter/tree-sitter-bash a8eb5cb57c66f74c63ab950de081207cccf52017 | ||
https://github.com/tree-sitter/tree-sitter-c f05e279aedde06a25801c3f2b2cc8ac17fac52ae | ||
https://github.com/uyha/tree-sitter-cmake f6616f1e417ee8b62daf251aa1daa5d73781c596 | ||
https://github.com/stsewd/tree-sitter-comment 5dd3c62f1bbe378b220fe16b317b85247898639e | ||
https://github.com/tree-sitter/tree-sitter-cpp e8dcc9d2b404c542fd236ea5f7208f90be8a6e89 | ||
https://github.com/tree-sitter/tree-sitter-c-sharp 53a65a908167d6556e1fcdb67f1ee62aac101dda | ||
https://github.com/tree-sitter/tree-sitter-css 94e10230939e702b4fa3fa2cb5c3bc7173b95d07 | ||
https://github.com/UserNobody14/tree-sitter-dart 6a25376685d1d47968c2cef06d4db8d84a70025e | ||
https://github.com/camdencheek/tree-sitter-dockerfile 7af32bc04a66ab196f5b9f92ac471f29372ae2ce | ||
https://github.com/elixir-lang/tree-sitter-elixir f5d7bda543da788bd507b05bd722627dde66c9ec | ||
https://github.com/elm-tooling/tree-sitter-elm bd50ccf66b42c55252ac8efc1086af4ac6bab8cd | ||
https://github.com/ram02z/tree-sitter-fish 04e54ab6585dfd4fee6ddfe5849af56f101b6d4f | ||
https://github.com/the-mikedavis/tree-sitter-git-commit 066e395e1107df17183cf3ae4230f1a1406cc972 | ||
https://github.com/the-mikedavis/tree-sitter-git-config 0e4f0baf90b57e5aeb62dcdbf03062c6315d43ea | ||
https://github.com/the-mikedavis/tree-sitter-git-diff c12e6ecb54485f764250556ffd7ccb18f8e2942b | ||
https://github.com/the-mikedavis/tree-sitter-git-rebase 332dc528f27044bc4427024dbb33e6941fc131f2 | ||
https://github.com/theHamsta/tree-sitter-glsl 88408ffc5e27abcffced7010fc77396ae3636d7e | ||
https://github.com/tree-sitter/tree-sitter-go 0fa917a7022d1cd2e9b779a6a8fc5dc7fad69c75 | ||
https://github.com/bkegley/tree-sitter-graphql 5e66e961eee421786bdda8495ed1db045e06b5fe | ||
https://github.com/tree-sitter/tree-sitter-haskell b6ec26f181dd059eedd506fa5fbeae1b8e5556c8 | ||
https://github.com/tree-sitter/tree-sitter-html d93af487cc75120c89257195e6be46c999c6ba18 | ||
https://github.com/tree-sitter/tree-sitter-java bd6186c24d5eb13b4623efac9d944dcc095c0dad | ||
https://github.com/tree-sitter/tree-sitter-javascript 4a95461c4761c624f2263725aca79eeaefd36cad | ||
https://github.com/tree-sitter/tree-sitter-json 65bceef69c3b0f24c0b19ce67d79f57c96e90fcb | ||
https://github.com/tree-sitter/tree-sitter-julia 12ea597262125fc22fd2e91aa953ac69b19c26ca | ||
https://github.com/latex-lsp/tree-sitter-latex 7f720661de5316c0f8fee956526d4002fa1086d8 | ||
https://github.com/Julian/tree-sitter-lean d98426109258b266e1e92358c5f11716d2e8f638 | ||
https://github.com/cbarrete/tree-sitter-ledger 0cdeb0e51411a3ba5493662952c3039de08939ca | ||
https://github.com/benwilliamgraham/tree-sitter-llvm 3b213925b9c4f42c1acfe2e10bfbb438d9c6834d | ||
https://github.com/Flakebi/tree-sitter-llvm-mir 06fabca19454b2dc00c1b211a7cb7ad0bc2585f1 | ||
https://github.com/nvim-treesitter/tree-sitter-lua 6f5d40190ec8a0aa8c8410699353d820f4f7d7a6 | ||
https://github.com/alemuller/tree-sitter-make a4b9187417d6be349ee5fd4b6e77b4172c6827dd | ||
https://github.com/MDeiml/tree-sitter-markdown ad8c32917a16dfbb387d1da567bf0c3fb6fffde2 | ||
https://github.com/cstrahan/tree-sitter-nix 50f38ceab667f9d482640edfee803d74f4edeba5 | ||
https://github.com/tree-sitter/tree-sitter-ocaml 23d419ba45789c5a47d31448061557716b02750a | ||
https://github.com/ganezdragon/tree-sitter-perl 0ac2c6da562c7a2c26ed7e8691d4a590f7e8b90a | ||
https://github.com/tree-sitter/tree-sitter-php 57f855461aeeca73bd4218754fb26b5ac143f98f | ||
https://github.com/yusdacra/tree-sitter-protobuf 19c211a01434d9f03efff99f85e19f967591b175 | ||
https://github.com/tree-sitter/tree-sitter-python d6210ceab11e8d812d4ab59c07c81458ec6e5184 | ||
https://github.com/tree-sitter/tree-sitter-regex e1cfca3c79896ff79842f057ea13e529b66af636 | ||
https://github.com/tree-sitter/tree-sitter-ruby dfff673b41df7fadcbb609c6338f38da3cdd018e | ||
https://github.com/tree-sitter/tree-sitter-rust a360da0a29a19c281d08295a35ecd0544d2da211 | ||
https://github.com/tree-sitter/tree-sitter-scala 0a3dd53a7fc4b352a538397d054380aaa28be54c | ||
https://github.com/Himujjal/tree-sitter-svelte 349a5984513b4a4a9e143a6e746120c6ff6cf6ed | ||
https://github.com/tree-sitter/tree-sitter-swift a22fa5e19bae50098e2252ea96cba3aba43f4c58 | ||
https://github.com/Flakebi/tree-sitter-tablegen 568dd8a937347175fd58db83d4c4cdaeb6069bd2 | ||
https://github.com/ikatyang/tree-sitter-toml 7cff70bbcbbc62001b465603ca1ea88edd668704 | ||
https://github.com/tree-sitter/tree-sitter-tsq b665659d3238e6036e22ed0e24935e60efb39415 | ||
https://github.com/tree-sitter/tree-sitter-typescript 3e897ea5925f037cfae2e551f8e6b12eec2a201a | ||
https://github.com/ikatyang/tree-sitter-vue 91fe2754796cd8fba5f229505a23fa08f3546c06 | ||
https://github.com/szebniok/tree-sitter-wgsl f00ff52251edbd58f4d39c9c3204383253032c11 | ||
https://github.com/ikatyang/tree-sitter-yaml 0e36bed171768908f331ff7dff9d956bae016efb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this just introduce an additional step to manually edit this file and put in the submodule? I think this may be harder for new contributors to use rather than just git submodule add
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding it here removes any submodule steps, so no need to git submodule add
and set it as shallow. It's not really an improvement though: you end up still needing to ./scripts/grammars sync
once you add it to the file, so it ends up being the same number of steps to add a new grammar.
One area where it helps a lot is when a grammar changes remotes (like #830). The script handles that just with an update to revisions.txt but with submodules it's a total nightmare, you have to go edit multiple files under .git/
to fix the submodule. This is usually only relevant if you need to fork a grammar and make some changes when adding it, I don't know how common that is for others 🙃
The main thing the script improves is being able to have a per-user subset of grammars (albeit it's not implemented in this branch), which I can't think of a way to do well with submodules.
This is not usually necessary because these environment variables are set by default in a build script.
I ported the helix-syntax/build.rs builder to an xtask instead of adding a Still to do: update the flake to read revisions.txt and fetch the grammars. |
Here's how the code looks inside the tree-sitter CLI if it helps: https://github.com/tree-sitter/tree-sitter/blob/15860392c5952e3ee6f93524fc75de5ef14493c0/cli/loader/src/lib.rs#L301-L444 It seems to assume a certain structure, but I'm guessing that doesn't work for everything. |
I think tree-sitter ends up consulting a grammar repository's package.json to detect repositories that have grammars in subdirectories (see here), and I think in helix that package.json info ends up rewritten in languages.toml. Maybe it's better to have the revisions and remotes that are currently in revisions.txt be in languages.toml instead, so all of your language configuration goes in one place? Writing a toml parser in bash sounds like a fun project but writing a toml parser in windows batch script would probably drive me crazy 😅: maybe the fetching part of this script should be an xtask as well? It seems a bit wasteful to use xtask for basically shelling out to git, but I think it'd be easier to maintain cross-platform. Plus it'd be reusable when/if adding this functionality to helix runtime or making a new CLI to do it. What do you think? |
build_grammars adapts the functionality that previously came from helix-syntax to be used at runtime from the command line flags. fetch_grammars wraps command-line git to perform the same actions previously done in the scripts in helix-editor#1560.
build_grammars adapts the functionality that previously came from helix-syntax to be used at runtime from the command line flags. fetch_grammars wraps command-line git to perform the same actions previously done in the scripts in helix-editor#1560.
build_grammars adapts the functionality that previously came from helix-syntax to be used at runtime from the command line flags. fetch_grammars wraps command-line git to perform the same actions previously done in the scripts in helix-editor#1560.
build_grammars adapts the functionality that previously came from helix-syntax to be used at runtime from the command line flags. fetch_grammars wraps command-line git to perform the same actions previously done in the scripts in helix-editor#1560.
build_grammars adapts the functionality that previously came from helix-syntax to be used at runtime from the command line flags. fetch_grammars wraps command-line git to perform the same actions previously done in the scripts in helix-editor#1560.
build_grammars adapts the functionality that previously came from helix-syntax to be used at runtime from the command line flags. fetch_grammars wraps command-line git to perform the same actions previously done in the scripts in #1560.
closes #1549
there's still a bunch to do before this is ready:
grammars.cmd
for windowsfall back to a regular clone when the remote doesn't support cloning at exact revisions(I believe most/all popular hosted git remotes support fetch-at-revision these days)check
/status
command that checks if you're currently syncedafetch
command that downloads a new grammar repository (maybe this isn't necessary, you can always justgit clone
)a(eh, I think it makes sense to download new grammars using the revisions.txt file declaratively)save
command that updatesrevisions.txt
with your current local grammar set (useful for adding new grammars)But the results are already pretty nice. By taking advantage of shallow fetches at exact revisions, we get down to ~30s to fetch all grammars and ~1.5s to re-run
sync
(idempotently). For perspective:git submodules update --init
takes ~250s (a little over 4 minutes)git submodule update --init --depth 1
takes ~80sIdeally I'd like this script to allow one to hack on a grammar as they're integrating it, so it's meant to not care about other remotes and to fail when checking out a revision that would overwrite local modifications.