Reading Spago workspace configuration hangs "forever" #1140

Closed
wclr opened this issue Dec 15, 2023 · 17 comments

Comments

@wclr
Contributor

wclr commented Dec 15, 2023

I have a big (?) project, though not actually that big. Spago hangs at the "Reading Spago workspace configuration..." message (for ages? - I could not wait it out). After I added the dirs with a lot of content to the ignore pattern (in the bundled code), it passes.

So, I think that this step of globbing through the whole directory content to find spago.yaml files may be a bit excessive? I would prefer this to be configurable, so that no time is spent searching, especially in unknown environments.

@f-f
Member

f-f commented Dec 15, 2023

I think it'd be worth investigating more about your current setup before going for a configurable option - after all Spago is certainly not the first tool attempting this, e.g. Bazel works in the same way, and it handles Google-sized repos so I'm certain it's possible to make it work.

A few questions:

  1. are you running the latest release? If not I'd suggest upgrading and trying again, there are a lot of speed improvements in there
  2. why is there so much content inside the repo?
  3. how long does it take exactly? It'd be useful to understand if we're talking about a minute or an hour
  4. adding the -v flag to the spago command includes timing info, it would be useful to have a look at that to see where the time is being spent

@wclr
Contributor Author

wclr commented Dec 15, 2023

  1. Yes
  2. It is the usual source files and working dev content: node_modules, purs output, .git, .psci_modules, .tmp (some temporary stuff there), TypeScript sources, and other related stuff.
  3. If I ignore nothing, I can wait for several minutes and it still doesn't finish.

Well, I managed to reduce it by adding **/node_modules/** and also a dir that contains external packages (.yalc dir - https://github.com/wclr/yalc). Now it takes about 200 ms. That is fine. There should be a way to configure such ignores at least, but I would prefer the package manager not to spend excessive time on each run (the install command?) traversing parts of the fs that the project does not need. So I believe that if this is the default behaviour, it should be configurable.

@JordanMartinez
Contributor

Could you provide a reproducible repo? The details provided thus far aren't very helpful for tracking down the source of this problem (if any). As an example, what is the output of spago build --verbose? That would give us a lot more context as to what is going on here.

@f-f
Member

f-f commented Dec 15, 2023

There should be a way to configure such ignores at least

Spago will ignore the same things as git so if it's in your .gitignore it should be fine.

It would be helpful to include a reproducible repo as Jordan mentioned, or at least the log of an invocation of spago build -v so we can see some timings.

@wclr
Contributor Author

wclr commented Dec 15, 2023

I'm not sure what you mean by reproducible, I have a project with quite a bit of node dependencies, also the reason may be in using PNPM as a package manager that uses symlinks. Why should spago even search for packages inside node_modules?

@JordanMartinez
Contributor

I'm not sure what you mean by reproducible, I have a project with quite a bit of node dependencies,

I mean something I can git clone and try out, of course. But if you are not able to provide that information, then at least spago build --verbose would be helpful. We log a lot of info here to help troubleshoot problems like this. By not providing this information (yet), it's hard to know more about the repo in general and how long specific things are taking. That output may give us a general idea of where in the code to even start looking.

@wclr
Contributor Author

wclr commented Dec 15, 2023

I think the problem is quite clear: the project dir may contain some "unexpected" content that is not friendly to fast traversal, and it is there for whatever reason the user needs it. So you ask the user: why do you have it there? The user answers: why not, I need it here, at least for now. What is the reason not to have it? Should projects that want to have spago.yaml in the root be subject to this kind of limitation?

@f-f
Member

f-f commented Dec 15, 2023

@wclr you opened a ticket because you think you encountered an issue with this software - that's good and I appreciate that.
Now, me and Jordan would like to help diagnose what the issue is, because we are at a stage where we don't know if:

  1. the software is behaving as it's expected to behave (and your setup doesn't fall into the list of supported use cases)
  2. or the software has a problem, and it needs to be patched.

We don't know. So in order to figure out what we're dealing with here we would either need:

  1. a repository that we can clone, and verify ourselves that Spago is taking a long time, and eventually inspect what's happening with it
  2. or if that's not possible, at least a log of the output of spago build -v, so we can look at the debug info

I'm afraid that without either of these things there's not much we can do.

@wclr
Contributor Author

wclr commented Dec 16, 2023

I am just not sure how to implement a viable repro here. Should I recreate the structure of my real-life project, with all the deps there? I don't think that is a good or even possible option. Even if it doesn't "hang forever" but takes several seconds on this step, that is absolutely not nice.

What problem are you expecting to discover here? Globbing, like any massive fs operation, can be quite expensive depending on the content, esp. if you run on a not very fast fs.

Why is an ignore or include pattern needed?

  1. There can obviously be issues with some content that is not friendly to fast globbing and will cause unnecessary delay, or maybe some other issues.
  2. Just to make spago run efficiently, so it would not take effort each time searching where there is no need to search (let the user decide).

From bazel docs:

https://bazel.build/run/bazelrc

.bazelignore
You can specify directories within the workspace that you want Bazel to ignore, such as related projects that use other build systems. Place a file called .bazelignore at the root of the workspace and add the directories you want Bazel to ignore, one per line. Entries are relative to the workspace root.

On the other issue you mentioned:

Sure, and we have encountered #1100 or #918 in the past, and fixed them.

You didn't exclude ignored locations from globbing (though you do exclude them later by analyzing the results in the code), so it is not a very performant solution.

@f-f
Member

f-f commented Dec 16, 2023

I am just not sure how to implement a viable repro here. Should I recreate the structure of my real-life project, with all the deps there?

Sure. It doesn't have to be exactly as your current project, it can approximate it as long as it reliably demonstrates the problem. A few seconds of runtime are normal (the compiler takes as much to startup on big projects), a minute starts to become problematic.

What problem are you expecting to discover here?

If there's a problem I would like to explore the failure condition, instead of just implementing whatever suggested solution without much thinking.

@wclr
Contributor Author

wclr commented Dec 18, 2023

If there's a problem I would like to explore the failure condition, instead of just implementing whatever suggested solution without much thinking.

I can tell you the problem: globbing is not always cheap. The solution is to make finding the needed files as cheap as possible. As I pointed out above, the fix from issue #918, which excludes gitignore patterns only after globbing, is not the most efficient one.

This capability is not implemented in fast-glob, but there is an issue for it: mrmlnc/fast-glob#265. The solution proposed there is to use @nodelib/fs.walk as a faster alternative for dynamic filtering.

There is also an issue discussing glob performance problems, sindresorhus/globby#50. It states:

The most important thing is that we skip ignored directories like node_modules as early as possible. It's the source of a lot of performance problems.

@JordanMartinez
Contributor

Rereading this thread, it sounds like @wclr has a repo with top-level folders that have a lot of content within them. And since AFAIK Spago doesn't assume that any subpackages are only one level deep (e.g. ./subpackage/spago.yaml), the assumption is that Spago is walking the entire file tree to find spago.yaml files, including in directories the user knows do not need to be examined (e.g. Spago: "Is there a spago.yaml file in node_modules/some-package/that/has/a/deep/directory/structure?").

@wclr, is that correct? And arguably, if we wrote a script that produced a whole lot of such directories and subdirectories, we'd be able to produce this "Reading Spago workspace configuration..." hang issue that you've described?
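
A minimal sketch of such a script, written in Python rather than taken from Spago itself. The function names, directory names, and sizes are all hypothetical, and the spago.yaml files are empty placeholders, since only their location matters for the discovery step being discussed:

```python
import os
import tempfile


def make_synthetic_repo(root, packages=3, noise_dirs=50, depth=4, files_per_dir=5):
    """Create a fake workspace: a few spago.yaml files plus a bulky node_modules tree."""
    # Placeholder config at the workspace root (contents are not a real Spago config).
    with open(os.path.join(root, "spago.yaml"), "w") as f:
        f.write("# placeholder\n")
    # A handful of nested packages, each with its own placeholder config.
    for i in range(packages):
        pkg = os.path.join(root, "packages", f"pkg-{i}")
        os.makedirs(pkg, exist_ok=True)
        with open(os.path.join(pkg, "spago.yaml"), "w") as f:
            f.write("# placeholder\n")
    # Unrelated content: deep chains of directories, each holding a few files.
    for i in range(noise_dirs):
        d = os.path.join(root, "node_modules", f"dep-{i}")
        for level in range(depth):
            d = os.path.join(d, f"level-{level}")
            os.makedirs(d, exist_ok=True)
            for j in range(files_per_dir):
                with open(os.path.join(d, f"file-{j}.js"), "w") as f:
                    f.write("// noise\n")
    return root


def count_entries(root):
    """Return (directories, files) found by a full walk of root."""
    dirs = files = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        dirs += 1
        files += len(filenames)
    return dirs, files


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_synthetic_repo(tmp)
        print("directories, files:", count_entries(tmp))
```

Raising noise_dirs and depth gives a tree that a full traversal has to grind through, which is the condition a reproduction would need.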

@wclr
Contributor Author

wclr commented Dec 19, 2023

Rereading this thread, it sounds like @wclr has a repo with top-level folders that have a lot of content within them.

Right, it is quite an old project with lots of different content, which is gradually migrating to PS as the foundation. Anyway, there may be different situations and user contexts in which a project directory holds a lot of content unrelated to PS code/packages.

And arguably, if we wrote a script that produced a whole lot of such directories and subdirectories, we'd be able to produce this "Reading Spago workspace configuration..." hang issue that you've described?

There is no need to spend time on this, really.

The current method of finding nested spago.yaml files is not efficient, it is quite obvious. It first globs all the content (excluding only the .spago dir) and then manually filters the glob results with gitignore patterns. Globbing without proper ignore patterns can be significantly slow (as in my project). I have often run into this problem in practice when dealing with globbing (and the links above should confirm this point).

And since it is not obvious how to correctly convert gitignore patterns into glob ignore patterns, it would be better to implement a custom walk through the directories, filtering them on the go and not visiting paths that should be ignored. This should speed up the FS search step a lot and make the result-filtering procedure unnecessary.
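
The idea of filtering on the go can be sketched as follows. This is a Python illustration under stated assumptions, not Spago's PureScript code: the function name and the default ignore list are hypothetical, and the patterns are matched against directory names only, which is much simpler than real gitignore semantics.

```python
import fnmatch
import os


def find_configs(root, ignored=("node_modules", ".git", ".spago", "output"),
                 target="spago.yaml"):
    """Walk root top-down, never descending into directories that match `ignored`."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk to skip the removed
        # subdirectories entirely, so their contents are never listed.
        dirnames[:] = [
            d for d in dirnames
            if not any(fnmatch.fnmatch(d, pattern) for pattern in ignored)
        ]
        if target in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, target), root))
    return sorted(found)


if __name__ == "__main__":
    for path in find_configs("."):
        print(path)
```

Note that os.walk does not follow symlinked directories by default, which also avoids re-walking symlink-heavy layouts such as the PNPM one mentioned earlier in the thread.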

@wewei

wewei commented Dec 29, 2023

I got a repro on my Windows 10 machine. It doesn't take forever, but it fails to make a network request. I got the following error after about 1 minute.

D:\Code\purescript-playground>spago build
Reading Spago workspace configuration...

❌ Couldn't fetch package set:
  There was a problem making the request: request failed

When I used spago build -v, it turned out to stop while fetching the packages.json file from raw.githubusercontent.com.

D:\Code\purescript-playground>spago build -v
[      21ms] CWD: D:\Code\purescript-playground
[      31ms] Global cache: "C:\\Users\\Wei Wei\\AppData\\Local\\spago-nodejs\\Cache"
[      33ms] Local cache: "D:\\Code\\purescript-playground\\.spago"
[     120ms] DB: Connecting to database at C:\Users\Wei Wei\AppData\Local\spago-nodejs\Cache\spago.v1.sqlite
[     137ms] Reading Spago workspace configuration...
[     138ms] Reading config from spago.yaml
[     181ms] Selecting package my-project from ./
[     182ms] Reading the package set from URL: https://raw.githubusercontent.com/purescript/package-sets/psc-0.15.13-20231223/packages.json

This is sometimes blocked by my ISP. I use a proxy to work around this, but it seems the spago command doesn't use my system proxy.

It's strange that curl works for me.

curl https://raw.githubusercontent.com/purescript/package-sets/psc-0.15.13-20231223/packages.json

@f-f
Member

f-f commented Jan 14, 2024

@wewei you are seeing a different issue: the original ticket is about how quickly we traverse the tree of directories, while your issue is about Spago being slow on a network request.

The node runtime doesn't seem to have support for system-defined proxies, so this is a separate feature we'd need to implement. Please open a new ticket if you're still interested in having such a feature.

@f-f
Member

f-f commented Jan 14, 2024

@wclr

The current method of finding nested spago.yaml files is not efficient, it is quite obvious

Maybe? I am personally not going to have a look at this unless we have a benchmark that demonstrates the problem.
You are welcome to put together a patch to change this behaviour if you believe it's problematic, but I won't merge it unless it comes together with said benchmark that can show me, any contributors and future maintainers that the current way of doing things is a problem.
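
As a rough idea of what such a benchmark could look like, here is a Python sketch that compares two hypothetical strategies on a given directory: walking everything and filtering the hits afterwards, versus pruning ignored directories during the walk. It only approximates the behaviour described in this thread and does not measure Spago's actual glob code; the ignore list and all names are assumptions.

```python
import os
import statistics
import sys
import time

# Hypothetical ignore list, matched against directory names only.
IGNORED = {"node_modules", ".git", ".spago", "output"}


def filter_after(root):
    """Visit every directory, then drop hits that sit under an ignored directory."""
    hits, visited = [], 0
    for dirpath, _dirnames, filenames in os.walk(root):
        visited += 1
        if "spago.yaml" in filenames:
            hits.append(os.path.relpath(dirpath, root))
    kept = [h for h in hits if not IGNORED.intersection(h.split(os.sep))]
    return sorted(kept), visited


def prune_early(root):
    """Skip ignored directories during the walk, so they are never listed."""
    hits, visited = [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        visited += 1
        dirnames[:] = [d for d in dirnames if d not in IGNORED]
        if "spago.yaml" in filenames:
            hits.append(os.path.relpath(dirpath, root))
    return sorted(hits), visited


def benchmark(strategy, root, repeats=5):
    """Return the median wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        strategy(root)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for strategy in (filter_after, prune_early):
        _, visited = strategy(root)
        millis = benchmark(strategy, root) * 1000
        print(f"{strategy.__name__}: {visited} dirs visited, median {millis:.1f} ms")
```

Both strategies return the same set of package directories, so the visited-directory count and the median timing isolate the cost of the traversal itself.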

@f-f
Member

f-f commented Jan 26, 2024

Closing this in favor of #1182, since that one has some level of detail and is more actionable.

@f-f f-f closed this as completed Jan 26, 2024