-
-
Notifications
You must be signed in to change notification settings - Fork 586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal - simplify initial scan #410
Comments
👍 Could you extract your work to a module and submit a PR here? We'll see how well it'll play with tests. |
@kmalakoff also if your |
@paulmillr sure, I'll extract and submit to start a concrete back and forth. On minimatch / micromatch, (In fact, I prefer whatever the recommended solution for peer dependencies is so globbing is optional, but that is for another discussion / refactor). |
Great. |
Just FYI, a replacement for readdirp has been on my radar as well. Likely something that could act as a replacement for node-glob also. My primary goals were to use micromatch for the globbing and reduce I'm concerned about the performance implications of some of what you're describing. Specifically, while your approach may be much more gentle on the node event loop, it may take noticeably longer to walk large file trees. I've certainly been paying attention to the discussions lately about backpressure and the points you've been bringing up in thinking about how I want to approach this. Hoping to find ways to improve upon the situation on all these fronts.
I disagree about #229, I think it's more likely a permissions thing on Windows for special fs entries at the root of the drive. graceful-fs is a safety net - I agree that it's worth looking for ways to avoid ever triggering its exponential backoff. Backpressure has not been implemented in readdirp, and I'm not sure I understand what you have against it. It is likely an important component in making a smoother and more efficient solution. |
@es128 we don't need to include globbing in the readdirp — as @kmalakoff mentioned. We can do the job on our own. |
@es128 totally. On #229, I quickly chose an issue was easy to find since I have limited available time (startup founder), but I've seen many as I was scratching my head figuring out what to do about performance...Either way, you know scalable performance and robustness is a problem regardless of which issue I refer to! 😏
I do have a pre-stat version and a post-stat version of walk so I'll release both. I use the pre-stat version if my filter depends on knowing whether something is a directory / file vs post-stat for paths-only based filtering. I'm not sure if chokidar can make that determination on behalf of the user (but you could expose an option) so I would treat that in a separate optimization phase since after you are using smaller, simpler approaches, you can optimize edge cases more easily....right now, it is nearly impossible to to optimize readdirp (I considered optimizing it in the issue I referenced) because does some strange things like directory batching which is hard to work around (with a large folder, I have to wait 30 seconds before it starts emitting the things I'm interested in) so that's sort of at the heart of the problem and proposal.
Yeah, I know... That said, I only have limited time available so there is an opportunity cost to me to do this. If you strongly disagree or believe you won't seriously consider using it, I will get back to my startup. @paulmillr and @es128 should I make the time to do this? |
Just for chokidar that may be the case, but part of my plan is to also provide a generic drop-in replacement for node-glob in order to close the loop on the caveats discussed in gulpjs/glob-stream#56 and move the ecosystem more towards standardizing on micromatch vs minimatch. But complete lack of glob awareness in the fs walker may lead to some inefficiencies, like traversing a subdir unnecessarily. |
@es128 you wrote: Here is point 1 in my proposal: |
A glob can match a dir, but none of its contents. |
@kmalakoff can you show us what you've done so far without investing time on extracting it into a module? |
@es128 no glob functionality in it yet, no tests, and it is a little messy. I'll look through the code for internal references and will grant you access. It's pretty simple... PS. I'm not a big fan of big module approaches, but prefer small modules that can be configured by the user. I spend so much time looking at module dependencies for cross-platform code so things like micromatch do strange lazy loading that isn't compatible with webpack so I always swap it out in the browser. People need choices. |
Here's walk-filtered:
By having a simple |
Here is a replacement for readdirp:
I couldn't get the depth to work so it is commented out. Also, a lot of the entry stuff can probably be cut out. |
@es128 I'm using gitignore which does something similar (handles different logic for directories and files) and it is working for me. |
Cool. I'll evaluate it. Really appreciate this and all the feedback you've been providing @kmalakoff. Wish I could replicate your app/environment to be able to look more closely at what's locking up with readdirp. Curious how node-glob fares under the same circumstances. I do remain skeptical that going fully serial will be the right move for chokidar. I'm thinking that some sort of bounded parallelism, limited based on a reasonable estimate at what most file systems will handle well, is more likely to hit the sweet spot. |
@es128 my pleasure. I hope it helps...
I know very well what the problem is...it batches up directory processing rather than emitting file by file and directory by directory, it has no control over parallelism, etc. I've spent too long looking through its code... |
You're referring to the The premise is that the unbounded hammering of fs calls overtaxes the file system at some point, right? So would you think that introducing some sort of queuing (with tunable limits) to control the pace of parallel calls would be a move in the right direction without going fully serial? |
Yeah, I just tried to copy the As for the back pressure idea, I think that sounds great for some use cases, but not totally needed for mine (although if I could throttle it way back, it would be fine as long as it is has smooth, controllable performance and resource usage). Ideally, it should be a separate module that is well-tested and given as an option to the library user. The problem is that people use chokidar in different ways, some people want to max their system and use graceful-fs because they are writing a utility that runs on developer machine or devops server and some people want it running calmly and slowly in the background without any urgency (like my application). For me, I'm using chokidar in electron and the browser so I need to prioritize rendering performance / responsiveness over speed of traversal. Also, I am more interested in giving priority to the work done after detecting a file in many cases (since my filtering is to bring up relevant files to the end user that they potentially want to start interacting with). I've used async, queue-async, bluebird, Node streams, etc and each has tradeoffs in terms of controllability. If you design or find a nice little control-flow back pressure library, there are many ways to hook it in. For example, since the filter function in walk-filtered is asynchronous and serial, you can also use it to rate limit by not calling the callback (treat it like done / ready) until you are ready for more data, but when you are under utilized, keep consuming as much as possible. Of course, there might be a better interface since what I'm saying is still serial unless you provide a buffer / read-ahead length (sort of sounds like a stream, but they are complex, heavy to bundle on the client, etc, but you get the picture). Just need to find the right control API that supports the major use cases and then plug in the best implementation for the user's need or give them the right level of control. For example, I've used queue-async in the past since you just give it a parallelism count, but not sure if that is enough control. That said, it is simple and flexible for many situations. Control flow should be a solved problem so hopefully, there is already a good library on npm for what you describe. |
What exactly is the memory-based fs shim you're using? Would like to take a look at how it handles I wonder if it's causing processing that should be asynchronous to become synchronous (such as immediately calling the callback without |
@es128 Thank you for helping! I use immediate and I've been writing fs-memory, but the problem isn't in the browser (and the library is a work-in-progress, eg. not for production), the problems were more under scale in electron with the actual filesystem and real data-sizes. The memory version is more for development and leaving the door open for a browser (non-electron) version of the application in the future. I have very limited, available time so I'm just bring things to the MVP level. I think you are following a red-herring...like I said, I'd rather you focus on making chokidar more performant than my problems since I have already optimized my application to meets my needs. I'm just trying to share what I did with chokidar to work around readdirp and help the broader community. |
@es128 @paulmillr I'm going to spent a couple of hours to work through the tests to see where I get to. All of this talking could be spent actually doing instead! |
@kmalakoff thanks, the discussion is useful. As stated, I've already been thinking about coming up with a new fs walking solution, so I'm just gathering as much intel as I can about considerations that should go into it. While your overall use-case may be unique, it has helped to clarify how I think about some of the problems. |
Interesting. A bit unrelated, but fs-memory resembles a whole different thing I've been thinking about, probably intended for solving a completely different issue common to build systems. Whenever I get around to working on it, I may try to start by using fs-memory as a basis. |
@es128 cool. I want to basically port all of the node fs tests and put it back to ES5 syntax before other people use it seriously since I'm just implementing as I go. Right now, I'm bundling up a git project using a loader module in webpack and then running git in memory in the browser with js-git. Still early days though... Feel free to use it as it does handle quite a few cases already. I can extend and add tests as you need. |
@es128 @paulmillr I've created a first version for you to review. Here is: https://github.com/kmalakoff/walk-filtered and here is the updated version of chokidar: https://github.com/kmalakoff/chokidar. I've submitted a pull request, but it is a bit premature until walk-filter's each method is optimized (see below). As for the parallel / serial discussion, now that the iteration is very simple and pluggable, I've benchmarked a few approaches:
You can see that the 'each' API can be adapted to different iteration strategies and I've made it configurable. I chose "Parallel limit of 50 with graceful-fs" as the default since it is a "safe"-ish choice, but it is not totally safe since the limit is a local limit instead of a global limit so it doesn't scale cleanly (eg. 50 + x*50 + ...). There is definitely room to improve... What I'm thinking:
I'm out of time on this, but I think it is nearly there. All the tests are passing and the code is pretty small / tight. @es128 if I have convinced you, would you like to write an optimize the array each function? |
@kmalakoff I have more to review, but kudos on all this good work. My next step is to play with your benchmarks - try it against different sized file trees, put it up against readdirp, see what happens with even higher levels of parallelism, etc. This would be a significant change to solve a problem that has been mentioned by very few users, so I feel we need to tread very carefully. Sorry in advance if I end up taking a long time to assess and do due diligence on this. I think exposing options for tuning this will be valuable, but I'm worrying a bit about how to describe to users what considerations would go into defining their own Also, regarding the PR, one initial comment is that I think the new internal module should be named something other than |
@es128 Awesome! It should be easier to play and optimize now.... By the way, I'm not too concerned about these types of details and you can decide what you think is best. On the naming, I'm going to release readdirp-walk so that should be fine soon enough. I just tried to keep it simple for the first pull so you can compare by just changing between As for |
@es128 OK. I've moved readdirp into its own module (using micromatch) and made sure the readdirp tests pass and updated the pull request: https://github.com/kmalakoff/chokidar/commit/e37845218d4f0906d97113fedf30df3137388805 Definitely, take your time. I'm going to use a private version with serial processing since even my reduced parallelism isn't 'unobtrusive enough' for my application (eg. serial). If you want access to any of the repos, I am happy to give it to you. |
@es128 I've added a concurrency option for controlling parallelism. It might be the type of simplicity that you would be looking for. https://github.com/kmalakoff/chokidar/commit/4ccb313745b3d729c2f64a60b3e03206c6611b9b Let me know if you have any questions or ideas for load testing and the corresponding APIs. I'm looking forward to what you come up with! |
@es128 and @paulmillr I've cleaned up the READMEs on the two modules I've released to help with this that outline some of the problems with readdirp and outlining the fine-grained options:
I'd like to make sure the concurrency isn't considered the only benefit of this work. I'm not accumulating memory, have been very stringent on only doing what is necessary before checking for filter (eg. early exits), etc. Plus, I think if a bit of time is spent on optimizing the iteration Most importantly this lays some more solid foundations for dealing with some of the edge cases that people bring up and optimizations you would like to perform since the modular approach makes it easier to experiment with. Examples: you could now easily bypass readdirp and use walk-filtered directly to avoid pre-stating all of the files and folders, you could generate the entry information lazily (only as and when needed), etc. Agreed with @es128 that a big change like this needs to be seriously tested and vetted, but it definitely has the potential to be a good time-saver to many people given ubiquity of chokidar in people's daily work plus it will help increase productivity in your optimization efforts.... I think you are going to be very satisfied...be sure to track before and after benchmarks! |
+1, very interested in seeing this happening to improve the user experience in VS Code on Mac/Linux. |
This looks very promising, it seems that there is no reason not to use this? |
v3 resolves this with new readdirp made by us |
I wrote the author of readdirp a few weeks ago because after tracking down the source of my electron application lockup at startup, I am pretty convinced that its approach to blasting the filesystem with a huge number of parallel requests is flawed / not scalable. Plus, I think this approach is the source of many bugs or performance issues (such as #229) and has caused workarounds like needing graceful-fs, backpressure, etc.
In my personal project I have replaced readdirp in chokidar with a simple walk implementation to meet my needs and things are smoother and much faster for me (no numbers, but first want to float an idea).
Basically, my approach was to make it massively serial (async.eachSeries) so that it played nicely at any scale and to simplfy it since there is some funky async logic in it. Obviously, some parallelism could improve throughput, but it adds complexity to manage performance for large scans (but still is probably desired).
Proposal: replace readdirp
I could extract my filtered walk into a small library. My current API is simply:
It would do lstat before every filter call like you rely on, but things like globing and depth would be left up to the library user.
Customizing it to chokidar's needs:
This change would require a small to medium refactor of chokidar so before considering me developing the code and some performance tests, I was wondering if you agree with the philosophy of the proposal (serial and no graceful-fs), have feedback, etc. Also, I'm not sure about all of the edge cases so I might be proposing something with a serious flaw.
Interested? Suggestions (like an infinitely scalable and simple parallelism strategy)?
The text was updated successfully, but these errors were encountered: