Feature: symlink support revised #381
Comments
I did continue work on the rework in my fork and I'm currently using it. I stripped out all the symbolic link management and it works fairly well, though I've found a few racy data accesses. I'm unsure what you mean by a symlink database, though. I guess this is going to be the entire reason for the architecture change, as symlinks do add a huge headache. It would be useful to know your thoughts on structure and handling, and whether it needs to interact with multiple backends. I agree with most of what you say, and I do see being able to reuse a lot of what is already there. I just don't want to make assumptions, as I'm unsure of the symlink database's inner workings. On a side note, I found some fundamental issues with rb-fsevent, such as it ignoring kFSEventStreamEventFlagMustScanSubDirs, which of course isn't ideal. I've not seen any real-world occurrence as of yet, though! Just thought I'd add it in here while we discuss futures :)
A symlink database would basically handle reporting changes. e.g. if So the symlink database is just to later translate changes into symlink targets changed. It would be separate from the existing record/filesystem database for easier testing, etc. It's just a "last step". Most of Listen will work on real paths. So symlink -> real happens at the beginning (to remember what the user really cares about watching) and real -> symlinks happens at the end (to translate real changes into all the possible combinations of symlinks to report them as changed). So this is "outside" the normal processing. It could even be added right now, but I'd improve the architecture first.
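A minimal sketch of that "last step", assuming a hypothetical `SymlinkDatabase` class (the name and methods are illustrative, not part of Listen). It resolves each watched name to its real path once at the beginning, and at the end expands a real changed path into every watched name that leads to it:

```ruby
require 'tmpdir'

# Hypothetical class, not part of Listen: remembers which watched names
# (possibly symlinks) resolve to which real directories.
class SymlinkDatabase
  def initialize
    @watched_by_real = Hash.new { |hash, key| hash[key] = [] }
  end

  # symlink -> real: done once, at the beginning.
  def watch(path)
    real = File.realpath(path)
    names = @watched_by_real[real]
    names << path unless names.include?(path)
    real
  end

  # real -> symlinks: done at the end, so a real change is reported under
  # every watched name that leads to the changed path.
  def translate(real_change)
    @watched_by_real.flat_map do |real_dir, names|
      inside = real_change == real_dir ||
               real_change.start_with?(real_dir + File::SEPARATOR)
      next [] unless inside

      relative = real_change[real_dir.length..-1]
      names.map { |name| name + relative }
    end
  end
end

Dir.mktmpdir do |tmp|
  real = File.join(tmp, 'real')
  link = File.join(tmp, 'link')
  Dir.mkdir(real)
  File.symlink(real, link)

  db = SymlinkDatabase.new
  db.watch(real)
  db.watch(link)
  # The backend only ever sees the real path; the database fans it out.
  puts db.translate(File.join(File.realpath(real), 'foo.rb'))
end
```

Because the translation is a pure lookup, it can be tested without any backend and kept separate from the record/filesystem data.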
I wouldn't know. rb-fsevent isn't optimized (it has its own wait loop), but it could be integrated more tightly with Listen for better optimization, given a better architecture. The idea is: the faster rb-fsevent can respond, and the more "hints" it can give about changes, the more it can be optimized. And the more Listen's parts can be reused. E.g. file size may be a faster and more reliable "first check" than mtime. I think the overall "philosophy" behind the new architecture is: "directory diff based". So it's about comparing directory snapshots over time, which is very different from storing an "active" state of the filesystem in the Record structures. Scanning and diffing directories seems expensive, but it's what actually happens on the Polling and Darwin adapters (they share most of the same code paths). My earlier mistake was over-optimizing Record access instead of turning it into 3 separate databases:
And these could actually be in SQLite anyway. Tree structures are too mind-boggling to work with. They'd have to make heavy use of classes to stay maintainable.
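To make the "directory diff based" idea concrete, here is a rough sketch, assuming a hypothetical `DirectorySnapshot` value object (not an existing Listen class). It scans a directory once and then compares two snapshots without touching the filesystem again, using size alongside mtime as the first check:

```ruby
require 'tmpdir'

# Hypothetical value object: a frozen snapshot of one directory's entries.
DirectorySnapshot = Struct.new(:path, :entries) do
  # Scan the directory once; entries maps name => [size, mtime].
  # lstat is used so symlinks are recorded as themselves, not their targets.
  def self.scan(path)
    entries = {}
    Dir.children(path).each do |name|
      stat = File.lstat(File.join(path, name))
      entries[name] = [stat.size, stat.mtime]
    end
    new(path, entries.freeze)
  end

  # Pure comparison between two snapshots: no filesystem access.
  def diff(newer)
    old_names = entries.keys
    new_names = newer.entries.keys
    {
      added:    new_names - old_names,
      removed:  old_names - new_names,
      modified: (old_names & new_names).reject { |n| entries[n] == newer.entries[n] }
    }
  end
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.rb'), 'x')
  before = DirectorySnapshot.scan(dir)

  File.write(File.join(dir, 'a.rb'), 'xyz') # size changes even if mtime does not
  File.write(File.join(dir, 'b.rb'), '')
  after = DirectorySnapshot.scan(dir)

  p before.diff(after)
end
```

The sketch does not handle a file disappearing between `Dir.children` and `File.lstat`; a real backend would need to rescue that case.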
Oh I see. That sounds great. Regarding the third database, the flat list of directory snapshots, isn't that what Record already is? And Record is currently a really simple tree structure. Maybe it's useful to make Record an interface; then you could have a memory-based tree if you didn't care about offline changes, and also a SQLite option (or even whatever your application uses as a backend). All under one interface. I'm not really convinced the tree structure for Record is mind-boggling, though, or in need of replacing. It's a DirectoryPath->State mapping and that's it. And any SQLite implementation would be the same, surely? Unless you're referring to some other parts? In my fork of the rework I stripped symlink support and made Directory support two scan modes, recursive and shallow. I use it to do incremental syncs to Vagrant machines for developers and it's blazing fast. I now need to add symlink support back in, as we are using symlinks more and more with "composer" libraries. I don't mind drawing that work out to help start some improvements, as I know what you mean by difficult to mock/test; I've somewhat neglected TDD. And it would be great to see it made acceptable for wider consumption instead of hidden away :)
No, it's still a tree, see this TODO: https://github.com/guard/listen/blob/e21066c/lib/listen/record.rb#L7 Also, while the tests are much better than they used to be, they're still hard to maintain: https://github.com/guard/listen/blob/e21066c/spec/lib/listen/record_spec.rb#L285 It's best to switch to objects representing snapshots and not a "build-as-you-go" structure. The idea here is also to avoid any filesystem operations outside scanning. Ideally only the backend would touch the filesystem, report an updated directory snapshot, and then the filesystem is never accessed until the next event. The biggest problem with understanding, maintenance and testing right now is that the filesystem is accessed multiple times between the event and the callback. This makes most of the codebase non-deterministic.
I worked hard to keep it simple, fast, properly covered and memory efficient. It still makes too many "hidden" assumptions. And edge cases are almost impossible to map out in reasonable time. E.g. the HFS filesystem uses 1-second mtime resolution (FAT is coarser still, at 2 seconds). This currently has to be "kept in mind" when changing anything. It's also hard to simulate in tests. The Record simply has too many responsibilities for a single structure.
One word: symlinks. It's better to think in terms of inode numbers. E.g. you have a directory name (a possible symlink) that resolves to an inode number. And the inode is basically what you're watching, not the "directory name". So even if a directory is renamed, you're still watching the same inode. So a better "structure" would be: If you rename a directory, the directory doesn't change. What changes is that directory's entry in its parent. (This makes detecting moves/renames more robust, while avoiding unnecessary dir scanning and file stat/md5). It seems more complex, but you're dealing with much simpler classes/objects. Much easier to test, fewer edge cases overall to support, and it's all deterministic and uninfluenced by timing. Not to mention - it's much easier to find code bottlenecks and e.g. break the work into threads. (For e.g. a faster response time). Again, this would allow reusing components for things like filesystem syncing. So creating a Ruby implementation of rsync would be possible and trivial.
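A small sketch of how rename detection could fall out of inode-keyed directory entries. The `diff_entries` helper and the name => inode hashes are assumptions for illustration (the inode values stand in for what `File.lstat(path).ino` would return); it deliberately ignores hard links and inode reuse:

```ruby
# Hypothetical helper: compares two snapshots of one parent directory.
# Each snapshot maps entry name => inode number.
def diff_entries(old_entries, new_entries)
  old_by_inode = old_entries.invert # inode => name (assumes no hard links)
  new_inodes = new_entries.values
  changes = { renamed: [], added: [], removed: [] }

  new_entries.each do |name, inode|
    next if old_entries[name] == inode # same name, same inode: unchanged

    old_name = old_by_inode[inode]
    if old_name && new_entries[old_name] != inode
      # Same inode under a new name: only the parent's entry changed.
      changes[:renamed] << [old_name, name]
    else
      changes[:added] << name
    end
  end

  old_entries.each do |name, inode|
    changes[:removed] << name unless new_inodes.include?(inode)
  end

  changes
end

before = { 'src' => 100, 'tmp' => 101, 'docs' => 102 }
after  = { 'lib' => 100, 'docs' => 102, 'spec' => 103 }
p diff_entries(before, after)
```

Here `src` becoming `lib` is reported as a rename without rescanning or hashing anything inside the directory, since the inode being watched is the same.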
I'd say that's the result of the current codebase, so I don't hold that against you ;) Also, the "snapshot" feature would make it easier to create use cases. E.g. someone on OSX could just dump the structures - and I could reproduce an OSX-specific timing-related bug in pure code. This would also allow a "record and replay" option, kind of like what Wireshark supports. And that would also speed up integration tests. (And they'd be backend-independent anyway). You'd no longer need to test any backend, either - except to the extent that they all report events in a unified way. (Basically - just returning current snapshots of changed directories + optimizer hints). And if those directory snapshots could be sent over TCP, you'd end up with a very low latency solution. (Sure, the packets would be large, but all the processing and optimizing could be done on the remote side). So you could implement a high-performing "reverse-shared-folder" on VMs, etc. So tons of issues could be solved and new options could become available.
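As a rough illustration of "record and replay", assuming snapshots are plain hashes that can be dumped as JSON (the `dump_recording` and `replay_recording` helpers and the snapshot layout are hypothetical), a recorded sequence could be replayed in pure code like this:

```ruby
require 'json'

# Hypothetical helper: serialize an ordered list of snapshots of one directory.
def dump_recording(snapshots)
  JSON.generate(snapshots)
end

# Hypothetical helper: replay a recording and return the changes between
# each pair of consecutive snapshots. No filesystem or backend is involved.
def replay_recording(json)
  JSON.parse(json).each_cons(2).map do |before, after|
    old_e = before['entries']
    new_e = after['entries']
    {
      'dir'      => after['dir'],
      'added'    => new_e.keys - old_e.keys,
      'removed'  => old_e.keys - new_e.keys,
      'modified' => (old_e.keys & new_e.keys).reject { |n| old_e[n] == new_e[n] }
    }
  end
end

# Two snapshots taken within the same second: the mtime is identical, so
# only the size reveals that a.rb changed (the 1-second resolution case).
recording = dump_recording([
  { 'dir' => '/project', 'entries' => { 'a.rb' => { 'size' => 1, 'mtime' => 100 } } },
  { 'dir' => '/project', 'entries' => { 'a.rb' => { 'size' => 2, 'mtime' => 100 },
                                        'b.rb' => { 'size' => 0, 'mtime' => 100 } } }
])
p replay_recording(recording)
```

The same serialized form could in principle be what gets sent over TCP, though that part is not sketched here.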
Symlinks are, underneath, files containing a text path to the target. If the target is renamed, the symlink becomes broken. Inode numbers are only involved in hard links, which is only for directories and complicates matters further, and to be honest I'm unsure how the file change notification systems in the various OSes react to hard links. Also, when you look at issues with Filebeat by Elastic, it is becoming clear that with some log rotations, and likely in other circumstances too, inodes (especially on files) get reused very quickly, so unless the file change notifications contain them it's most likely safer to keep md5 signature scanning. Especially if the desire is to convert file change notifications into directory snapshots, as that might make keeping inode information difficult. Using file paths instead of inodes is better too, I think, since a dump of a snapshot would be more readable. And it's good enough for You also get into more cross-platform wilderness, as Windows uses a device serial number and two index numbers per file. And depending on the Linux variant you have either unsigned or signed values, etc. Not sure how Ruby would expose any of it either. This all sounds pretty awesome though, and well thought out! Generating directory snapshots first and then comparing them to the previous ones is a great idea, and I can see it would make testing far, far easier. Depending on the Record layout it might even be faster, as each call is a bulk update for the snapshot rather than a trickle of calls.
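On the question of what Ruby exposes: on POSIX systems `File.lstat` and `File.stat` return a `File::Stat` with `dev` and `ino`, and `File.readlink` returns the stored target text. A small sketch (the `describe` helper is hypothetical, and the values on Windows are platform-dependent and not covered here):

```ruby
require 'tmpdir'

# Hypothetical helper: report what Ruby can see about a path.
def describe(path)
  link_stat = File.lstat(path) # does not follow symlinks
  info = { symlink: link_stat.symlink?, dev: link_stat.dev, ino: link_stat.ino }
  info[:target] = File.readlink(path) if link_stat.symlink?
  # File.exist? follows symlinks, so it is false for a broken link.
  info[:real_ino] = File.exist?(path) ? File.stat(path).ino : nil
  info
end

Dir.mktmpdir do |tmp|
  real = File.join(tmp, 'real')
  link = File.join(tmp, 'link')
  Dir.mkdir(real)
  File.symlink(real, link)

  p describe(link) # :real_ino matches the inode of the target directory

  # Renaming the target leaves the link's stored text pointing at nothing.
  File.rename(real, File.join(tmp, 'moved'))
  p describe(link) # :real_ino is now nil
end
```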
Yes. I'd ignore hard links at first. I don't think they really matter. As
Intro
The goal of Listen is portability. This is tricky, because each backend behaves differently. This means things like symlink support and TCP support had to be put aside for now.
But, with a better architecture, it should be possible to extend Listen without all the pitfalls of doing so.
Milestone: A new architecture.
Currently it goes something like this:
The new architecture would work something like this:
Why
Current "architecture mistakes" of Listen:
First step
Just getting every backend returning a list of changed directories. If a backend can "guarantee" that all files are reported properly (Linux and Windows backends only), then those files can help as "optimizer hints".
Otherwise, full "directory scanning" should be the default. E.g. if Linux reports that foo.rb was modified, then it reports a change in the directory containing foo.rb - and maybe hints that only foo.rb was changed in that directory.
Correctness is more important than speed, so it should always be possible to disable any "optimization". (To keep the codebase simple).
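The event shape described in this step could look roughly like the following. The `Change` struct and `names_to_check` helper are hypothetical names for illustration; the point is that hints are optional and off by default, so a full directory scan is always the fallback:

```ruby
require 'tmpdir'

# Hypothetical event shape: every backend reports a changed directory,
# and backends that can guarantee accuracy may attach file name hints.
Change = Struct.new(:dir, :hints)

# Returns the entry names to examine for a change. Hints are only used
# when explicitly enabled; otherwise the whole directory is scanned.
def names_to_check(change, use_hints: false)
  hints = change.hints
  return hints if use_hints && hints && !hints.empty?

  Dir.children(change.dir).sort # default: full directory scan
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'foo.rb'), '')
  File.write(File.join(dir, 'bar.rb'), '')

  change = Change.new(dir, ['foo.rb'])
  p names_to_check(change)                  # full scan: bar.rb and foo.rb
  p names_to_check(change, use_hints: true) # optimized: foo.rb only
end
```

Keeping the optimization behind a single switch means the unoptimized path stays exercised and can serve as the reference behaviour in tests.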
Full directory scanning seems like overkill, but it's absolutely necessary for Listen to be maintainable, testable and "correct". This may slow things down on Linux a tiny bit, but that shouldn't be noticeable. (Especially since new optimizations would be possible).
Also, "full directory scanning" is the only sane way to provide support for symlinks.
If anyone is interested in tearing up the codebase to make this happen, I'd be more than happy to support you.