Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add API for retrieving the list of files that will be accessed #1659

Merged

Conversation

paul-dingemans
Copy link
Collaborator

Description

The list of files which will be accessed by KtLint when linting or formatting a given path can now be retrieved with the new API KtLint.getInputPaths(path: Path): List<Path>. Currently, the .editorconfig files are the only files which will be accessed during linting or formatting in addition to the source files itself.

This API can be called with either a file or a directory. It's intended usage is that it is called once with a directory path before actually linting or formatting files. When called with a directory path, all .editorconfig in the directory or any of its subdirectories (except hidden directories) are returned. In case the given directory does not contain an .editorconfig file or if it does not contain the root=true setting, the parent directories are scanned as well until a root .editorconfig file is found.

Calling this API with a file path results in the .editorconfig files that will be accessed when processing that specific file. In case the directory in which the file resides does not contain an .editorconfig file or if it does not contain the root=true setting, the parent directories are scanned until a root .editorconfig file is found.

Closes #1446

Checklist

  • PR description added
  • tests are added
  • KtLint has been applied on source code itself and violations are fixed
  • documentation is updated
  • CHANGELOG.md is updated

In case of adding a new rule:

/**
* Get the list of files which will be accessed by KtLint when linting or formatting the given file or directory.
*/
public fun getInputPaths(path: Path): List<Path> =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Gradle Plugin perspective, we want to start watching input files to be able to reload ktlint caches when their content changes. The preferred way is to call KtLint#reloadEditorConfigFile, which name suggests it's meant to be used with editorconfig files only.
So, I started wondering if these two shouldn't be symmetrical 🤔 With getInputPaths() and reloadEditorConfigFile each library consumer, who also experiences ktlint config being cached across re-runs, will have to filter out unknown files. If we had a separate method per each type (getEditorConfigPaths + reloadEditorConfigFile) or more generic methods (getInputPaths + reloadInput) it would be clear how they interact with each other.

I know currently, mostly .editorconfig files will be returned here, I'm just trying to future-proof the api

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me it works to rename getInputPaths to getEditorConfigFilePaths and leave out the source file itself which I added solely because the getInputPath sounded more generic ;-)
I see no need to the generic methods (getInputPaths + reloadInput) until we actually have other sources of inputs.

Comment on lines +27 to +29
if (path.isDirectory()) {
result += findEditorConfigsInSubDirectories(normalizedPath)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started wondering if there shouldn't be a way to exclude searching in subdirectories 🤔 I initially thought of passing project root directory as the path, but that would result in traversing through all files and folders, including build, docs or resources directory, which would be wasteful.

Gradle plugins would have to be smarter, and we'd have to figure out a way to identify the root folder for all directories containing Kotlin sources, which unfortunately I failed to do so :/ For Kotlin plugin we get SourceDirectorySet#getSrcDirTrees which would allow us to identify multiple root directories, but I'm unaware of a similar way to obtain such information for other plugins (for example, for Android Gradle Plugin).

Given all of the above, I started leaning towards ignoring subdirectories, for performance reasons (and I believe sharing common editorconfig settings for a compilation unit is a fairly common situation). Having the subdirectories' traversal optional, ktlint consumers could switch to a more precise input calculation if they are able to identify a proper top-most path.
Do you have any thoughts/ideas/advices here?

I can link my comment on the same topic: jeremymailen/kotlinter-gradle#265 (comment)

Copy link
Collaborator Author

@paul-dingemans paul-dingemans Sep 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think the performance impact of scanning all subdirectories it too big? This call should be made only once. For testing KtLint I always run KtLint on a collection of open source projects which in total contain 2,800+ directories and more than 5,000+ files (both excluding files in .git and build dirs) and the scanning is really fast.

I have not encountered many use cases in which a project contains multiple .editorconfig files. But I think it would be really confusing for developers if the .editorconfig file in the root of the project and its parents is treated differently from .editorconfig files in subdirectories.

As with all performance issues, my advice would be first to measure before deciding to optimize.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think the performance impact of scanning all subdirectories it too big?

I don't know. I haven't ever done any sort of file performance things. My assumption was that doing things takes more time that completely avoiding it 😛

This call should be made only once

From Gradle Plugin perspective, it's once per task invocation, so every time someone calls ./gradlew ktlintTask.

As with all performance issues, my advice would be first to measure before deciding to optimize.

Sure, sounds reasonable 👍 I'll make sure to do extra profiling when switching to the new API 👍 Thanks for answering my doubts (and for finding time to work on the new api) 🙏

@paul-dingemans paul-dingemans merged commit 8e78581 into pinterest:master Sep 26, 2022
@paul-dingemans paul-dingemans deleted the 1446-expose-inout-paths branch September 26, 2022 18:38
@paul-dingemans
Copy link
Collaborator Author

@mateuszkwiecinski The change has been merged to master. Can you please test the API against the latest snapshot?

@mateuszkwiecinski
Copy link
Contributor

@paul-dingemans Thanks!
I tried to swap https://github.com/jeremymailen/kotlinter-gradle to the new API, so I passed project root as the reference for editorConfigFilePaths and I started observing noticeable delay. Configuring all the tasks took more than 40 seconds, instead of regular 2-3 seconds. I assume it's because kotlinter configures multiple tasks, one for each sourceset in a module, so effectively in my case I ended up calling editorConfigFilePaths hundreds times, once per each Gradle task, where each call traverses through all directories within the project 🤷
I later did a proof of concept and started calling Ktlint.editorconfig with hardcoded "/src" subdirectory to exclude the build folder, and I was still able to observe a slight 1-2s delay.
At the end I did some additional measuring and I compared duration of a editorConfigFilePaths({project_root}) vs editorConfigFilePaths({project_root}/.editorconfig) which was my way of trying to ignore subdirectories. The results were 500ms vs 200μs (🤯), which I read as the cost of going through all module's subdirectories. (I did my tests on a fairly large Android project)

So yeah, the conclusion I draw is I'd have to do some research and see if I can identify, in an idiomatic way, the top-most folder I should pass to look for nested editorconfig files so looking for nested editorconfigs doesn't come at that significant cost :/

If possible, I'd still propose to add ability to parameterize the search, so consumers could try to aim to use the default implementation, but always have an ability for a more performant and less correct alternative. If that's not something you'd be open to introduce, I'll have to dig into Gradle, AGP and Kotlin Plugin APIs and will be beack with similar, more precise benchmarks based on a proper implementation (assuming I'll manage to come up with such)

@paul-dingemans
Copy link
Collaborator Author

I tried to swap https://github.com/jeremymailen/kotlinter-gradle to the new API, so I passed project root as the reference for editorConfigFilePaths and I started observing noticeable delay. Configuring all the tasks took more than 40 seconds, instead of regular 2-3 seconds. I assume it's because kotlinter configures multiple tasks, one for each sourceset in a module, so effectively in my case I ended up calling editorConfigFilePaths hundreds times, once per each Gradle task, where each call traverses through all directories within the project 🤷

I have very limited knowledge and expertise with Gradle. But having hundreds of executions of the editorConfigFilePaths method is quite extreem. The editorConfigFilePaths can be optimized by using an internal cache but that of course will only work if the same ktlint instance is used by all Gradle tasks.

Is there any chance that you can check that the number of invocations of editorConfigFilePaths really relates to the number of sourcesets? Or is it maybye called once per source file? It would also be helpful if you can share your branch of kotlinter-gradle plugin and the sample project you're using so that I can have a look as well.

I later did a proof of concept and started calling Ktlint.editorconfig with hardcoded "/src" subdirectory to exclude the build folder, and I was still able to observe a slight 1-2s delay.

Scanning build and .git folders is indeed wasteful as they contain lots and lots of directories. .git folders are skipped automatically as it (at least on Unix) is considered a hidden directory because it starts with a .. Scanning the src folder indeed prevents scanning both the build and .git folders, so that is a smart change.

At the end I did some additional measuring and I compared duration of a editorConfigFilePaths({project_root}) vs editorConfigFilePaths({project_root}/.editorconfig) which was my way of trying to ignore subdirectories. The results were 500ms vs 200μs (🤯), which I read as the cost of going through all module's subdirectories. (I did my tests on a fairly large Android project)

500ms does not sound bad, except when you do it hundreds of times of course ;-)

So yeah, the conclusion I draw is I'd have to do some research and see if I can identify, in an idiomatic way, the top-most folder I should pass to look for nested editorconfig files so looking for nested editorconfigs doesn't come at that significant cost :/

Would it be of any help in case you can pass a list of paths to editorConfigFilePaths instead of a single path? Especially, if this could be combined with an internal cache, it would result in a single sweep to read all subdirs.

If possible, I'd still propose to add ability to parameterize the search, so consumers could try to aim to use the default implementation, but always have an ability for a more performant and less correct alternative. If that's not something you'd be open to introduce, I'll have to dig into Gradle, AGP and Kotlin Plugin APIs and will be beack with similar, more precise benchmarks based on a proper implementation (assuming I'll manage to come up with such)

I am still not convinced that paremeterization is the solution. At least it seems a solution for your problem based on the assumption that most projects do not use multiple .editorconfig files. What worries me is the chance on 'backfire' to ktlint that it does not respect the .editorconfig files in some cases.

@paul-dingemans
Copy link
Collaborator Author

Caching has been added. Please checkout the new snapshot. There should be some log lines:

Scanning file system to find all '.editorconfig' files in directory '$path' scanned $visitedDirectoryCount directories in $duration ms
Creating cache entry for path $path with value $cacheValue
Retrieving EditorConfig cache entry for path $path

mateuszkwiecinski added a commit to jeremymailen/kotlinter-gradle that referenced this pull request Oct 9, 2022
mateuszkwiecinski added a commit to jeremymailen/kotlinter-gradle that referenced this pull request Oct 9, 2022
@mateuszkwiecinski
Copy link
Contributor

Is there any chance that you can check that the number of invocations of editorConfigFilePaths really relates to the number of sourcesets? Or is it maybye called once per source file? It would also be helpful if you can share your branch of kotlinter-gradle plugin and the sample project you're using so that I can have a look as well.

I'm fairly confident I know when things are called by Gradle. I aim to make the new API to be called once per sourceset, so it finds only related editorconfig files. Here's the branch: link.
At this very moment it's called once per project, but results in an invalid behavior: If someone modifies the content of a .editorconfig within test sourceset, the main sourceset task will be unnecessarily re-executed (because Gradle think it's one of main task's input, it's not going to be considered as up-to-date).

Would it be of any help in case you can pass a list of paths to editorConfigFilePaths instead of a single path? Especially, if this could be combined with an internal cache, it would result in a single sweep to read all subdirs.

You mean the plugin would have to scan all subdirectories, find editorconfig files and only then call KtLint#editorConfigFilePaths? That's what both Gradle plugins do already - we try to guess which files will be accessed by ktlint. I might have misunderstood something here, but if plugins had to scan directories on their own I'm afraid we'll get back to the original issue the new api tries to address.

Caching has been added. Please checkout the new snapshot. There should be some log lines:

I tested the latest snapshot and there is an observable improvement, you managed to reduce time to scan directories from 200μs, to 15-17μs 🚀
image
And again a note: I'm making only simple manual testing with measureTime - these are not serious performance tests

The current state of my wip branch is as close I could get trying to find balance between correctness and performance. Ideally, I'd be able to replace the TODO projectRoot directory with sourceset "root", but I'd probably have to get in touch with AGP maintainers and ask to add new apis (to match Kotlin plugin behavior) 🤷
As a conclusion, I believe the current proposal of the API and its implementation should be fine for now. It should be enough for the first iteration - I'll try to figure how to improve things on the plugin side. I'll let you know if I manage to move things forward :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose information about files/resources that will be accessed when invoking ktlint
2 participants