-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: speedup docstring check consecutive runs #57826
CI: speedup docstring check consecutive runs #57826
Conversation
…nce' into improve_docstring_check_performance
Wow - my hat is raised above my head @datapythonista fancy taking a look? |
Thanks @dontgoto for making the CI much faster. While I'm open to getting this merged, I'm a bit unsure about using this approach. In an ideal world we'd like to call validate_docstrings just once for all errors and for all files. This is clearly not the case today, but hopefully we'll eventually get there, and making things significantly more complex for something hopefully lasting only few months may not be worth. Also, if we want to implement this I think I'd prefer another API where we can simply specify which files and errors to ignore together. Not sure exactly the best way to do this, but something like this should be clearer and simpler IMHO: As I said, I'm not opposed to merge this PR, but I think it's making the validator significantly more complex to understand, which is probably worth for the speedup now, but thinking longer term not so much. What do you think @dontgoto ? |
I think I can change the command line args and their preprocessing quite easily to match your I agree that making the rest of the code match this kind of structure and ditching the repeated sequential calls would be preferable, but maybe a bit much of a time investment. I might look into that in the future though, I have an idea for another PR that might make the docstring checks "commit-hookably" fast, but I still need to test it out. |
I think implementing what I said is not trivial, but I think just one validate call per function would be enough if we do that, so I think it should be as fast as your implementation here, or maybe even a bit faster. |
I pushed a first version of the Let me know what you think about the changes to the If we go ahead with the current version, I'll tidy it up and add tests for the parsing. Merge conflicts for docstring exceptions removed from main in the meantime are to be expected. |
Regarding performance, everything outside the cached validate function is basically free. Calls of |
Thanks for the work on this @dontgoto. What I had in mind is different to what you implemented. And I think it should be simpler, and reasonably fast. Regardless of the command line API we implement, we would end up with a list of function names and which errors we need to ignore for them. We have this stored in a variable, and then we call the validator normally: https://github.com/pandas-dev/pandas/blob/main/scripts/validate_docstrings.py#L317 At this point, we have in the result the errors that have been found in the function. If the error is in the list of errors to ignore, we can just remove that from the result. This way we don't need cache, we don't need to much extra complexity, and we call the parsing and the validation just once. What do you think? |
…nce' into improve_docstring_check_performance
I agree. The version I previously pushed was a half measure, just adding a variant of the CMD parameter required for this solution, but shoehorning it into the old logic. I refactored everything to use the new parameter, simplifying the parsing. I am satisfied with this solution, let me know whether you agree. For the new CMD parameter Open for a different name for the parameter though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the work here @dontgoto, really great job. I think this is way clearer, and also thanks for fixing the typos in the file.
I added a couple of comments that I think should make the code clearer, both in the script and in code_checks.sh
, but in general your changes look great.
…nce' into improve_docstring_check_performance
Looks great, thanks a lot for the work here @dontgoto @jordan-d-murphy @tqa236 do you mind having a look here and sharing any feedback on this change? Thanks! |
This is such a refreshing PR! Love to see this. I am in full support of this new approach. I added one suggestion - but leave it up to you if you think it's valuable to include or not. My main thoughts on this are:
|
Thanks!
It seems that this test is currently failing on this and many other PRs, unit tests are failing on main as well. I'd be happy to see this merged since resolving the conflicts with ongoing doc fix PRs is a hassle. Let me know if there is anything else blocking this. Thanks again for all the great feedback and the welcoming atmosphere :) |
Awesome! Lgtm 🙂 |
Hello, this is a great improvement! LGTM too. |
Amazing job @dontgoto. I see the |
This PR brings the runtime of the docstring check CI down to 2-3 minutes from about 20 minutes.
Currently,
check_code.sh docstring
does multiple calls tovalidate_docstrings.py
due to various error types that have function exceptions. Each run ofvalidate_docstrings
takes about 2-3 minutes, leading to the current 20 minutes runtime just for the docstring checks.The runtime for consecutive calls is brought down to that of a single call by changing the argument parsing of
validate_docstrings
, adding--
to separate parameters that previously were separate function calls. Additionally, a cache is added to reuse the parsing results between runs. The cache size should pose no problem, the Python instance ran by `check_code.sh docstring' only reserves about 95MB of memory on my machine.