-
Notifications
You must be signed in to change notification settings - Fork 865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup line_offset property #1392
Conversation
* Replace dynamic regex with string find operation * Add cache of where each line starts so we don't have quadratic behavior identifying line numbers when importing large chunks of html
Forgot to remove old implementation
Thanks for triggering CI. Based on its feedback, I have
In rewording this comment, I realize that there is a slight change in behavior of this function when |
@waylan I am not as familiar with the HTML parsing that you implemented, can a buffer in this situation change on you invalidating the cached offset?
Generally, I would say that to have a better chance of getting merged quickly, backwards compatible would be better. When something backwards incompatible is introduced, we would need to be even more cautious to ensure this doesn't inadvertently break some other behavior that is going to cause us more time to debug later. While the speedup is certainly nice, I would definitely want to ensure this hasn't introduced some unexpected parsing behavior that is going to break 3rd party plugins that aren't immediately obvious. |
I guess if the change could be somehow identified as inadvertent behavior, then it would make sense to not retain backwards compatibility. |
My memory is very hazy about this. That said, the code that this PR replaces is discussed in #1066 and/or #1068 and was committed in #1069. It appears that we added tests to cover the various edge cases.
Yes, we want the exact same behavior. It is possible that a file may not end with a newline and that would have been taken into consideration with the existing implementation. Therefore, we would need the method to always return the position of a |
In short yes. See this comment (and the few that follow) for an explanation. As a reminder, we are using the standard library HTML parser to extract HTML from non-HTML and need to hack it to make it work with any non-HTML which contains angle brackets.
Actually, what I should have said is that the method always must return the beginning of a line. EOF is only the beginning of a (empty) line if the buffer ends with a newline. If it does not end with a newline, the EOF is not the beginning of a line and it would be an error to return that position. In fact, looking at the example given in the above linked comment, it is possible that returning EOF could fail to address the purpose of this hack altogether. |
This poses a problem then. If the cache can be invalidated without us knowing, then caching is likely a bad idea. |
Sorry, I think I misread your question. No, the buffer doesn't change. However, the parser jumps too far ahead and we need to back it up. What is being cached is the last position. The new position we back it up to should always be after the last cached position. And now I'm wondering if I am wrong about what to do if no more newlines are found. |
The old behavior is actually a bit more complex than that. Here's a sketch of the current (before my patch) implementation:
I suspect that there's a bug in the existing code and that we should return ed: remove caching invalidation discussion |
What is curious is why no tests are failing. Unless the character at the |
I just took another look at this. Specifically, the behavior when no newlines are found. I confirmed that we have no tests for this edge case. In fact, the existing code has a I don't recall if I added the comment because I couldn't come up with a scenario which triggered that code path, or because I was focusing on other things and expected to come back to it later. In any event, we should probably try to come up with a test for this. If we do, then that should dictate how it should behave. If not, then I'm not going to be too concerned about it. |
So, the concern (how the HTML parser could get into the state where the line number is greater than the total number of lines) is described in this comment. However, among various similar tests, the following test exists: markdown/tests/test_syntax/blocks/test_html_blocks.py Lines 1119 to 1139 in 4f0b91a
I just tried many other variations and I cannot get that condition. I expect that it would take a very unusual edge case (with very poorly formed HTML) to ever get that condition. Therefore, I'm not going to sweet it. |
When using
mkdocs
, I noticed huge runtimes and tracked these down toHTMLExtractor.line_offset
taking all the time. Before this patch, a run ofmkdocs
could take 500+ seconds, with this patch, it takes ~10 seconds.Flamegraph before optimization:
Flamegraph after optimization:
Note: It may be the case that the full cache of start-positions of all lines isn't needed, and just a single entry is sufficient; I didn't explore this optimization.