Memoize text width #6552

Merged
merged 2 commits into from
Sep 6, 2023

Conversation

MichaReiser
Member

@MichaReiser MichaReiser commented Aug 14, 2023

Summary

The formatter processes strings a couple of times:

  • propagate_expands: Test if the string contains a newline and, if so, set mode: Propagated on all enclosing groups
  • print_text: Measures the width of the text to compute the line_width
  • fits_text (only if the text is inside of a group): Measures the width of the text to determine if the content fits. Runs n times where n is the number of parent groups.

This PR adds a TextWidth field to all text FormatElements that stores either the computed width or a marker that the text is multiline.
This reduces the traversal for non-multiline strings to exactly once. Multiline strings still get traversed multiple times and you could argue this PR makes it worse because the implementation now computes the text width only to throw it away when realizing it is a multiline string. However, multiline strings are rare.
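
A minimal sketch of the memoization idea, not the code from this PR: the width is computed once when the text element is created, and later passes read the stored value instead of rescanning the string. `char_width` is a hypothetical stand-in that counts every character as one column (the snippets in this review show the real code using a configurable tab width and a Unicode width lookup), and `Element`, `has_newline`, and `fits` are names made up for the illustration.

```rust
/// Width of a text element, computed once when the element is created.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum TextWidth {
    /// Text without a line break; stores its width in columns.
    Width(u32),
    /// Text containing at least one '\n'.
    Multiline,
}

/// Hypothetical stand-in: every character counts as one column.
fn char_width(_c: char) -> u32 {
    1
}

impl TextWidth {
    /// Single pass over the text; stops at the first newline.
    pub fn from_text(text: &str) -> TextWidth {
        let mut width = 0u32;
        for c in text.chars() {
            if c == '\n' {
                return TextWidth::Multiline;
            }
            width = width.saturating_add(char_width(c));
        }
        TextWidth::Width(width)
    }
}

/// Hypothetical text element that carries its memoized width.
pub struct Element<'a> {
    pub text: &'a str,
    pub width: TextWidth,
}

impl<'a> Element<'a> {
    pub fn new(text: &'a str) -> Self {
        Element { text, width: TextWidth::from_text(text) }
    }

    /// What a `propagate_expands`-style pass needs: no string scan.
    pub fn has_newline(&self) -> bool {
        self.width == TextWidth::Multiline
    }

    /// What a `fits_text`-style check needs: constant time for single-line
    /// text, however many enclosing groups ask.
    pub fn fits(&self, remaining_columns: u32) -> bool {
        match self.width {
            TextWidth::Width(w) => w <= remaining_columns,
            // Simplification for this sketch: multiline text never fits.
            TextWidth::Multiline => false,
        }
    }
}
```

With this shape, the string is walked once in `Element::new`; every later query is a comparison on the stored value.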

Memoizing the text width improves performance by 2-12%.

The downsides of this change are:

  • It will make it harder to add more fields to Text FormatElements because we are now very close to 24 bytes
  • The same logic is now spread across multiple places, increasing the risk of off-by-one bugs

Test Plan

The tests are currently failing because I need to update them to pass TextWidth.

@MichaReiser MichaReiser force-pushed the printer-reserve-buffer-upfront branch from a97a59a to 3076d97 Compare August 14, 2023 08:26
@MichaReiser MichaReiser force-pushed the printer-reserve-buffer-upfront branch 2 times, most recently from 2db43a5 to 493a3b6 Compare August 14, 2023 12:03
Base automatically changed from printer-reserve-buffer-upfront to main August 14, 2023 12:15
@konstin konstin added the formatter Related to the formatter label Aug 14, 2023
@MichaReiser MichaReiser changed the base branch from main to token-element September 1, 2023 19:16
@MichaReiser MichaReiser force-pushed the memoize-text-width branch 5 times, most recently from eb8d51f to fb43ed9 Compare September 1, 2023 22:08
@MichaReiser MichaReiser marked this pull request as ready for review September 2, 2023 07:49
Base automatically changed from token-element to main September 2, 2023 08:05
@MichaReiser MichaReiser added the performance Potential performance improvement label Sep 2, 2023
@MichaReiser MichaReiser requested a review from konstin September 5, 2023 08:21
crates/ruff_formatter/src/format_element.rs
'\t' => tab_width.value(),
'\n' => return TextWidth::Multiline,
#[allow(clippy::cast_possible_truncation)]
c => c.width().unwrap_or(0) as u32,
Member

When does c not have a width, and why would we go with 0 instead? Is this about control characters?

Member Author

@MichaReiser MichaReiser Sep 6, 2023

From the unicode-width documentation:

> Returns the character's displayed width in columns, or None if the character is a control character other than '\x00'.

So yes, this is about control characters, and using 0 seems reasonable to me (this is the same logic the printer applies today).
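
To illustrate the fallback, here is a sketch that does not depend on the unicode-width crate. `display_width` is a hypothetical stand-in for the crate's per-character width lookup and only models the case discussed here: it returns `None` for control characters other than '\x00' and, as a simplifying assumption, one column for everything else (which is wrong for wide or zero-width characters). `column_width` and its `tab_width` parameter are also names invented for this example.

```rust
/// Hypothetical stand-in for a Unicode width lookup.
fn display_width(c: char) -> Option<u32> {
    match c {
        '\x00' => Some(0),
        c if c.is_control() => None,
        // Assumption: ignores wide and zero-width characters.
        _ => Some(1),
    }
}

/// Returns `None` for multiline text, otherwise the width in columns.
pub fn column_width(text: &str, tab_width: u32) -> Option<u32> {
    let mut width = 0u32;
    for c in text.chars() {
        let char_width = match c {
            '\t' => tab_width,
            '\n' => return None,
            // Control characters report no width; count them as zero columns.
            c => display_width(c).unwrap_or(0),
        };
        width = width.saturating_add(char_width);
    }
    Some(width)
}
```

In this sketch a bell character (`'\x07'`) inside a string contributes nothing to the measured width, while a tab contributes the configured tab width.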

crates/ruff_formatter/src/printer/mod.rs
self.state.line_width = 0;
continue;
Text::Text { text, width } => {
if let Some(width) = width.width() {
Member

nit: i think i'd use a match here

Member Author

Me too, but our pedantic clippy rule doesn't allow me to.
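
For context, the branch under discussion can be sketched roughly as follows. This is a simplified illustration, not the printer code from the PR: `PrinterState`, `print_text`, and the one-column-per-character slow path are assumptions. When the memoized width is known, the printer adds it directly; only multiline text is walked character by character so that the line width resets after each newline.

```rust
#[derive(Copy, Clone)]
pub enum TextWidth {
    Width(u32),
    Multiline,
}

impl TextWidth {
    /// `Some(width)` for single-line text, `None` for multiline text.
    pub fn width(self) -> Option<u32> {
        match self {
            TextWidth::Width(width) => Some(width),
            TextWidth::Multiline => None,
        }
    }
}

/// Hypothetical, heavily reduced printer state.
#[derive(Default)]
pub struct PrinterState {
    pub buffer: String,
    pub line_width: u32,
}

pub fn print_text(state: &mut PrinterState, text: &str, width: TextWidth) {
    state.buffer.push_str(text);

    if let Some(width) = width.width() {
        // Fast path: reuse the memoized width, no scan of `text`.
        state.line_width += width;
    } else {
        // Slow path: multiline text still has to be traversed.
        for c in text.chars() {
            if c == '\n' {
                state.line_width = 0;
                continue;
            }
            // Assumption: one column per character.
            state.line_width += 1;
        }
    }
}
```

A `match` on `width.width()` with `Some(width)` and `None` arms would behave identically; the `if let ... else` form is the one the lint configuration accepts according to the reply above.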

@konstin
Member

konstin commented Sep 5, 2023

Sorry i had missed this PR.

I'll try to trigger codspeed for perf numbers.

@konstin konstin closed this Sep 5, 2023
@konstin konstin reopened this Sep 5, 2023
@MichaReiser
Member Author

> Sorry i had missed this PR.
>
> I'll try to trigger codspeed for perf numbers.

CodSpeed only comments on regressions, but you can see the results in the run summary: https://github.com/astral-sh/ruff/pull/6552/checks?check_run_id=16438240046

@konstin
Member

konstin commented Sep 5, 2023

3-12%, that's huge!

/// This imprecision shouldn't matter in practice because such texts are longer than any configured line width
/// and thus should break.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub struct Width(NonZeroU32);
Member Author

I added this newtype wrapper to lock down access to the inner NonZeroU32, so that nobody reads it directly and ends up with values that are off by one.
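
A sketch of how such a wrapper could work, under the assumption that the width is stored as `width + 1` so that a width of zero remains representable in a `NonZeroU32`. The method names `new` and `value` and the enclosing `TextWidth` enum are illustrative and not taken from the diff.

```rust
use std::num::NonZeroU32;

/// Width in columns, stored internally as `width + 1`.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub struct Width(NonZeroU32);

impl Width {
    pub fn new(width: u32) -> Self {
        // Saturates at u32::MAX, so the largest input loses one column.
        match NonZeroU32::new(width.saturating_add(1)) {
            Some(stored) => Width(stored),
            None => unreachable!("saturating_add(1) never yields 0"),
        }
    }

    /// Undoes the +1 offset; callers never see the raw stored value.
    pub fn value(self) -> u32 {
        self.0.get() - 1
    }
}

/// Illustrative enum that can reuse the unused zero value of `Width`.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub enum TextWidth {
    Width(Width),
    Multiline,
}
```

Keeping the field private means the offset is applied and removed in exactly one place. The only imprecision is at the top of the range, where `Width::new(u32::MAX).value()` yields `u32::MAX - 1`. The non-zero representation presumably also helps keep the enum small, since the compiler can typically use the unused zero value for the `Multiline` variant.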

@MichaReiser MichaReiser enabled auto-merge (squash) September 6, 2023 07:03
@MichaReiser
Member Author

> 3-12%, that's huge!

The win is bigger for:

a) Formats that use more groups or best-fitting elements, because the printer has to measure fits for every group (there's some optimisation, but the worst-case scenario is that it computes it for every group)
b) Code that has many lines exceeding the line width (with nested groups), because the printer then has to measure fits for every inner group
