Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Provide _is_utf8_charset() in compat.php for early use. #7052

Closed

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Jul 17, 2024

Trac ticket: Core-61680

When is_utf8_charset() was introduced, the mb_strlen() and mb_substr() compat functions were modified to call it, but they are defined in compat.php before is_utf8_charset() is defined in functions.php.

Certain code calling these compat functions early in the boot process before functions.php is included and on hosts without the multi-byte extension would thus crash.

In this patch the is_utf8_charset() function is split into pure and stateful components. The pure version is recreated as _is_utf8_charset() and defined in compat.php while the existing function (which defaults to calling get_option( 'blog_charset' )) is left in place in functions.php. This ensures that code calling it will be able to call a form of the function even in early sequences.

Follow-up to [58169].

Props dmsnell, donncha, hellofromTonya, jeherve, slyall, spacedmonkey.
Fixes #61680.

When `is_utf8_charset()` was introduced, the `mb_strlen()` and
`mb_substr()` compat functions were modified to call it, but they
are defined in `compat.php` before `is_utf8_charset()` is defined in
`functions.php`.

Certain code calling these compat functions early in the boot process
before `functions.php` is included and on hosts without the multi-byte
extension would thus crash.

In this patch the `is_utf8_charset()` function is split into pure and
stateful components. The pure version is recreated as
`_is_utf8_charset()` and defined in `compat.php` while the existing
function (which defaults to calling `get_option( 'blog_charset' )`) is
left in place in `functions.php`. This ensures that code calling it will
be able to call a form of the function even in early sequences.

Follow-up to [58169].

Props dmsnell, donncha, hellofromTonya, jeherve, slyall, spacedmonkey.
Fixes #61680.
Copy link

github-actions bot commented Jul 17, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jorbin, jonsurrell, joemcgill.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@WordPress WordPress deleted a comment from github-actions bot Jul 17, 2024
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks like it will fix the issue. I've been unsuccessful reproducing the original issue so far, so it's difficult for me to confirm the fix.

@@ -40,6 +40,40 @@ function _wp_can_use_pcre_u( $set = null ) {
return $utf8_pcre;
}

/**
* Indicates if a given slug for a character set represents the UTF-8 text encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other functions in this file mention that they're for internal use, like this:

Internal compat function to mimic mb_substr().

Should we mention that this is an internal function only intended to prevent load-order issues and that is_utf8_charset should be used in most cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, but also I don't see a reason why this shouldn't be made available 🤷‍♂️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_utf8_charset is the API we expect folks to use, not this new function. To me, that's a good reason this should include @ignore and be noted as an internal compact function

I also think we need a note for why this is here since it's not a compat function like everything else

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested this and done code review. I'm approving on those grounds.

I haven't been able to reproduce, so I haven't confirmed the fix.

Copy link
Member

@joemcgill joemcgill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should fix the issue as reported in Automattic/jetpack#38358. However, a similar problem can happen any time a compat function in this file is called by code prior to when functions.php is loaded. For example, if _mb_substr() is called with $encoding set to null then a similar undefined function error would be triggered when get_option() is called, since the option.php file is loaded at the top of functions.php.

Maybe a better strategy for these polyfills is to always do a function_exists() check before calling any WordPress functions that might be loaded after these polyfills are available and providing a graceful fallback when the function is not yet loaded.

@dmsnell
Copy link
Member Author

dmsnell commented Jul 17, 2024

Thanks @joemcgill - I also noted the apparent call to get_option() but I didn't want to address it in this bug-fix. I could see value in calling function_exists() first, but also I don't see the need to perform this check at runtime when mostly they will always exist.

My initial reaction was that perhaps we could create a new CI task to check if anything in compat() is calling code that hasn't yet been defined. This, of course, would flag the mb_ functions 🤷‍♂️

In any case the situations where these things run is rare enough. I would imagine that we could get away with the runtime checks and have it not matter much. IMO, good work to discuss in an enhancement.

@joemcgill
Copy link
Member

I also noted the apparent call to get_option() but I didn't want to address it in this bug-fix.

Totally agree that handling that separately makes sense.

I could see value in calling function_exists() first, but also I don't see the need to perform this check at runtime when mostly they will always exist.

This is the tricky part. While I appreciate not calling function_exists() when it's not needed in a majority of cases, it seems like the alternative is to either accept the potential fatals (expecting third party code that trigger these fatals to fix the issue) or do what you have done here, which is to create additional compat shims for those functions that splits out parts of the implementation here. For maintainability, I have some reservations to creating private stateless versions of any WP functions that are called from this file.

@dmsnell
Copy link
Member Author

dmsnell commented Jul 17, 2024

Thanks @joemcgill

For maintainability, I have some reservations to creating private stateless versions of any WP functions that are called from this file.

Would you care to share more about this? From my perspective, fixing this makes the code better by introducing this split, particularly since is_utf8_charset() is calling the more testable unit.

The only complication I see is that it's in compat.php, but then again, it's addressing some very low-level stuff required by other compatability code.

I considered duplicating the logic of this function here directly, but given that it was necessary in both of the _mb_substr() and _mb_strlen() functions, I thought it was more susceptible to problems than creating a new function would leave them.


Do you see this as a blocker to the patch as proposed?


Without trying to derail this conversation, I'm only more convinced each week that a good future direction for Core is UTF-8 everywhere, and to that end, some UTF-8 functionality is missing, and there have been many problems lately related to whether the multibyte extension is available.

I've been exploring a native UTF-8 handling system in #6883 and think it has merit. Similar to how the HTML API has replaced in user-space some PHP-provided functionality, incorporating our own UTF-8 encoding and decoding would let us finally get rid of a lot of needless and inconsistent checks and conversions.

I've been exploring some of this code inside Automattic too. When I feel it's ready for actual discussion I'll start posting on Make about it.

@joemcgill
Copy link
Member

Would you care to share more about this?

Sure! Besides the use of an option value, is_utf8_charset() is a pretty simple function currently. Splitting the implementation into two different functions which exist in two separate files could be a bit confusing for future maintainers.

I don't see that maintainability concern as a blocker.

However, the other problem that I just noticed when looking at this again, is that now any time the _mb_substr() and _mb_strlen() polyfills are used, they will not be able to make use of the site's option setting, even though in a majority of cases, the need to avoid the get_option() call is unnecessary and unexpected. That seems like a bigger problem. That does seem like something we want to avoid.

@aaronjorbin
Copy link
Member

I think there are two points (in addition to the UTF-8 everwhere one):

  1. We need to figure out a way to prevent issues like this in the future.

  2. The fatal at hand.

I think we should focus on 2 right now and create a separate core ticket and one where we can at a minimum include some documentation at the top of the file and ideally include some automated tests as well.

@@ -40,6 +40,40 @@ function _wp_can_use_pcre_u( $set = null ) {
return $utf8_pcre;
}

/**
* Indicates if a given slug for a character set represents the UTF-8 text encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_utf8_charset is the API we expect folks to use, not this new function. To me, that's a good reason this should include @ignore and be noted as an internal compact function

I also think we need a note for why this is here since it's not a compat function like everything else

@@ -7503,20 +7503,7 @@ function get_tag_regex( $tag ) {
* @return bool Whether the slug represents the UTF-8 encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should include an @see _is_utf8_charset since this is now essentially a wrapper function

@aaronjorbin
Copy link
Member

Follow up ticket: https://core.trac.wordpress.org/ticket/61694

@dmsnell
Copy link
Member Author

dmsnell commented Jul 18, 2024

Merged in [58764]
5a30482

@dmsnell dmsnell closed this Jul 18, 2024
@dmsnell dmsnell deleted the fix/move-is-utf8-charset-earlier branch July 18, 2024 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants