Validate continuation bytes during code point iteration #2855

nvmkuruc · 2023-12-09T23:24:03Z

Description of Change(s)

This change ensures that only continuation bytes (bytes whose two high bits are `10`) are consumed when the code point iterator is incremented. Consider an algorithm that searches for the ASCII character `.` in a string. One that iterates over a UTF-8 string byte by byte (using, say, `std::string::find`) checks every byte individually and always finds any `.` characters. One that iterates over code points could miss a valid `.` that follows an invalid, incomplete 2-, 3-, or 4-byte UTF-8 sequence if continuation bytes are not checked before incrementing, because the iterator would skip over the `.` as though it belonged to that sequence.
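To make the failure mode concrete, here is a minimal sketch of the two iteration strategies. All names (`IsContinuationByte`, `ExpectedLength`, `NextUnchecked`, `NextValidated`, `FindAscii`) are hypothetical and chosen for illustration; this is not the OpenUSD implementation, and it only checks bit patterns, not overlong or out-of-range encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not the OpenUSD API.

// A continuation byte has the bit pattern 10xxxxxx.
inline bool IsContinuationByte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Sequence length implied by the lead byte's bit pattern. Bytes that
// cannot start a sequence are treated as length 1 so iteration can
// resynchronize on the following byte.
inline std::size_t ExpectedLength(unsigned char lead) {
    if (lead < 0x80) return 1;                  // 0xxxxxxx
    if (lead >= 0xC0 && lead < 0xE0) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead < 0xF0) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead < 0xF8) return 4;  // 11110xxx
    return 1;
}

// Trusts the lead byte and skips the full implied length.
inline std::size_t NextUnchecked(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    const std::size_t next =
        pos + ExpectedLength(static_cast<unsigned char>(s[pos]));
    return next < s.size() ? next : s.size();
}

// Consumes trailing bytes only while they really are continuation bytes.
inline std::size_t NextValidated(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    const std::size_t len =
        ExpectedLength(static_cast<unsigned char>(s[pos]));
    std::size_t next = pos + 1;
    for (std::size_t i = 1; i < len && next < s.size(); ++i, ++next) {
        if (!IsContinuationByte(static_cast<unsigned char>(s[next]))) {
            break;  // Stop before a byte that starts something else.
        }
    }
    return next;
}

// Searches for an ASCII character, visiting only the positions that the
// given stepping function treats as code point boundaries.
template <typename Step>
std::size_t FindAscii(const std::string& s, char target, Step step) {
    for (std::size_t pos = 0; pos < s.size(); pos = step(s, pos)) {
        if (s[pos] == target) return pos;
    }
    return std::string::npos;
}
```

With the input `"\xE2\x82."` (a 3-byte lead, one continuation byte, then `.`), `FindAscii` with `NextUnchecked` steps straight past the `.`, while with `NextValidated` it stops at index 2 and finds it.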

This change also switches the encoding length checks from shifts to range comparisons. Many compilers generate the same or similar assembly for both forms, but GCC 9.3 emits an additional `movzx` for the shift version. Since that is the compiler currently targeted by OpenUSD's Linux build (and #2673 already made a similar switch from shift to range checks), it seems prudent to make the same switch here.
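For readers unfamiliar with the two formulations, the sketch below shows what a shift-based and a range-based length check might look like. The function names are hypothetical and the bodies are an assumption about the general shape of such checks, not a copy of the code in this PR; the generated assembly depends on compiler and flags and is not reproduced here.

```cpp
#include <cassert>

// Hypothetical names for illustration only; not the OpenUSD API.

// Shift-based: compare the high bits of the lead byte after shifting.
inline int EncodingLengthShift(unsigned char lead) {
    if ((lead >> 7) == 0x00) return 1;  // 0xxxxxxx
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 0;                           // not a lead byte
}

// Range-based: the same bit patterns expressed as value ranges.
inline int EncodingLengthRange(unsigned char lead) {
    if (lead < 0x80) return 1;                  // 0x00..0x7F
    if (lead >= 0xC0 && lead < 0xE0) return 2;  // 0xC0..0xDF
    if (lead >= 0xE0 && lead < 0xF0) return 3;  // 0xE0..0xEF
    if (lead >= 0xF0 && lead < 0xF8) return 4;  // 0xF0..0xF7
    return 0;                                   // not a lead byte
}

// Exhaustively checks that both formulations agree on every byte value.
inline void AssertLengthChecksAgree() {
    for (int b = 0; b < 256; ++b) {
        const unsigned char lead = static_cast<unsigned char>(b);
        assert(EncodingLengthShift(lead) == EncodingLengthRange(lead));
    }
}
```

Because the two functions return the same result for all 256 byte values, choosing between them is purely a question of readability and generated code.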

Fixes Issue(s)

I have verified that all unit tests pass with the proposed changes

I have submitted a signed Contributor License Agreement

jesschimein · 2023-12-11T23:52:03Z

Filed as internal issue #USD-9066

nvmkuruc changed the title ~~Validate continuation bytes during code point iteration.~~ Validate continuation bytes during code point iteration Dec 9, 2023

nvmkuruc force-pushed the continuationbytes branch 8 times, most recently from 3c9a6d7 to 466d53e Compare December 11, 2023 05:28

nvmkuruc mentioned this pull request Dec 11, 2023 in #2858, Add class to facilitate serialization and validation of code points (merged)

Validate continuation bytes during code point iteration.

4a0caef

nvmkuruc force-pushed the continuationbytes branch from 466d53e to 4a0caef Compare December 12, 2023 01:08

pixar-oss merged commit a953c7d into PixarAnimationStudios:dev Jan 6, 2024
5 checks passed
