Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Utf8Decoder operator== handling of the last code point in the input #1316

Merged
merged 2 commits into from
Jan 5, 2024

Conversation

BillyONeal
Copy link
Member

While working on diagnostics for #1210 I observed that we were printing the caret ^ in the wrong place when it goes after the input.

The way this works is we form the line of text to print, then decode the unicode encoding units, and when we hit the target, we stop and print ^:

res.append_indent().append(msgFormattedParseMessageExpression, msg::value = line);
res.append_raw('\n');
auto caret_point = StringView{location.start_of_line.pointer_to_current(), location.it.pointer_to_current()};
auto formatted_caret_point = msg::format(msgFormattedParseMessageExpression, msg::value = caret_point);
std::string caret_string;
caret_string.reserve(formatted_caret_point.data().size());
for (char32_t ch : Unicode::Utf8Decoder(formatted_caret_point))
{
if (ch == '\t')
caret_string.push_back('\t');
else if (Unicode::is_double_width_code_point(ch))
caret_string.append(" ");
else
caret_string.push_back(' ');
}
caret_string.push_back('^');

however, if the intended location for the ^ is the "end" of the line, we hit this bug:

return lhs.next_ == rhs.next_;

The iterator only compares the last_ pointers, but both the "points at the last code point in the input" and "points to the end of the input" state set next_ == last_. See:

if (next_ == last_)
{
current_ = end_of_file;
return utf8_errc::NoError;
}

This means that the points to the end and points one past the end iterator compare equal, so the loop in parse.cpp stops one position too early.

Also adds a bunch of testing for this specific case, for other parts of Utf8Decoder, adds a way to parse the first code point without failing, makes all the operators 'hidden friends', and removes localized strings for bugs-in-vcpkg-itself.

While working on diagnostics for microsoft#1210 I observed that we were printing the caret ^ in the wrong place when it goes after the input.

The way this works is we form the line of text to print, then decode the unicode encoding units, and when we hit the target, we stop and print ^:

https://github.com/microsoft/vcpkg-tool/blob/5b8f9c40dd9d6a07ec636590e78c2f49d28624b9/src/vcpkg/base/parse.cpp#L51-L68

however, if the intended location for the ^ is the "end" of the line, we hit this bug:

https://github.com/microsoft/vcpkg-tool/blob/5b8f9c40dd9d6a07ec636590e78c2f49d28624b9/src/vcpkg/base/unicode.cpp#L273

The iterator only compares the last_ pointers, but both the "points at the last code point in the input" and "points to the end of the input" state set `next_ == last_`. See:

https://github.com/microsoft/vcpkg-tool/blob/5b8f9c40dd9d6a07ec636590e78c2f49d28624b9/src/vcpkg/base/unicode.cpp#L222-L226

This means that the points to the end and points one past the end iterator compare equal, so the loop in parse.cpp stops one position too early.

Also adds a bunch of testing for this specific case, for other parts of Utf8Decoder, adds a way to parse the first code point without failing, makes all the operators 'hidden friends', and removes localized strings for bugs-in-vcpkg-itself.
Copy link
Contributor

@Thomas1664 Thomas1664 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noexcept could be added at more places.

include/vcpkg/base/unicode.h Outdated Show resolved Hide resolved
include/vcpkg/base/unicode.h Outdated Show resolved Hide resolved
@BillyONeal BillyONeal merged commit 4e75e1c into microsoft:main Jan 5, 2024
5 checks passed
@BillyONeal BillyONeal mentioned this pull request Jan 5, 2024
@BillyONeal BillyONeal deleted the fix-unicode branch January 9, 2024 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants