Fix Utf8Decoder operator== handling of the last code point in the input #1316

BillyONeal · 2024-01-03T21:48:43Z

While working on diagnostics for #1210 I observed that we were printing the caret ^ in the wrong place when it goes after the input.

The way this works is we form the line of text to print, then decode the unicode encoding units, and when we hit the target, we stop and print ^:

vcpkg-tool/src/vcpkg/base/parse.cpp

Lines 51 to 68 in 5b8f9c4

    
           res.append_indent().append(msgFormattedParseMessageExpression, msg::value = line); 
        
           res.append_raw('\n'); 
        
           auto caret_point = StringView{location.start_of_line.pointer_to_current(), location.it.pointer_to_current()}; 
        
           auto formatted_caret_point = msg::format(msgFormattedParseMessageExpression, msg::value = caret_point); 
        
           std::string caret_string; 
        
           caret_string.reserve(formatted_caret_point.data().size()); 
        
           for (char32_t ch : Unicode::Utf8Decoder(formatted_caret_point)) 
        
           { 
        
               if (ch == '\t') 
        
                   caret_string.push_back('\t'); 
        
               else if (Unicode::is_double_width_code_point(ch)) 
        
                   caret_string.append("  "); 
        
               else 
        
                   caret_string.push_back(' '); 
        
           } 
        
           caret_string.push_back('^');

however, if the intended location for the ^ is the "end" of the line, we hit this bug:

vcpkg-tool/src/vcpkg/base/unicode.cpp

Line 273 in 5b8f9c4

return lhs.next_ == rhs.next_;

The iterator only compares the last_ pointers, but both the "points at the last code point in the input" and "points to the end of the input" state set next_ == last_. See:

vcpkg-tool/src/vcpkg/base/unicode.cpp

Lines 222 to 226 in 5b8f9c4

    
           if (next_ == last_) 
        
           { 
        
               current_ = end_of_file; 
        
               return utf8_errc::NoError; 
        
           }

This means that the points to the end and points one past the end iterator compare equal, so the loop in parse.cpp stops one position too early.

Also adds a bunch of testing for this specific case, for other parts of Utf8Decoder, adds a way to parse the first code point without failing, makes all the operators 'hidden friends', and removes localized strings for bugs-in-vcpkg-itself.

While working on diagnostics for microsoft#1210 I observed that we were printing the caret ^ in the wrong place when it goes after the input. The way this works is we form the line of text to print, then decode the unicode encoding units, and when we hit the target, we stop and print ^: https://github.com/microsoft/vcpkg-tool/blob/5b8f9c40dd9d6a07ec636590e78c2f49d28624b9/src/vcpkg/base/parse.cpp#L51-L68 however, if the intended location for the ^ is the "end" of the line, we hit this bug: https://github.com/microsoft/vcpkg-tool/blob/5b8f9c40dd9d6a07ec636590e78c2f49d28624b9/src/vcpkg/base/unicode.cpp#L273 The iterator only compares the last_ pointers, but both the "points at the last code point in the input" and "points to the end of the input" state set `next_ == last_`. See: https://github.com/microsoft/vcpkg-tool/blob/5b8f9c40dd9d6a07ec636590e78c2f49d28624b9/src/vcpkg/base/unicode.cpp#L222-L226 This means that the points to the end and points one past the end iterator compare equal, so the loop in parse.cpp stops one position too early. Also adds a bunch of testing for this specific case, for other parts of Utf8Decoder, adds a way to parse the first code point without failing, makes all the operators 'hidden friends', and removes localized strings for bugs-in-vcpkg-itself.

Thomas1664

noexcept could be added at more places.

include/vcpkg/base/unicode.h

Thomas1664 reviewed Jan 4, 2024

View reviewed changes

include/vcpkg/base/unicode.h Outdated Show resolved Hide resolved

include/vcpkg/base/unicode.h Outdated Show resolved Hide resolved

ras0219-msft approved these changes Jan 4, 2024

View reviewed changes

Add noexcepts as requested by @Thomas1664

5660ebd

BillyONeal merged commit 4e75e1c into microsoft:main Jan 5, 2024
5 checks passed

BillyONeal mentioned this pull request Jan 5, 2024

Improve Unicode perf. #1318

Merged

BillyONeal deleted the fix-unicode branch January 9, 2024 18:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Utf8Decoder operator== handling of the last code point in the input #1316

Fix Utf8Decoder operator== handling of the last code point in the input #1316

BillyONeal commented Jan 3, 2024

Thomas1664 left a comment

	res.append_indent().append(msgFormattedParseMessageExpression, msg::value = line);
	res.append_raw('\n');

	auto caret_point = StringView{location.start_of_line.pointer_to_current(), location.it.pointer_to_current()};
	auto formatted_caret_point = msg::format(msgFormattedParseMessageExpression, msg::value = caret_point);

	std::string caret_string;
	caret_string.reserve(formatted_caret_point.data().size());
	for (char32_t ch : Unicode::Utf8Decoder(formatted_caret_point))
	{
	if (ch == '\t')
	caret_string.push_back('\t');
	else if (Unicode::is_double_width_code_point(ch))
	caret_string.append(" ");
	else
	caret_string.push_back(' ');
	}
	caret_string.push_back('^');

	if (next_ == last_)
	{
	current_ = end_of_file;
	return utf8_errc::NoError;
	}

Fix Utf8Decoder operator== handling of the last code point in the input #1316

Fix Utf8Decoder operator== handling of the last code point in the input #1316

Conversation

BillyONeal commented Jan 3, 2024

Thomas1664 left a comment

Choose a reason for hiding this comment