Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConPTY mangles U+1F600 to U+FFFD #2770

Closed
chris-morgan opened this issue Sep 16, 2019 · 11 comments
Closed

ConPTY mangles U+1F600 to U+FFFD #2770

chris-morgan opened this issue Sep 16, 2019 · 11 comments
Assignees
Labels
Area-Input Related to input processing (key presses, mouse, etc.) Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Product-Conpty For console issues specifically related to conpty Product-WSL Issue that should probably go to Microsoft/WSL
Milestone

Comments

@chris-morgan
Copy link

Environment

Windows build number: 10.0.18362.356
Windows Terminal version: 0.4.2382.0

Also reproducible with Alacritty on alacritty/alacritty#2438 with its ConPTY backend (enable_experimental_conpty_backend: true), and not with its WinPTY backend.

Steps to reproduce

Type U+1F600. (Ways of achieving this are discussed below.)

Expected behavior

U+1F600, 😀, should be the input.

Actual behavior

U+FFFD, REPLACEMENT CHARACTER, gets sent through instead.

This doesn’t affect most characters. This is the only such character that I regularly use that is affected. U+1F600 and U+1F700 are two examples of affected characters, while U+1F5FF and U+1F601 are examples of characters that are not affected.


How to reproduce if you’re not sure

If you don’t have an IME that lets you insert such characters, you can install WinCompose and type Compose1f600Enter, or put a line like this into %USERPROFIEL%\.XCompose:

<Multi_key> <colon> <D> : "😀" U1F600

… which will allow you to use Compose:D.

To inspect what’s been sent, I like to use Vim with its ga command.

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Sep 16, 2019
@DHowett-MSFT
Copy link
Contributor

Sorry, does this fail in Windows Terminal? If so: we're seeing it working here. What shell are you using?

Windows PowerShell and CMD have very poor support for high-unicode input, whereas PowerShell Core supports it handily.

We made some fixes to ConPTY that aren't in 19H1, but are in Terminal, which could explain why Alacritty is having trouble.

@DHowett-MSFT DHowett-MSFT added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Sep 16, 2019
@chris-morgan
Copy link
Author

This is failing for me in both Windows Terminal and Alacritty-with-ConPTY, with both cmd and WSL bash as the shell, running both Python (e.g. hex(ord('…'))) and Vim (e.g. i…<Escape>ga) for inspection of input codes.

Command Prompt and Alacritty-with-WinPTY are not experiencing this problem.

@ghost ghost added Needs-Attention The core contributors need to come back around and look at this ASAP. and removed Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something labels Sep 17, 2019
@DHowett-MSFT DHowett-MSFT added Area-Input Related to input processing (key presses, mouse, etc.) Help Wanted We encourage anyone to jump in on these. Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Conpty For console issues specifically related to conpty and removed Needs-Attention The core contributors need to come back around and look at this ASAP. Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Sep 19, 2019
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Sep 19, 2019
@DHowett-MSFT DHowett-MSFT added this to the 20H1 milestone Sep 19, 2019
@DHowett-MSFT DHowett-MSFT removed the Help Wanted We encourage anyone to jump in on these. label Sep 19, 2019
@miniksa
Copy link
Member

miniksa commented Sep 23, 2019

I repro'd this against master today with an Ubuntu tab inside Windows Terminal, the WinCompose utility emitting the U+1F600 on Right Alt+Left Shift+;+D

I've traced through the Windows Terminal project and the U+1F600 character is landing as 0xF0 0x9F 0x98 0x80 (this is the valid UTF-8 per http://www.fileformat.info/info/unicode/char/1f600/index.htm) in the call to send it into PTY's input channel.

Next step is to attach to the underlying PTY and see how it might be losing it.

@miniksa
Copy link
Member

miniksa commented Sep 24, 2019

OK. This goes off the rails here:

bool InteractDispatch::WriteString(_In_reads_(cch) const wchar_t* const pws,
const size_t cch)
{
if (cch == 0)
{
return true;
}
unsigned int codepage = 0;
bool fSuccess = !!_pConApi->GetConsoleOutputCP(&codepage);
if (fSuccess)
{
std::deque<std::unique_ptr<IInputEvent>> keyEvents;
for (size_t i = 0; i < cch; ++i)
{
std::deque<std::unique_ptr<KeyEvent>> convertedEvents = CharToKeyEvents(pws[i], codepage);
std::move(convertedEvents.begin(),
convertedEvents.end(),
std::back_inserter(keyEvents));
}
fSuccess = WriteInput(keyEvents);
}
return fSuccess;
}

In InteractDispatch, we appear to be feeding each piece of the surrogate-pair UTF-16 sequence in one wchar_t at a time.

This heads into

std::deque<std::unique_ptr<KeyEvent>> CharToKeyEvents(const wchar_t wch,
const unsigned int codepage)
{
const short invalidKey = -1;
short keyState = VkKeyScanW(wch);
if (keyState == invalidKey)
{
// Determine DBCS character because these character does not know by VkKeyScan.
// GetStringTypeW(CT_CTYPE3) & C3_ALPHA can determine all linguistic characters. However, this is
// not include symbolic character for DBCS.
WORD CharType = 0;
GetStringTypeW(CT_CTYPE3, &wch, 1, &CharType);
if (WI_IsFlagSet(CharType, C3_ALPHA) || GetQuickCharWidth(wch) == CodepointWidth::Wide)
{
keyState = 0;
}
}
std::deque<std::unique_ptr<KeyEvent>> convertedEvents;
if (keyState == invalidKey)
{
// if VkKeyScanW fails (char is not in kbd layout), we must
// emulate the key being input through the numpad
convertedEvents = SynthesizeNumpadEvents(wch, codepage);
}
else
{
convertedEvents = SynthesizeKeyboardEvents(wch, keyState);
}
return convertedEvents;
}
.

At this point, we're identifying these things as invalid keys and synthesizing numpad events instead. But there's not really a valid numpad equivalent to half of a surrogate pair (

convertedEvents = SynthesizeNumpadEvents(wch, codepage);
).

We did identify above on the GetStringTypeW call

GetStringTypeW(CT_CTYPE3, &wch, 1, &CharType);
that this is the high portion of a surrogate pair. CharType has been filled with 0x800 at this point, which does map to C3_HIGHSURROGATE.

The conversion here ends up becoming effectively just the equivalent input of the ALT button going down and up.

It then does the same thing for the low surrogate half.

This then travels for a bit until it hits InputBuffer::_WriteBuffer here:

void InputBuffer::_WriteBuffer(_Inout_ std::deque<std::unique_ptr<IInputEvent>>& inEvents,
_Out_ size_t& eventsWritten,
_Out_ bool& setWaitEvent)
{
eventsWritten = 0;
setWaitEvent = false;
const bool initiallyEmptyQueue = _storage.empty();
const size_t initialInEventsSize = inEvents.size();
const bool vtInputMode = IsInVirtualTerminalInputMode();
while (!inEvents.empty())
{
// Pop the next event.
// If we're in vt mode, try and handle it with the vt input module.
// If it was handled, do nothing else for it.
// If there was one event passed in, try coalescing it with the previous event currently in the buffer.
// If it's not coalesced, append it to the buffer.
std::unique_ptr<IInputEvent> inEvent = std::move(inEvents.front());
inEvents.pop_front();
if (vtInputMode)
{
const bool handled = _termInput.HandleKey(inEvent.get());
if (handled)
{
eventsWritten++;
continue;
}
}
// we only check for possible coalescing when storing one
// record at a time because this is the original behavior of
// the input buffer. Changing this behavior may break stuff
// that was depending on it.
if (initialInEventsSize == 1 && !_storage.empty())
{
// coalescing requires a deque of events, so push it back onto the front.
inEvents.push_front(std::move(inEvent));
bool coalesced = false;
// this looks kinda weird but we don't want to coalesce a
// mouse event and then try to coalesce a key event right after.
//
// we also pass the whole deque to the coalescing methods
// even though they only want one event because it should
// be their responsibility to maintain the correct state
// of the deque if they process any records in it.
if (_CoalesceMouseMovedEvents(inEvents))
{
coalesced = true;
}
else if (_CoalesceRepeatedKeyPressEvents(inEvents))
{
coalesced = true;
}
if (coalesced)
{
eventsWritten = 1;
return;
}
else
{
// We didn't coalesce the event. pull it from the queue again,
// to keep the state consistent with the start of this block.
inEvent = std::move(inEvents.front());
inEvents.pop_front();
}
}
// At this point, the event was neither coalesced, nor processed by VT.
_storage.push_back(std::move(inEvent));
++eventsWritten;
}
if (initiallyEmptyQueue && !_storage.empty())
{
setWaitEvent = true;
}
}

None of the code finds an appropriate mapping for the ALT so it is just stored into the input buffer.

The input buffer alerts waiting readers that data is available.

The WSL driver calls ReadConsoleInputW and attempts to retrieve any input information. It asks for a single event. We give it back. I can't see any further because I'm user mode debugging and I think the PTY drain got moved into kernel mode.

@miniksa
Copy link
Member

miniksa commented Sep 24, 2019

Typing the 😀 character inserts this into the buffer:
↓ wch:0x0000 '\0' mod:LeftAlt (0x00000002) repeat:0x0001 vk:0x0012 vsc:0x0038
↑ wch:0xd83d '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038
↓ wch:0x0000 '\0' mod:LeftAlt (0x00000002) repeat:0x0001 vk:0x0012 vsc:0x0038
↑ wch:0xde00 '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038

Pasting the 😀 character inserts this into the buffer:
↓ wch:0x0000 '\0' mod:LeftAlt (0x00000002) repeat:0x0001 vk:0x0012 vsc:0x0038
↑ wch:0xd83d '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038
↓ wch:0x0000 '\0' mod:LeftAlt (0x00000002) repeat:0x0001 vk:0x0012 vsc:0x0038
↑ wch:0xde00 '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038

Typing the 😂 character inserts this into the buffer:
↓ wch:0x0000 '\0' mod:LeftAlt (0x00000002) repeat:0x0001 vk:0x0012 vsc:0x0038
↑ wch:0xd83d '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038
↓ wch:0x0000 '\0' mod:LeftAlt (0x00000002) repeat:0x0001 vk:0x0012 vsc:0x0038
↑ wch:0xde02 '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038

Pasting the 😂 character inserts this into the buffer:
↓ wch:0x0000 '\0' mod:LeftAlt (0x00000002) repeat:0x0001 vk:0x0012 vsc:0x0038
↑ wch:0xd83d '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038
↓ wch:0x0000 '\0' mod:LeftAlt (0x00000002) repeat:0x0001 vk:0x0012 vsc:0x0038
↑ wch:0xde02 '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038

😀 does not work.
😂 does work.

So it appears that the buffer state is consistent between typing and pasting, ruling that out.
It also appears that the format of the input messages is the same for the working and non working case.

Now I suspect the WSL TTY code. I'll look there next.

@chris-morgan
Copy link
Author

Huh, does look like something related to WSL TTY code now. I had incorrectly assumed that running cmd.exe was equivalent to running cmd.exe from inside WSL, but it turns out not to be. Shows how poor a grasp I have on the moving parts!

  • Windows Terminal → cmd: success, 😀 comes through alright.
  • Windows Terminal → ubuntu → cmd: fail, 😀 gets mangled to U+FFFD.
  • Windows Terminal → cmd → ubuntu → cmd: fail, 😀 gets mangled to U+FFFD.

@miniksa
Copy link
Member

miniksa commented Sep 24, 2019

@benhillis, this is _IsActionableKey inside the WSL TTY causing an issue.

For a KEY_EVENT to be aggregated in the WSL TTY to be converted to UTF-8 and sent into the client, each individual event has to pass the _IsActionableKey check inside _GetNextCharacter which is called inside the _ConsoleWorker input loop.

The _IsActionableKey method follows these rules:

😀

↑ wch:0xd83d '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038

var value evaluation
wRepeatCount 1 false
uChar.UnicodeChar 0xd83d (for reference only)
uChar.AsciiChar 0x3d false
wVirtualScanCode 0x38 true
dwControlKeyState 0x0 false

false || false && true || false --> false --> inverted by function --> return TRUE;

↑ wch:0xde00 '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038

var value evaluation
wRepeatCount 1 false
uChar.UnicodeChar 0xde00 (for reference only)
uChar.AsciiChar 0x00 true
wVirtualScanCode 0x38 true
dwControlKeyState 0x0 false

false || true && true || false --> true --> inverted by function --> return FALSE;

😂

↑ wch:0xd83d '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038

var value evaluation
wRepeatCount 1 false
uChar.UnicodeChar 0xd83d (for reference only)
uChar.AsciiChar 0x3d false
wVirtualScanCode 0x38 true
dwControlKeyState 0x0 false

false || false && true || false --> false --> inverted by function --> return TRUE;

↑ wch:0xde02 '☐' mod:None (0x00000000) repeat:0x0001 vk:0x0012 vsc:0x0038

var value evaluation
wRepeatCount 1 false
uChar.UnicodeChar 0xde02 (for reference only)
uChar.AsciiChar 0x02 false
wVirtualScanCode 0x38 true
dwControlKeyState 0x0 false

false || false && true || false --> false --> inverted by function --> return TRUE;


For the 😀 case, the high surrogate passes the check and is appended to the TTY input. The low surrogate fails and is not appended. Then RtlUnicodeToUTF8N is called and it converts the broken surrogate pair unto U+FFFD, the unicode replacement character, which is forwarded through to the WSL instance.

For the 😂 case, the high and low surrogate pass the check, are appended tot he TTY input, and then converted successfully to UTF8 and forwarded into the WSL instance.

So overall, it looks like the _IsActionableKey method needs to be updated to check the whole uChar.UnicodeChar to see if the high word is also null before ruling something out as a null character or it will drop "numpad sequence" type characters that have an 0x00 in the low word.

@miniksa
Copy link
Member

miniksa commented Sep 24, 2019

Now tracking internally as MSFT:23541483

@miniksa miniksa added the Product-WSL Issue that should probably go to Microsoft/WSL label Sep 24, 2019
@DHowett-MSFT
Copy link
Contributor

The fix for this just went out with WSL in insiders’ build 19002!

@ghost ghost added the Needs-Tag-Fix Doesn't match tag requirements label Oct 17, 2019
@benhillis
Copy link
Member

@DHowett - Wow you beat me to it, was planning on looping back here to close this :)

@DHowett-MSFT
Copy link
Contributor

@benhillis y'can't beat a guy on the train to the airport with nothing to do except check twitter! 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Input Related to input processing (key presses, mouse, etc.) Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Product-Conpty For console issues specifically related to conpty Product-WSL Issue that should probably go to Microsoft/WSL
Projects
None yet
Development

No branches or pull requests

4 participants