Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix infinite loop when using characters from unicode supplementary pl… #828

Merged
merged 1 commit into from
Sep 30, 2019
Merged

Fix infinite loop when using characters from unicode supplementary pl… #828

merged 1 commit into from
Sep 30, 2019

Conversation

Ratio2
Copy link
Contributor

@Ratio2 Ratio2 commented Sep 22, 2019

…anes ('🔈' for example)

Fixes # .

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the existing documentation
  • My changes generate no new warnings
  • I have updated the change log (Add/Change/Fix)
  • I have cleaned up the commit history (use rebase and squash)

Changes proposed in this pull request:

@Liryna
Copy link
Member

Liryna commented Sep 24, 2019

Hi @Ratio2 ,

Thank you for the PR. There are a couple of changes in your PR probably not all related to the issue.
Can you clean it ?
I imagine you are talking about put_utf8 for the infinite loop ?

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 24, 2019

@Liryna Make cleanup as you requested.

@Liryna
Copy link
Member

Liryna commented Sep 29, 2019

Hi @Ratio2 ,

I have tried to reproduce your issue with a small sample outside dokan_fuse and could not reproduce the infinite loop.

 unsigned char src[] = "🔈";
 unsigned char dst[16];
 convert_char(get_utf16, put_utf8, src, 2, dst);

Am I doing something different ?

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 29, 2019

@Liryna I reproduce bug in encfs bundle: https://github.com/jetwhiz/encfs4win/releases/download/v1.10.1/encfs-installer.exe, but bug in dokan itself.

@Liryna
Copy link
Member

Liryna commented Sep 29, 2019

Hi @jetwhiz,

Do you have any knowledge about such issue ?

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 29, 2019

@Liryna Bug occured when we convert utf-16 to utf-8. We got 4 bytes, but with invalid value. When we decode this utf-8, only 2 bytes consumed instead 4, second 2 bytes return error with negative result and str len increased (negative cast to big number => undefined behavior) instead reducing.

@Liryna
Copy link
Member

Liryna commented Sep 29, 2019

Do you have sample data (src buffer) that I could use to reproduce ?

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 29, 2019

@Liryna I don't report this issue to encfs team. Just find problem and fix it in dokan ;)

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 29, 2019

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 29, 2019

@Liryna Sorry, my test was bad, i fixed it:

int main()
{
	char utf8[5] = "\xf0\x9f\x94\x88";
	wchar_t utf8_to_utf16[3] = {};
	auto res = convert_char(get_utf8, put_utf16, utf8, 4, utf8_to_utf16);
	assert(res == 4);
	char utf16_to_utf8[5] = {};
	res = convert_char(get_utf16, put_utf8, utf8_to_utf16, 4, utf16_to_utf8);
	assert(res == 4);
	assert(!strcmp(utf8, utf16_to_utf8));
}

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 29, 2019

@Liryna One more simple test:

int main()
{
	wchar_t utf16[3] = L"\xd83d\xdd08";
	char utf8[5] = {};
	auto res = convert_char(get_utf16, put_utf8, utf16, 4, utf8);
	assert(res == 4);
	assert(utf8[0] != 0); // wtf?
}

@Liryna
Copy link
Member

Liryna commented Sep 29, 2019

Thank you very much @Ratio2 ! 🏆

I am not the author of the dokan_fuse. I was wondering if you understood why the author used the condition if (c < 0x40) ?

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 29, 2019

@Liryna
We have unicode character with code 0x1f508: https://www.unicode.org/charts/PDF/U1F300.pdf
In bits ‭11111010100001000‬ - 17 bits.
According to utf-8 we can encode 16 bits to 3 bytes and 21 bits to 4 bytes: https://en.wikipedia.org/wiki/UTF-8#Description
Split to 4 byte utf-8: 000 011111 010100 001000. First byte is zero, but it must be exists, author of original code make mistake assuming that first byte can not be zero and remain garbage in first byte ;)

@Ratio2
Copy link
Contributor Author

Ratio2 commented Sep 29, 2019

@Liryna Condition if (c < 0x40) check if remain 6 bits or less (max bits per utf-8 byte if encoded more then one).

@Liryna
Copy link
Member

Liryna commented Sep 30, 2019

Thank you @Ratio2 this explanation is very helpful 👍
More special thank for your PR, please do not hesitate if you find anything else 🚀

@Liryna Liryna merged commit 08502af into dokan-dev:master Sep 30, 2019
@jetwhiz
Copy link

jetwhiz commented Oct 1, 2019

Thanks @Ratio2 and @Liryna !

I wonder if this is related to the issue regarding Thai characters eating up 100% CPU resources? jetwhiz/encfs4win#94 Do Thai characters fall into this category?

@Liryna
Copy link
Member

Liryna commented Oct 1, 2019

Hi @jetwhiz

http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E0%B8%8B%E0%B9%88&mode=char
utf8 code also starts with a zero so therefore the same case as @Ratio2 reported.
I just made and test and I can confirm. Previous code corrupted the character but with @Ratio2 fix, we get the correct value 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants