gherkin: (C) Fix Windows related issues #193

brasmusson · 2017-05-01T12:21:07Z

Summary

Fix Windows related issues:

Use UTF-16 internally when the wchar_t is 2 bytes in size.
Always output UTF-8 encoded data (fixes Gherkin C tests fail on Wine #168).
Avoid swprintf

In addition make the code conform to C90.

Details

When looking into #168, a set of problem were found and fixed.

Use UTF-16 internally when the wchar_t is 2 bytes in size.
Most Windows compilers use 2 bytes for wide characters (wchar_t), that mean the not all unicode code points can be represented by a wide character. To encode code points from the Supplementary Planes UTF-16 are used (using two wide characters for each code point). The decoding of UTF-8 (which is done directly in the code) need to use UTF-16 in the internal wide character representation when necessary.
Always output UTF-8 encoded data (fixes Gherkin C tests fail on Wine #168).
Windows generally does not use UTF-8, even when using unicode it is most likely UTF-16 and not UTF-8. Therefore it seems best to always do the UTF-8 encoding directly in the code, instead of relying on print functions form the complier libraries.
Avoid swprintf
Many (older) Windows compilers use a different signature for swprintf than the ISO standard (excluding the count argument in the signature), therefore it is better to avoid using it and instead calculate the string representation of the error location directly (which was the only use of swprintf).
Another reason to avoid swprintf is that it did not exist in the ISO standard until C99.

In addition make the code conform to C90.
After the use of swprintf was removed, only the use of // for some commented out code was only thing that inhibit the compilation of the code with GCC (>v4.6) using -std=c90 (or -ansi), so the use of // was removed.

Interestingly enough using -std=c90` with GCC (>v4.6) did not complain on repeated typedefs even though they are not allowed until C11. GCC (<v4.6) on the other hand does not allow them at all, so they are removed.

Motivation and Context

Fixes #168.

How Has This Been Tested?

On OS name: "linux", version: "4.8.0-36-generic", arch: "amd64", family: "unix" (ubuntu 16.04) using:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
gcc (Alpine 6.2.1) 6.2.1 20160822 (docker)

On OS name: "linux", version: "4.4.0-72-generic", arch: "i386", family: "unix" (ubuntu 16.04) using:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)

On OS name: "windows 7", version: "6.1", arch: "amd64", family: "windows" using:

x86_64-w64-mingw32-gcc (GCC) 5.4.0
x86_64-pc-cygwin-gcc (GCC) 5.4.0

On OS name: "windows xp", version: "5.1", arch: "x86", family: "windows" using:

i686-pc-mingw32-gcc (GCC) 4.5.2
i686-pc-cygwin-gcc (GCC) 4.7.3

Types of changes

Bug fix (non-breaking change which fixes an issue).
New feature (non-breaking change which adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).

Checklist:

I've added tests for my code (tested by the acceptance tests).
My change requires a change to the documentation.
I have updated the documentation accordingly.

On Windows wchar_t is 2 bytes large, and use UTF-16. This means that for the case of code points > 0xFFFF (and wchar_t is only 2 bytes large), the code point read from the UTF-8 source need to be converted to two UTF-16 surrogates (wchar_t wide characters).

Windows does not use UTF-8 by default. To make sure that UTF-8 is used in the output, do manual conversion from unicode/UTF-16 to an UTF-8 byte sequence for output. It seems safer that to try to add Windows specific code to set UTF-8 output on that platform.

To support a wider range of compilers, remove the C90 breaches so that the code compiles also with -ansi and/or -std=c90 (using gcc). Another reason to not use swprintf is that on many Windows compilers swprintf has a different signature than the ISO standard specification. When cleaning up commented out code in gherkin_generate_tokens.c (to remove the usage of // for comments), settle to use the FileTokenScanner to get coverage of it in the acceptance tests. The code does not compile with -std=c90 -pedantic (using gcc), the conformance to C90 is not taken that far.

Technically C does not support repeated typedefs until C11, however modern compilers do allow it (gcc also using -std=c90). But since gcc <v4.6 does not support that, remove the usage of them.

lock · 2018-10-24T23:40:27Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

brasmusson added 4 commits April 29, 2017 19:32

gherkin: (C) Avoid using repeated typedefs (support gcc <v4.6)

6cdd13e

Technically C does not support repeated typedefs until C11, however modern compilers do allow it (gcc also using -std=c90). But since gcc <v4.6 does not support that, remove the usage of them.

aslakhellesoy added language: c library: gherkin labels May 1, 2017

aslakhellesoy merged commit 6cdd13e into master May 1, 2017

brasmusson deleted the gherkin-c-2-byte-wchar-t branch May 1, 2017 18:17

aslakhellesoy added a commit that referenced this pull request May 1, 2017

Attribution. Closes #168, closes #193.

d697792

lock bot locked as resolved and limited conversation to collaborators Oct 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gherkin: (C) Fix Windows related issues #193

gherkin: (C) Fix Windows related issues #193

brasmusson commented May 1, 2017 •

edited

Loading

lock bot commented Oct 24, 2018

gherkin: (C) Fix Windows related issues #193

gherkin: (C) Fix Windows related issues #193

Conversation

brasmusson commented May 1, 2017 • edited Loading

Summary

Details

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

lock bot commented Oct 24, 2018

brasmusson commented May 1, 2017 •

edited

Loading