g_force_unicode is not set when reading utf8 files #13

mingodad · 2023-08-27T12:41:46Z

Looking through the code, show_help() says:

-utf8          In the absence of a BOM assume UTF-8

If a grammar is written in (or uses) UTF-8, that is not detected, so we have to pass -utf8 to actually get a UTF-8 parser/lexer from the grammar.

Am I missing something here?

mingodad · 2023-08-27T12:59:18Z

Also, the grammar is always opened with lexertl::citerator here:

gram_grep/main.cpp

Line 290 in e90d24e

iter = lexertl::citerator(_mf.data(), _mf.data() + _mf.size(),

instead of detecting whether the input grammar is written in (or uses) UTF-8.
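
To make the request concrete, here is a minimal sketch of what such detection could look like. It is not code from gram_grep or lexertl: has_utf8_bom and looks_like_utf8 are hypothetical helper names, and the validity check is deliberately simplified (it does not reject every overlong or surrogate encoding).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if the buffer starts with the UTF-8 BOM (EF BB BF).
bool has_utf8_bom(std::string_view buf)
{
    return buf.size() >= 3 &&
        static_cast<unsigned char>(buf[0]) == 0xEF &&
        static_cast<unsigned char>(buf[1]) == 0xBB &&
        static_cast<unsigned char>(buf[2]) == 0xBF;
}

// Hypothetical helper: true if the buffer is structurally valid UTF-8 and
// contains at least one multi-byte sequence. Pure ASCII returns false,
// because a byte-wise lexer already handles it correctly.
// Simplified: lead/continuation byte structure only.
bool looks_like_utf8(std::string_view buf)
{
    bool multibyte = false;
    std::size_t i = 0;

    while (i < buf.size())
    {
        const unsigned char c = static_cast<unsigned char>(buf[i]);
        std::size_t extra = 0; // number of continuation bytes expected

        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false; // stray continuation byte or invalid lead byte

        if (extra > buf.size() - i - 1) return false; // truncated sequence

        for (std::size_t j = 1; j <= extra; ++j)
        {
            const unsigned char cc = static_cast<unsigned char>(buf[i + j]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
        }

        if (extra != 0) multibyte = true;
        i += extra + 1;
    }

    return multibyte;
}
```

With something along these lines, the grammar loader could choose a UTF-8 aware iterator when has_utf8_bom(buf) || looks_like_utf8(buf) holds and fall back to the byte-wise one otherwise. Whether a heuristic like this is acceptable, as opposed to requiring a BOM or -utf8, is a design decision for the maintainer.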

mingodad · 2023-08-27T16:48:11Z

Here is an example grammar:

%token au_accented
%%
start: au_accented ;
%%
%%
[ \t\n\r]	skip()
[áéíóú]	au_accented
%%

e_utf8.txt:

é

E_utf8.txt:

É

Outputs:

./gram_grep -f au_utf8.g e_utf8.txt 
./e_utf8.txt(1):é
./e_utf8.txt(1):é
Matches: 2    Matching files: 1    Total files searched: 1
...
./gram_grep -f au_utf8.g E_utf8.txt 
./E_utf8.txt(1):É
Matches: 1    Matching files: 1    Total files searched: 1
...
./gram_grep -utf8 -f au_utf8.g e_utf8.txt 
Matches: 0    Matching files: 0    Total files searched: 1
...
./gram_grep -utf8 -f au_utf8.g E_utf8.txt 
Matches: 0    Matching files: 0    Total files searched: 1
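
One plausible reading of the counts without -utf8 (an assumption, not confirmed against the gram_grep source) is that [áéíóú] is compiled as a set of individual bytes rather than code points. In UTF-8, é is C3 A9 and É is C3 89, so a byte-wise set built from áéíóú would match both bytes of é but only the lead byte of É. The sketch below only mimics that byte-wise behaviour; count_bytewise_matches is a hypothetical name.

```cpp
#include <cstddef>
#include <string_view>

// Counts how many bytes of `input` occur in `set_bytes`, mimicking a
// character class that was built byte by byte from UTF-8 encoded text.
std::size_t count_bytewise_matches(std::string_view set_bytes,
    std::string_view input)
{
    std::size_t n = 0;

    for (const char c : input)
    {
        if (set_bytes.find(c) != std::string_view::npos) ++n;
    }

    return n;
}
```

Called with the UTF-8 bytes of áéíóú as set_bytes, this returns 2 for é and 1 for É, which lines up with the "Matches: 2" and "Matches: 1" lines above.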
