Preserving whitespace within select occurrences of braces #4685

just-ero · 2024-08-26T12:41:46Z

Hello, I would like to parse the following:

state("") {
  T0 foo : 0;
  T1 bar : 0x0;
}

state("", "") {
  T2 baz: "", 0, 0;
  T3 qux: 0, 0b0000;
}

startup {
  // Preserve raw contents.
}

init {
  // Preserve raw contents.
}

update {
  // Preserve raw contents.
}

// ...

In short: the contents within the state blocks are custom. I can already parse those and they aren't a problem.

However, the contents of all other blocks are plain C# code that is compiled using Roslyn. I therefore need to preserve the whitespace, comments, etc. exactly, so that the code compiles correctly and reported line numbers point to the correct place in the original script.

How do I do this? I need to be able to access both the block's name and its raw contents from the generated code.

kaby76 · 2024-08-26T15:23:53Z

As I said before (https://stackoverflow.com/questions/78901061/ignoring-whitespace-everywhere-except-for-specific-rule#comment139111216_78901061), you need to either use lexer modes for the C# code blocks or define the parser grammar for these C# code blocks. You can then extract the text, including whitespace, from the char stream.

For an example of the lexer mode solution, see the antlr4 grammar in grammars-v4. That grammar switches to a lexer mode when a left curly brace is found: https://github.com/antlr/grammars-v4/blob/e07dbbf3445d31da61af5f54f04df78ea40ab9f8/antlr/antlr4/ANTLRv4Lexer.g4#L114

In TargetLanguageAction, chars are tokenized individually. But you could just as well "more()" the characters onto one large token for the entire C# code block.

5 replies

just-ero Aug 26, 2024
Author

In case it wasn't obvious from the fact that I am asking a second time: I have no idea what you're telling me to do.

kaby76 Aug 26, 2024

This is a split grammar that implements a lexer mode for "C# blocks".

parser grammar MyParser;
options { tokenVocab = MyLexer; }
file_ : decl* EOF;
decl: state_decl | csharp_decl;
state_decl : 'state' OP .*? CP OC .*? CC ;
csharp_decl : Identifier CSharpBlock ;

lexer grammar MyLexer;
tokens { CSharpBlock }
State_: 'state';
Identifier : ('init' | 'update' | 'startup') -> pushMode(csharp_mode);
OC: '{';
CC: '}';
OP: '(';
CP: ')';
WS: [ \t\r\n]+ -> channel(HIDDEN);
Whatever : .;

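mode csharp_mode;
// Each '{' is appended to the pending token and pushes the mode again, so the mode stack depth tracks brace nesting.
OC_cs: '{' -> more, pushMode(csharp_mode);
CC_cs: '}' {
	if (ModeStack.Count > 2)
	{
		// Nested '}': keep accumulating and leave one nesting level.
		this.More();
		this.PopMode();
	}
	else
	{
		// Outermost '}': return to the default mode and emit everything accumulated as one CSharpBlock token.
		this.PopMode();
		this.PopMode();
		this.Type = MyLexer.CSharpBlock;
	}
};
// Any other character, including whitespace, is appended to the pending token.
Whatever_cs: . -> more;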

Input:

state("") {
  T0 foo : 0;
  T1 bar : 0x0;
}

state("", "") {
  T2 baz: "", 0, 0;
  T3 qux: 0, 0b0000;
}

startup {
  // Preserve raw contents.
}

init {
  // Preserve raw contents.
}

update {
  // Preserve raw contents.
}

Token stream for input:

$ ./bin/Debug/net8.0/Test.exe in.txt -tokens
[@0,0:4='state',<2>,1:0]
[@1,5:5='(',<6>,1:5]
[@2,6:6='"',<9>,1:6]
[@3,7:7='"',<9>,1:7]
[@4,8:8=')',<7>,1:8]
[@5,9:9=' ',<8>,channel=1,1:9]
[@6,10:10='{',<4>,1:10]
[@7,11:14='\r\n  ',<8>,channel=1,1:11]
[@8,15:15='T',<9>,2:2]
[@9,16:16='0',<9>,2:3]
[@10,17:17=' ',<8>,channel=1,2:4]
[@11,18:18='f',<9>,2:5]
[@12,19:19='o',<9>,2:6]
[@13,20:20='o',<9>,2:7]
[@14,21:21=' ',<8>,channel=1,2:8]
[@15,22:22=':',<9>,2:9]
[@16,23:23=' ',<8>,channel=1,2:10]
[@17,24:24='0',<9>,2:11]
[@18,25:25=';',<9>,2:12]
[@19,26:29='\r\n  ',<8>,channel=1,2:13]
[@20,30:30='T',<9>,3:2]
[@21,31:31='1',<9>,3:3]
[@22,32:32=' ',<8>,channel=1,3:4]
[@23,33:33='b',<9>,3:5]
[@24,34:34='a',<9>,3:6]
[@25,35:35='r',<9>,3:7]
[@26,36:36=' ',<8>,channel=1,3:8]
[@27,37:37=':',<9>,3:9]
[@28,38:38=' ',<8>,channel=1,3:10]
[@29,39:39='0',<9>,3:11]
[@30,40:40='x',<9>,3:12]
[@31,41:41='0',<9>,3:13]
[@32,42:42=';',<9>,3:14]
[@33,43:44='\r\n',<8>,channel=1,3:15]
[@34,45:45='}',<5>,4:0]
[@35,46:49='\r\n\r\n',<8>,channel=1,4:1]
[@36,50:54='state',<2>,6:0]
[@37,55:55='(',<6>,6:5]
[@38,56:56='"',<9>,6:6]
[@39,57:57='"',<9>,6:7]
[@40,58:58=',',<9>,6:8]
[@41,59:59=' ',<8>,channel=1,6:9]
[@42,60:60='"',<9>,6:10]
[@43,61:61='"',<9>,6:11]
[@44,62:62=')',<7>,6:12]
[@45,63:63=' ',<8>,channel=1,6:13]
[@46,64:64='{',<4>,6:14]
[@47,65:68='\r\n  ',<8>,channel=1,6:15]
[@48,69:69='T',<9>,7:2]
[@49,70:70='2',<9>,7:3]
[@50,71:71=' ',<8>,channel=1,7:4]
[@51,72:72='b',<9>,7:5]
[@52,73:73='a',<9>,7:6]
[@53,74:74='z',<9>,7:7]
[@54,75:75=':',<9>,7:8]
[@55,76:76=' ',<8>,channel=1,7:9]
[@56,77:77='"',<9>,7:10]
[@57,78:78='"',<9>,7:11]
[@58,79:79=',',<9>,7:12]
[@59,80:80=' ',<8>,channel=1,7:13]
[@60,81:81='0',<9>,7:14]
[@61,82:82=',',<9>,7:15]
[@62,83:83=' ',<8>,channel=1,7:16]
[@63,84:84='0',<9>,7:17]
[@64,85:85=';',<9>,7:18]
[@65,86:89='\r\n  ',<8>,channel=1,7:19]
[@66,90:90='T',<9>,8:2]
[@67,91:91='3',<9>,8:3]
[@68,92:92=' ',<8>,channel=1,8:4]
[@69,93:93='q',<9>,8:5]
[@70,94:94='u',<9>,8:6]
[@71,95:95='x',<9>,8:7]
[@72,96:96=':',<9>,8:8]
[@73,97:97=' ',<8>,channel=1,8:9]
[@74,98:98='0',<9>,8:10]
[@75,99:99=',',<9>,8:11]
[@76,100:100=' ',<8>,channel=1,8:12]
[@77,101:101='0',<9>,8:13]
[@78,102:102='b',<9>,8:14]
[@79,103:103='0',<9>,8:15]
[@80,104:104='0',<9>,8:16]
[@81,105:105='0',<9>,8:17]
[@82,106:106='0',<9>,8:18]
[@83,107:107=';',<9>,8:19]
[@84,108:109='\r\n',<8>,channel=1,8:20]
[@85,110:110='}',<5>,9:0]
[@86,111:114='\r\n\r\n',<8>,channel=1,9:1]
[@87,115:121='startup',<3>,11:0]
[@88,122:155=' {\r\n  // Preserve raw contents.\r\n}',<1>,11:7]
[@89,156:159='\r\n\r\n',<8>,channel=1,13:1]
[@90,160:163='init',<3>,15:0]
[@91,164:197=' {\r\n  // Preserve raw contents.\r\n}',<1>,15:4]
[@92,198:201='\r\n\r\n',<8>,channel=1,17:1]
[@93,202:207='update',<3>,19:0]
[@94,208:241=' {\r\n  // Preserve raw contents.\r\n}',<1>,19:6]
[@95,242:243='\r\n',<8>,channel=1,21:1]
[@96,244:243='<EOF>',<-1>,22:0]

CSharp 0 in.txt success 0.0183169
Total Time: 0.0746913
08/26-12:55:26 ~/issues/a4-4685/Generated-CSharp

This grammar works by lexing "state" blocks and "C# blocks" differently. For "state" decls, the default mode is in effect, which means the lexer rules State_, Identifier, OC, CC, OP, CP, WS, and Whatever are recognized. If the lexer sees an "init", "update", or "startup", indicating the start of a "C# block", it enters the mode "csharp_mode". The caveat here is the assumption that these strings don't occur in "state" blocks. Once in "csharp_mode", only the rules OC_cs, CC_cs, and Whatever_cs are recognized. In this mode, we look for open and close curlies, counting them to match each '{' with a '}'. We have to do this because the C# code may itself contain nested blocks. When we see the matching close curly, the mode is exited and we go back to the default lexer mode. Note that each "C# block" is now a single token whose text is the entire C# code block, whitespace included.
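The brace counting that "csharp_mode" performs can be illustrated without ANTLR. The following is a minimal plain C# sketch, not generated lexer code: the names BlockScanner and ReadRawBlock are hypothetical, and a simple depth counter stands in for the mode stack. Unlike the CSharpBlock token in the output above, it starts at the '{' rather than at the whitespace before it. Like the grammar, it counts every brace, including braces inside comments and string literals.

```csharp
using System;

static class BlockScanner
{
    // Returns the raw text from the first '{' at or after `start` through its
    // matching '}'. Nothing in between is skipped or normalized.
    public static string ReadRawBlock(string text, int start)
    {
        int open = text.IndexOf('{', start);
        if (open < 0) throw new FormatException("no opening brace");

        int depth = 0;
        for (int i = open; i < text.Length; i++)
        {
            if (text[i] == '{')
                depth++;                              // comparable to pushMode(csharp_mode)
            else if (text[i] == '}' && --depth == 0)  // comparable to the outermost '}'
                return text.Substring(open, i - open + 1);
        }
        throw new FormatException("unbalanced braces");
    }
}
```

For the input `init { if (x) { y(); } }`, calling `BlockScanner.ReadRawBlock(text, 0)` yields the whole outer block, nested braces included.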

Alternatively, you could implement this by adding the C# grammar to your existing grammar. You would then need to avoid name collisions between the C# grammar and yours, and you would also have to deal with overlapping lexer rules.

just-ero Aug 28, 2024
Author

I appreciate the writeup, even if I'll have to wrap my head around whatever is happening there; a lot of that is completely new to me.

> The caveat here is the assumption that these strings don't occur in "state" blocks.

This sounds like a tremendous downside and is not something I can accept. Is there any way to fix that?

kaby76 Aug 29, 2024

> > The caveat here is the assumption that these strings don't occur in "state" blocks.
>
> This sounds like a tremendous downside and is not something I can accept. Is there any way to fix that?

I don't know enough about your language. For example, I don't know whether the following is allowed, or how you want to tokenize things inside a "state" block.

state("") {
  T0 foo : 0;
  T1 bar : 0x0;
   init { // "init" inside a "state" block.
      // Preserve raw contents.
   }
}
state("") {
   T0 state: 0;  // "state" inside a "state" block.
}
etc.

The point is that a lexer mode is a way of changing the lexing in lieu of "parser influence." You'll need to work out the details.
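One possible way to resolve the ambiguity in the example above, assuming block names should only count as headers outside any braces, is to track brace depth while scanning. This is a plain C# sketch independent of ANTLR; TopLevel and FindHeaders are hypothetical names, and the scanner deliberately ignores comments and string literals, so braces inside those would throw the depth off.

```csharp
using System;
using System.Collections.Generic;

static class TopLevel
{
    // Returns the offsets at which `name` occurs as a whole word outside any braces.
    public static List<int> FindHeaders(string text, string name)
    {
        var hits = new List<int>();
        int depth = 0;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '{') depth++;
            else if (c == '}') depth--;
            else if (depth == 0
                     && string.CompareOrdinal(text, i, name, 0, name.Length) == 0
                     && (i == 0 || !IsWordChar(text[i - 1]))
                     && (i + name.Length == text.Length || !IsWordChar(text[i + name.Length])))
            {
                hits.Add(i);
                i += name.Length - 1;   // continue after the matched word
            }
        }
        return hits;
    }

    static bool IsWordChar(char c) => char.IsLetterOrDigit(c) || c == '_';
}
```

With this approach an `init` that appears inside a "state" block is at depth 1 and is therefore not reported, while a top-level `init {` is.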

just-ero Aug 29, 2024
Author

Thanks for your time, but I'm giving up.

Antlr4 simply doesn't appear to be able to do what I require:

- I do not want a CSharpBlock to contain the leading and trailing braces.
- I need to preserve comments in action blocks, but any braces and mentions of actions/state within them must not be recognized.
- State blocks are allowed to contain the names of actions and state itself, but must not feature anything besides declarations in the format Type Id: [string ,] Number [, Number]*;. The final declaration does not need a trailing ; (legacy compatibility); I was going to look at how to do this later.
- State blocks must not contain nested braces.
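For reference, the first two requirements can be sketched as a small hand-written scanner in plain C#. This is only an illustration under stated assumptions, not an ANTLR solution: ActionBody and Extract are hypothetical names, `open` is assumed to be the offset of the block's '{', and only `//` comments, `/* */` comments, regular string literals, and char literals are skipped. Verbatim, raw, and interpolated strings would need extra handling.

```csharp
using System;

static class ActionBody
{
    // Returns the text strictly between the '{' at `open` and its matching '}',
    // with whitespace and comments left untouched.
    public static string Extract(string text, int open)
    {
        int depth = 0;
        for (int i = open; i < text.Length; i++)
        {
            char c = text[i];
            char next = i + 1 < text.Length ? text[i + 1] : '\0';

            if (c == '/' && next == '/')            // line comment: skip to end of line
            {
                i = text.IndexOf('\n', i);
                if (i < 0) break;
            }
            else if (c == '/' && next == '*')       // block comment: skip to "*/"
            {
                i = text.IndexOf("*/", i + 2, StringComparison.Ordinal);
                if (i < 0) break;
                i++;                                // land on the closing '/'
            }
            else if (c == '"' || c == '\'')         // regular string or char literal
            {
                for (i++; i < text.Length && text[i] != c; i++)
                    if (text[i] == '\\') i++;       // skip the escaped character
            }
            else if (c == '{') depth++;
            else if (c == '}' && --depth == 0)
                return text.Substring(open + 1, i - open - 1);
        }
        throw new FormatException("unbalanced braces");
    }
}
```

Because comments and literals are skipped as a whole, a `}` inside `// ...` or a `{` inside `"..."` does not affect the depth, and the returned body excludes the outer braces.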

Besides this, Antlr4's error messages leave a lot to be desired and I'll be forced to transform them to something that actually makes sense to users.

Again, I appreciate you trying to help, but this is a lost endeavor.

jimidle · 2024-08-29T22:53:02Z

Why not just get the start of your '{' token and the start of your '}' token and extract the text between them from your input stream? You will need real tokens, though, not literal strings.
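A minimal sketch of that idea in plain C#, assuming the two brace offsets are already known (with ANTLR they would come from the start indexes of the two tokens). RawSlice, WithLineDirective, and the file name are hypothetical. The C# `#line` directive is used here as one way to keep compiler diagnostics pointing at the original script's line numbers; the file name is assumed to contain no quotes.

```csharp
using System;

static class RawSlice
{
    // `open` and `close` are the character offsets of the block's '{' and '}'.
    // Returns the untouched text between them, prefixed with a #line directive.
    public static string WithLineDirective(string source, int open, int close, string fileName)
    {
        string body = source.Substring(open + 1, close - open - 1);

        // 1-based line on which the body starts: count newlines up to the '{'.
        int line = 1;
        for (int i = 0; i <= open; i++)
            if (source[i] == '\n') line++;

        // After "#line N", the next physical line is reported as line N of fileName.
        return "#line " + line + " \"" + fileName + "\"\n" + body;
    }
}
```

The text after the '{' on its own line becomes line N, so statements on following lines keep the line numbers they have in the script. Column numbers on the first line are offset by the position of the brace.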

1 reply

just-ero Aug 30, 2024
Author

Because the state block also has { and } tokens which must be treated differently.
