
Tokenizer token offset is incorrect #24

Closed
tegefaulkes opened this issue Jan 11, 2023 · 6 comments

Comments

@tegefaulkes

The Tokenizer outputs the wrong offset for tokens that come after a string token containing special characters. The difference from the expected offset is consistent with the number of certain special characters in the input string.

Some examples

This is the expected behaviour

const streamParser = require('@streamparser/json'); // assuming the library is imported like this

test('testing string 1', async () => {
  const json = JSON.stringify({"abcd": "abcd"});
  console.log('raw string length: ', json.length);
  const tokenizer = new streamParser.Tokenizer();
  tokenizer.onToken = (token) => console.log(token);
  tokenizer.write(json);
  console.log(json[7]);
});
  
// raw string length:  15
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'abcd', offset: 1 }  
// { token: 4, value: ':', offset: 7 }  // Using this token as the reference
// { token: 9, value: 'abcd', offset: 8 }  
// { token: 1, value: '}', offset: 14 }  
// :  // We print the expected character

Using a single \t special character

test('testing string 2', async () => {  
  const json = JSON.stringify({"ab\t": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[6])  
})  
  
// raw string length:  15 // Same length as above
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'ab\t', offset: 1 }  
// { token: 4, value: ':', offset: 6 } // Off by 1 now
// { token: 9, value: 'abcd', offset: 7 }  
// { token: 1, value: '}', offset: 13 }  
// " // This isn't the character we expected

The difference from the expected output is consistent with the number of special characters:

test('testing string 3', async () => {  
  const json = JSON.stringify({"\t\n": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[5])  
})  
  
// raw string length:  15  // Same length
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: '\t\n', offset: 1 }  
// { token: 4, value: ':', offset: 5 }  // Off by 2 now
// { token: 9, value: 'abcd', offset: 6 }  
// { token: 1, value: '}', offset: 12 }  
// n

My expectation here is that the offset should be relative to the raw input. I understand that this is a niche use case, but is this something you can fix?

@juanjoDiaz
Owner

Hi @tegefaulkes,

streamparser-json moves the offset based on the parsed string.
This means that \t or \n count as 1 character.
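
As a rough illustration using nothing but the built-in JSON methods (no streamparser calls involved), the escape sequence counts as two characters in the raw text but as one character once parsed:

const raw = JSON.stringify({"ab\t": "abcd"}); // raw text is {"ab\t":"abcd"} with a literal backslash followed by t
console.log(raw.length);        // 15 – the escape is two characters in the raw text
console.log(raw.indexOf(':'));  // 7  – position of ':' counted over the raw text
console.log(Object.keys(JSON.parse(raw))[0].length); // 3 – 'ab' plus a single tab character after parsing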

Now you are making me wonder if that is the correct way of counting or not.... 🤔

Can you elaborate a bit on the use case that made you realize this?
How is this affecting you?

@tegefaulkes
Author

Sure thing.

What I was attempting to do is take a stream of JSON objects that were stringified and concatenated together in the form {jsonObject}{jsonObject}, and separate them by taking the subarray of the first object. For that I needed to find the offset of the first closing bracket '}'. But since I need to take the subarray of the raw input string, the offset provided by the tokenizer would be wrong if the object contained any special characters. This means I couldn't take the correct subarray without accounting for the special characters, which was very inefficient for my use case.

After some digging I worked out that I could extract these top-level objects with the JSONParser if I provided an empty string '' for the separator option.
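
Roughly like this (a minimal sketch; the exact onValue callback signature may differ between library versions):

const { JSONParser } = require('@streamparser/json');

// Two stringified objects concatenated back to back, as in my stream.
const input = JSON.stringify({ a: 1 }) + JSON.stringify({ b: 'x\ty' });

const parser = new JSONParser({ separator: '' }); // '' = objects follow each other with nothing in between
parser.onValue = (value, key, parent, stack) => {
  // Without a paths filter this also fires for nested values;
  // the completed top-level object is the one emitted with an empty stack.
  if (stack.length === 0) console.log('top-level object:', value);
};
parser.write(input);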

Is the offset used relative to the parsed JSON? If it's parsed at that stage then the string offset wouldn't be useful, right? I figured the offset value should be correct for the raw input, or at least an offset value relative to the raw input should be available.

@juanjoDiaz
Owner

Oh, I see.

You are definitely better off trusting the library to do the split for you.
There can be nested objects with their own closing brackets, closing brackets within text, etc...
The library and the separator option take care of all that for you.

I also noticed that you are only interested in the top level object.
You should use the paths option set to ['$'] and the parser will only emit the top level object and no sub-objects. So you don't need to check if the parent is null in the stack.
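
Something like this (a sketch, not tested against the current version; as above, the callback signature may vary):

const { JSONParser } = require('@streamparser/json');

const parser = new JSONParser({ separator: '', paths: ['$'] });
parser.onValue = (value) => {
  // With paths: ['$'] only completed top-level objects are emitted,
  // so there is no need to inspect the parent or the stack.
  console.log('top-level object:', value);
};
parser.write('{"a":1}{"b":2}');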

Also, as someone suggested in your PR, you should look into using the whatwg wrapper instead of the raw library.

Regarding the offset, I think that you are right about it being wrong and I will look into fixing it.

@tegefaulkes
Author

Thanks for the suggestion.

@juanjoDiaz
Owner

Following up on this.

You made me doubt it, but the current logic is correct.
The offset property counts the bytes in the stream, not the characters in the string that those bytes represent.

So in your example

test('testing string 2', async () => {  
  const json = JSON.stringify({"ab\t": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[6])  
})  
  
// raw string length:  15 // Same length as above
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'ab\t', offset: 1 }  
// { token: 4, value: ':', offset: 6 } // This is correct because \t is a single byte character. (See https://www.compart.com/en/unicode/U+0009) 
// { token: 9, value: 'abcd', offset: 7 }  
// { token: 1, value: '}', offset: 13 }  
// " // This isn't the character we expected

For example:

  const json = JSON.stringify({"vд": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[7])  

// raw string length: 13
// {token: 0, value: "{", offset: 0}
// {token: 9, value: "vд", offset: 1}
// {token: 4, value: ":", offset: 6} // this is correct because д is a single 2-byte character (See https://www.compart.com/en/unicode/U+0434)
// {token: 9, value: "abcd", offset: 7}
// {token: 1, value: "}", offset: 13}
// "a"

So, offset works just as intended.
Having a char offset instead of a byte offset is something that could be added to the library if needed.
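
As a reference point, the two ways of counting can be compared directly with the built-in TextEncoder:

const enc = new TextEncoder();
console.log('д'.length, enc.encode('д').length);   // 1 2  – one character, two UTF-8 bytes
console.log('\t'.length, enc.encode('\t').length); // 1 1  – one character, one UTF-8 byte
const json = JSON.stringify({"vд": "abcd"});
console.log(json.length, enc.encode(json).length); // 13 14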

What is clear is that documentation could be a lot better 😅

@juanjoDiaz
Copy link
Owner

Closing as this was clarified.
Feel free to reopen if you think that there are still open questions.
