
Tokenizer token offset is incorrect #24

Closed
tegefaulkes opened this issue Jan 11, 2023 · 6 comments

Comments

@tegefaulkes

The Tokenizer outputs the wrong offset for tokens that come after a string token containing special characters. The difference from the expected offset is consistent with the number of certain special characters in the input string.

Some examples

This is the expected behaviour

const streamParser = require('@streamparser/json'); // assuming the library is imported like this

test('testing string 1', async () => {
  const json = JSON.stringify({"abcd": "abcd"});
  console.log('raw string length: ', json.length);
  const tokenizer = new streamParser.Tokenizer();
  tokenizer.onToken = (token) => console.log(token);
  tokenizer.write(json);
  console.log(json[7]);
});
  
// raw string length:  15
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'abcd', offset: 1 }  
// { token: 4, value: ':', offset: 7 }  // Using this token as the reference
// { token: 9, value: 'abcd', offset: 8 }  
// { token: 1, value: '}', offset: 14 }  
// :  // We print the expected character

Using a single \t special character

test('testing string 2', async () => {  
  const json = JSON.stringify({"ab\t": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[6])  
})  
  
// raw string length:  15 // Same length as above
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'ab\t', offset: 1 }  
// { token: 4, value: ':', offset: 6 } // Off by 1 now
// { token: 9, value: 'abcd', offset: 7 }  
// { token: 1, value: '}', offset: 13 }  
// " // This isn't the character we expected

The difference from the expected output is consistent with the number of special characters:

test('testing string 3', async () => {  
  const json = JSON.stringify({"\t\n": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[5])  
})  
  
// raw string length:  15  // Same length
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: '\t\n', offset: 1 }  
// { token: 4, value: ':', offset: 5 }  // Off by 2 now
// { token: 9, value: 'abcd', offset: 6 }  
// { token: 1, value: '}', offset: 12 }  
// n

My expectation here is that the offset should be relative to the raw input. I understand that this is a niche use case, but is this something you can fix?

@juanjoDiaz
Owner

Hi @tegefaulkes,

streamparser-json moves the offset based on the parsed string.
This means that \t or \n count as 1 character.
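
As a rough illustration using nothing but the built-in JSON methods (no streamparser calls involved), the escape sequence counts as two characters in the raw text but as one character once parsed:

const raw = JSON.stringify({"ab\t": "abcd"}); // raw text is {"ab\t":"abcd"} with a literal backslash followed by t
console.log(raw.length);        // 15 – the escape is two characters in the raw text
console.log(raw.indexOf(':'));  // 7  – position of ':' counted over the raw text
console.log(Object.keys(JSON.parse(raw))[0].length); // 3 – 'ab' plus a single tab character after parsing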

Now you are making me wonder if that is the correct way of counting or not.... 🤔

Can you elaborate a bit on the use case that made you realize this?
How is this affecting you?

@tegefaulkes
Author

Sure thing.

What I was attempting to do is take a stream of JSON objects that were stringified and concatenated together in the form {jsonObject}{jsonObject}, and separate them by taking the subarray of the first object. For that I needed to find the offset of the first closing bracket '}'. But since I need to take the subarray of the raw input string, the offset provided by the tokenizer would be wrong if the object contained any special characters. This means I couldn't take the correct subarray without accounting for the special characters, which was very inefficient for my use case.

After some digging I worked out that I could extract these top-level objects with the JSONParser if I provided an empty string '' for the separator option.
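
Roughly like this (a minimal sketch; the exact onValue callback signature may differ between library versions):

const { JSONParser } = require('@streamparser/json');

// Two stringified objects concatenated back to back, as in my stream.
const input = JSON.stringify({ a: 1 }) + JSON.stringify({ b: 'x\ty' });

const parser = new JSONParser({ separator: '' }); // '' = objects follow each other with nothing in between
parser.onValue = (value, key, parent, stack) => {
  // Without a paths filter this also fires for nested values;
  // the completed top-level object is the one emitted with an empty stack.
  if (stack.length === 0) console.log('top-level object:', value);
};
parser.write(input);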

Is the offset used relative to the parsed JSON? If it's parsed at that stage then the string offset wouldn't be useful, right? I figured the offset value should be correct for the raw input, or at least an offset value relative to the raw input should be available.

@juanjoDiaz
Owner

Oh, I see.

You are definitely better off trusting the library to do the split for you.
There can be nested objects with their own closing brackets, closing brackets within text, etc...
The library and the separator option take care of all that for you.

I also noticed that you are only interested in the top level object.
You should use the paths option set to ['$'] and the parser will only emit the top level object and no sub-objects. So you don't need to check if the parent is null in the stack.
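
Something like this (a sketch, not tested against the current version; as above, the callback signature may vary):

const { JSONParser } = require('@streamparser/json');

const parser = new JSONParser({ separator: '', paths: ['$'] });
parser.onValue = (value) => {
  // With paths: ['$'] only completed top-level objects are emitted,
  // so there is no need to inspect the parent or the stack.
  console.log('top-level object:', value);
};
parser.write('{"a":1}{"b":2}');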

Also, as someone suggested in your PR, you should look into using the whatwg wrapper instead of the raw library.

Regarding the offset, I think that you are right about it being wrong and I will look into fixing it.

@tegefaulkes
Author

Thanks for the suggestion.

@juanjoDiaz
Owner

Following up on this.

You made me doubt it, but the current logic is correct.
The offset property counts the bytes in the stream, not the characters in the string that those bytes represent.

So in your example

test('testing string 2', async () => {  
  const json = JSON.stringify({"ab\t": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[6])  
})  
  
// raw string length:  15 // Same length as above
// { token: 0, value: '{', offset: 0 }  
// { token: 9, value: 'ab\t', offset: 1 }  
// { token: 4, value: ':', offset: 6 } // This is correct because \t is a single byte character. (See https://www.compart.com/en/unicode/U+0009) 
// { token: 9, value: 'abcd', offset: 7 }  
// { token: 1, value: '}', offset: 13 }  
// " // This isn't the character we expected

For example:

  const json = JSON.stringify({"vд": "abcd"});  
  console.log('raw string length: ', json.length)  
  const tokenizer = new streamParser.Tokenizer()  
  tokenizer.onToken = (token) => console.log(token);  
  tokenizer.write(json)  
  console.log(json[7])  

// raw string length: 13
// {token: 0, value: "{", offset: 0}
// {token: 9, value: "vд", offset: 1}
// {token: 4, value: ":", offset: 6} // this is correct because д is a single 2-byte character (See https://www.compart.com/en/unicode/U+0434)
// {token: 9, value: "abcd", offset: 7}
// {token: 1, value: "}", offset: 13}
// "a"

So, offset works just as intended.
Having a char offset instead of a byte offset is something that could be added to the library if needed.
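
As a reference point, the two ways of counting can be compared directly with the built-in TextEncoder:

const enc = new TextEncoder();
console.log('д'.length, enc.encode('д').length);   // 1 2  – one character, two UTF-8 bytes
console.log('\t'.length, enc.encode('\t').length); // 1 1  – one character, one UTF-8 byte
const json = JSON.stringify({"vд": "abcd"});
console.log(json.length, enc.encode(json).length); // 13 14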

What is clear is that documentation could be a lot better 😅

@juanjoDiaz
Copy link
Owner

Closing as this was clarified.
Feel free to reopen if you think that there are still open questions.
