Tokenizer token offset is incorrect #24
Comments
Hi @tegefaulkes, streamparser-json moves the offset based on the parsed string. Now you are making me wonder whether that is the correct way of counting or not... 🤔 Can you elaborate a bit on the use case that made you realize this?
Sure thing. What I was attempting to do is take a stream of JSON objects that were stringified and concatenated together. After some digging I worked out that I could extract these top-level objects with the JSONParser if I provided an empty string '' for the separator option. Is the offset relative to the parsed JSON? If it is parsed at that stage then the string offset wouldn't be useful, right? I figured the offset value should be correct for the raw input, or at least an offset value relative to the raw input should be available.
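A minimal sketch (not from the original thread) of the split-by-separator approach described above. The `@streamparser/json` package name, the import style, the `paths` option, and the exact shape of the `onValue` callback are assumptions and may differ between library versions:

```js
import { JSONParser } from '@streamparser/json'; // package name assumed

// Sketch: let the parser split concatenated top-level objects by passing an
// empty-string separator; paths: ['$'] asks only for completed root values.
const parser = new JSONParser({ separator: '', paths: ['$'] });
parser.onValue = (parsedValue) => {
  // Depending on the library version this may be the value itself or an
  // object wrapping it (e.g. { value, key, parent, stack }).
  console.log('top-level value:', parsedValue);
};

parser.write('{"a":1}{"b":2}{"c":3}');
```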
Oh, I see. You are definitely better off trusting the library to do the split for you. I also noticed that you are only interested in the top-level object. Also, as someone suggested in your PR, you should look into using the whatwg wrapper instead of the raw library. Regarding the offset, I think that you are right about it being wrong and I will look into fixing it.
Thanks for the suggestion.
Following up on this. You made me doubt, but the current logic is correct. So, in your example:

```js
test('testing string 2', async () => {
  const json = JSON.stringify({"ab\t": "abcd"});
  console.log('raw string length: ', json.length);
  const tokenizer = new streamParser.Tokenizer();
  tokenizer.onToken = (token) => console.log(token);
  tokenizer.write(json);
  console.log(json[6]);
});

// raw string length: 15 // Same length as above
// { token: 0, value: '{', offset: 0 }
// { token: 9, value: 'ab\t', offset: 1 }
// { token: 4, value: ':', offset: 6 } // This is correct because \t is a single-byte character (see https://www.compart.com/en/unicode/U+0009)
// { token: 9, value: 'abcd', offset: 7 }
// { token: 1, value: '}', offset: 13 }
// " // This isn't the character we expected
```

For example:

```js
const json = JSON.stringify({"vд": "abcd"});
console.log('raw string length: ', json.length);
const tokenizer = new streamParser.Tokenizer();
tokenizer.onToken = (token) => console.log(token);
tokenizer.write(json);
console.log(json[7]);

// raw string length: 13
// {token: 0, value: "{", offset: 0}
// {token: 9, value: "vд", offset: 1}
// {token: 4, value: ":", offset: 6} // this is correct because д is a single 2-byte character (see https://www.compart.com/en/unicode/U+0434)
// {token: 9, value: "abcd", offset: 7}
// {token: 1, value: "}", offset: 13}
// "a"
```

So, the offset works just as intended. What is clear is that the documentation could be a lot better 😅
Closing as this was clarified.
The Tokenizer outputs the wrong offset for tokens that come after a string token containing special characters. The difference from the expected offset is consistent with the number of such special characters in the input string.
Some examples:

This is the expected behaviour.

Using a single `\t` special character: the difference in expected output is consistent with the number of special characters.
My expectation is that the offset should be relative to the raw input. I understand that this is a niche use case, but is this something you can fix?
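For illustration, a hedged reproduction sketch (not from the original issue) built only from the Tokenizer usage shown in the comments above; the require path is an assumption:

```js
const streamParser = require('@streamparser/json'); // package/import path assumed

const json = JSON.stringify({ 'ab\t': 'abcd' }); // raw text: {"ab\t":"abcd"} (15 chars)
const tokenizer = new streamParser.Tokenizer();
tokenizer.onToken = (token) => {
  if (token.value === ':') {
    console.log('reported offset:', token.offset);       // 6, per the output quoted above
    console.log('raw input index:', json.indexOf(':'));  // 7, position in the raw input
  }
};
tokenizer.write(json);
```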