`indices` reports byte offsets instead of character offsets #3064
Describe the bug
jq indexes strings by characters (Unicode codepoints). To see that, we can run `"🇬🇧oo" | .[0 : 1,2,3,4]`, which yields "🇬" "🇬🇧" "🇬🇧o" "🇬🇧oo". Note that 🇬🇧 is actually two characters and 8 bytes, as we can see from `"🇬🇧" | length, utf8bytelength`.
However, the `indices` filter returns byte offsets of the pattern in the string. The documentation does not specify the behaviour of `indices` for UTF-8 strings, but given that `length` and `.[x:y]` use character counts to index strings, it is likely that this is a bug and not just undocumented behaviour.
To Reproduce
$ ./jq-linux-amd64-1.7.1 -nc '"🇬🇧oo" | indices("o")'
[8,9]
$ ./jq-linux-amd64-1.7.1 -nc '"ƒoo" | indices("o")'
[2,3]
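For comparison, here is a small Python sketch (not jq code) of where these numbers come from. It searches the same strings once over their UTF-8 bytes and once over their codepoints. The helper names `byte_offsets` and `codepoint_offsets` are made up for this illustration.

```python
def byte_offsets(haystack: str, needle: str) -> list[int]:
    """Offsets of needle within the UTF-8 encoding of haystack."""
    if not needle:
        return []
    h, n = haystack.encode("utf-8"), needle.encode("utf-8")
    out, i = [], h.find(n)
    while i != -1:
        out.append(i)
        i = h.find(n, i + 1)
    return out


def codepoint_offsets(haystack: str, needle: str) -> list[int]:
    """Offsets of needle counted in codepoints (Python str indices)."""
    if not needle:
        return []
    out, i = [], haystack.find(needle)
    while i != -1:
        out.append(i)
        i = haystack.find(needle, i + 1)
    return out


# "\U0001F1EC\U0001F1E7" is the flag from the report: two codepoints, 4 bytes each.
flag_oo = "\U0001F1EC\U0001F1E7oo"
f_oo = "\u0192oo"  # "ƒoo": ƒ is one codepoint, 2 bytes in UTF-8

print(byte_offsets(flag_oo, "o"), codepoint_offsets(flag_oo, "o"))
print(byte_offsets(f_oo, "o"), codepoint_offsets(f_oo, "o"))
```

The byte-based search reproduces the output reported above, while the codepoint-based search gives the offsets that `length` and `.[x:y]` would suggest.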
Expected behavior
$ ./jq-linux-amd64-1.7.1-fixed -nc '"🇬🇧oo" | indices("o")'
[2,3]
$ ./jq-linux-amd64-1.7.1-fixed -nc '"ƒoo" | indices("o")'
[1,2]
The problem probably originates in `jv_string_indexes`.
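One possible direction for a fix, sketched in Python rather than in jq's C source: keep the byte-level search and translate each byte offset into a codepoint offset by counting the bytes before it that are not UTF-8 continuation bytes (`10xxxxxx`). This only illustrates the idea; the function name `byte_to_codepoint_offset` is hypothetical, and nothing here is taken from the actual implementation of `jv_string_indexes`.

```python
def byte_to_codepoint_offset(data: bytes, byte_offset: int) -> int:
    """Count the codepoints that start before byte_offset in valid UTF-8 data.

    Every codepoint starts with exactly one byte that is not a continuation
    byte (0b10xxxxxx), so counting those bytes counts codepoints.
    """
    return sum(1 for b in data[:byte_offset] if (b & 0xC0) != 0x80)


# The two strings from the report, with the byte offsets jq 1.7.1 printed.
flag_oo = "\U0001F1EC\U0001F1E7oo".encode("utf-8")
f_oo = "\u0192oo".encode("utf-8")

print([byte_to_codepoint_offset(flag_oo, o) for o in (8, 9)])
print([byte_to_codepoint_offset(f_oo, o) for o in (2, 3)])
```

Applied to the reported byte offsets, this yields the offsets listed under "Expected behavior". Recounting from the start for every match is quadratic in the worst case, so a real fix would more likely keep a running codepoint count while scanning.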