json_stringify/json_parse roundtrip not idempotent; json_stringify output not valid ASCII/UTF-8 #1607

Closed
moschroe opened this issue Aug 5, 2024 · 1 comment · Fixed by #1626

moschroe commented Aug 5, 2024

While processing syslog messages in JSONL format, I came across a logged SSID. SSIDs are notorious for breaking naive tools because they can contain up to 32 arbitrary bytes, not necessarily valid ASCII/UTF-8 strings.

In this instance, the log entry was written by another tool as valid line-delimited JSON and looks like this (formatted for convenience):

{
  "_ts_message": "2024-08-05T07:07:23Z",
  "msg_orig": "<4>Aug  5 09:07:23 %REDACT% Ssid=some-SSID\u0001\b����\u0012$Hl\u0003\u0001\u0006\u0005\u0004",
  "timestamp": "2024-08-05T07:07:23.652051298Z"
}

Running mlr --ijson --ojsonl cat sample_orig.json yields output that is no longer valid JSON because it contains unescaped control characters:

00000000  7b 22 5f 74 73 5f 6d 65  73 73 61 67 65 22 3a 20  |{"_ts_message": |
00000010  22 32 30 32 34 2d 30 38  2d 30 35 54 30 37 3a 30  |"2024-08-05T07:0|
00000020  37 3a 32 33 5a 22 2c 20  22 6d 73 67 5f 6f 72 69  |7:23Z", "msg_ori|
00000030  67 22 3a 20 22 3c 34 3e  41 75 67 20 20 35 20 30  |g": "<4>Aug  5 0|
00000040  39 3a 30 37 3a 32 33 20  25 52 45 44 41 43 54 25  |9:07:23 %REDACT%|
00000050  20 53 73 69 64 3d 73 6f  6d 65 2d 53 53 49 44 01  | Ssid=some-SSID.|
00000060  5c 62 ef bf bd ef bf bd  ef bf bd ef bf bd 12 24  |\b.............$|
00000070  48 6c 03 01 06 05 04 22  2c 20 22 74 69 6d 65 73  |Hl.....", "times|
00000080  74 61 6d 70 22 3a 20 22  32 30 32 34 2d 30 38 2d  |tamp": "2024-08-|
00000090  30 35 54 30 37 3a 30 37  3a 32 33 2e 36 35 32 30  |05T07:07:23.6520|
000000a0  35 31 32 39 38 5a 22 7d  0a                       |51298Z"}.|

Attempting to process this output (I was using mlr to filter logs) fails with a parser error:
mlr: invalid character '\x01' in string literal.
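
The error text has the same form as the one produced by Go's standard encoding/json package when it meets a raw control character inside a string. As a minimal sketch (checkJSON is a hypothetical helper, and whether Miller uses encoding/json directly is an assumption), the following shows that a raw 0x01 byte is rejected while its escaped form \u0001 is accepted:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkJSON is a hypothetical helper for illustration only.
// It returns the decoding error, or nil if doc is a valid JSON object.
func checkJSON(doc []byte) error {
	var v map[string]interface{}
	return json.Unmarshal(doc, &v)
}

func main() {
	// Raw 0x01 byte inside the string value: not valid JSON.
	raw := []byte("{\"msg_orig\": \"some-SSID\x01\"}")
	// The same character written as a \u escape: valid JSON.
	escaped := []byte(`{"msg_orig": "some-SSID\u0001"}`)

	fmt.Println("raw:", checkJSON(raw))         // non-nil error
	fmt.Println("escaped:", checkJSON(escaped)) // nil
}
```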

Expected Fix

json_stringify() (which I assume is used for output encoding as well) must properly encode control characters, as specified at json.org (string syntax diagram).

Edit: The bug seems to live here:

func millerJSONEncodeString(input string) string {

A check and branch are missing for values outside of 0x20..=0x10FFFF, that is, for control characters below 0x20, which the JSON string grammar requires to be escaped.
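
For illustration, the missing branch could look roughly like the sketch below. This is not Miller's actual code and not the fix that was merged: escapeJSONString is a hypothetical name, and the sketch assumes the input is iterated rune by rune, with anything at or above 0x20 other than the quote and backslash passed through unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeJSONString is a hypothetical helper, not Miller's implementation.
// It wraps input in double quotes and escapes the characters that the JSON
// string grammar does not allow to appear raw.
func escapeJSONString(input string) string {
	var sb strings.Builder
	sb.WriteByte('"')
	for _, r := range input {
		switch r {
		case '"':
			sb.WriteString(`\"`)
		case '\\':
			sb.WriteString(`\\`)
		case '\b':
			sb.WriteString(`\b`)
		case '\f':
			sb.WriteString(`\f`)
		case '\n':
			sb.WriteString(`\n`)
		case '\r':
			sb.WriteString(`\r`)
		case '\t':
			sb.WriteString(`\t`)
		default:
			if r < 0x20 {
				// Remaining control characters have no short form.
				fmt.Fprintf(&sb, `\u%04x`, r)
			} else {
				sb.WriteRune(r)
			}
		}
	}
	sb.WriteByte('"')
	return sb.String()
}

func main() {
	fmt.Println(escapeJSONString("some-SSID\x01\b\x12$Hl\x03"))
}
```

With a branch like this, the 0x01, 0x12 and 0x03 bytes from the sample would be written as \u0001, \u0012 and \u0003 instead of being emitted raw.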

@johnkerl
Owner

Thank you @moschroe !! :)
