Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON column containing object or array of objects gets quoted as a string - breaks parsing downstream #221

Open
dhenson02 opened this issue Dec 12, 2021 · 7 comments
Labels

Comments

@dhenson02
Copy link

dhenson02 commented Dec 12, 2021

Updating a table with JSON columns will treat the row data as a standard string, adding erroneous double-quotes around the value, instead of returning as an array/object.

Example result (from .columns):

{"name":"account_no_history","value":"[{"acc":"3597","createdby":"Viztek,Pacs","createddt":"2015-09-24T17:14:30.728Z"}]"}

This throws when attempting to parse (tried jq and Node)

@dhenson02 dhenson02 changed the title JSON column containing array of objects gets quoted as a string - breaks parsing downstream JSON column containing object or array of objects gets quoted as a string - breaks parsing downstream Dec 12, 2021
@eulerto eulerto added the bug label Jan 26, 2022
@Venryx
Copy link

Venryx commented Jan 30, 2022

I have the same issue from Rust, using the tokio-postgres crate.

For jsonb maps/objects, it's not so big a deal, because I can just reparse the string using standard JSON parsers.

However, that is not the case for arrays! Arrays are stored in some other format, where a single value (without quotes in it) looks like this:

{example}

whereas others look like this:

{"example \"text\" with quotes in it",123123}

This inconsistency is annoying, and makes custom parsing harder.

Prior to an official resolution of this issue, does anyone know the name of the format used above, so I can find a parser for it? (I'd rather not rely on my own attempt to escape quotes and such properly)

@Venryx
Copy link

Venryx commented Jan 30, 2022

As I don't know the name of the format, I made a quick attempt at writing a parser for it in Rust.

I am very new to Rust, so the only way I was able to get it to compile was by unrolling the end_current_entry function, and changing the current_entry_str variable from a nice Option<String> to the inferior plain String (inferior because the meaning of an empty string is less clear than the None enum).

Anyway, here it is:

use super::type_aliases::JSONValue;

#[cfg(test)]
mod tests {
    use crate::utils::postgres_parsing::parse_postgres_array_as_strings;

    #[test]
    fn simple() {
        let simple_source = "{example}";
        let simple_result = parse_postgres_array_as_strings(simple_source);
        assert_eq!(simple_result, vec!["example"]);
    }
    #[test]
    fn escaped() {
        let escaped_source = r#"{"example \"text\" with quotes in it",123123}"#;
        let escaped_result = parse_postgres_array_as_strings(escaped_source);
        assert_eq!(escaped_result, vec![r#"example "text" with quotes in it"#, "123123"]);
    }
}

/// See: https://github.com/eulerto/wal2json/issues/221#issuecomment-1025143441
/// View the tests above for examples, and intended functionality.
pub fn parse_postgres_array(array_str: &str) -> JSONValue {
    let result_as_strings: Vec<String> = parse_postgres_array_as_strings(array_str);
    let result_as_json_value_strings = result_as_strings.into_iter().map(|a| serde_json::Value::String(a)).collect();
    let result_within_json_array = serde_json::Value::Array(result_as_json_value_strings);
    result_within_json_array
}
pub fn parse_postgres_array_as_strings(array_str: &str) -> Vec<String> {
    let chars_struct = array_str.chars();
    let chars = chars_struct.collect::<Vec<char>>();
    let mut result_as_strings: Vec<String> = vec![];

    let mut in_quote = false;
    let mut in_entry = false;
    let mut last_char_was_escape_backslash = false;
    //let mut current_entry_str: Option<String> = None;
    let mut current_entry_str: String = String::new(); // empty means none

    /*let mut end_current_entry = || {
        result_as_strings.push(current_entry_str.unwrap());
        current_entry_str = None;
        in_quote = false;
        in_entry = false;
    };*/

    //for (i, ch) in chars.enumerate() {
    //let chars_length = chars.into_iter().count();
    let chars_length = chars.len();
    let mut i = 0;
    for ch in chars {
        match last_char_was_escape_backslash {
            true => {
                last_char_was_escape_backslash = false;
                //current_entry_str.unwrap().push(ch);
                current_entry_str.push(ch);
            }
            false => {
                match ch {
                    '{' if i == 0 => {},
                    '}' if i == chars_length - 1 => {
                        //if current_entry_str.is_some() {
                        if current_entry_str.len() > 0 {
                            //end_current_entry();
                            {
                                /*result_as_strings.push(current_entry_str.unwrap());
                                current_entry_str = None;*/
                                result_as_strings.push(current_entry_str);
                                current_entry_str = String::new();
                                in_quote = false;
                                in_entry = false;
                            }
                        }
                    },
                    '\\' => {
                        last_char_was_escape_backslash = true;
                    },
                    '"' => {
                        in_quote = !in_quote;
                        // if just left a quote
                        if !in_quote {
                            //end_current_entry();
                            {
                                result_as_strings.push(current_entry_str);
                                current_entry_str = String::new();
                                in_quote = false;
                                in_entry = false;
                            }
                        }
                    },
                    // ie. if just left a quote
                    ',' if !in_entry => {},
                    // if hit a separator after a non-quoted entry
                    ',' if in_entry && !in_quote => {
                        //end_current_entry();
                        {
                            result_as_strings.push(current_entry_str);
                            current_entry_str = String::new();
                            in_quote = false;
                            in_entry = false;
                        }
                    },
                    _ => {
                        // if hit start of entry
                        //if current_entry_str.is_none() {
                        if current_entry_str.len() == 0 {
                            //current_entry_str = Some(String::new());
                            current_entry_str = String::new();
                            in_entry = true;
                        }
                        current_entry_str.push(ch);
                    }
                };
            },
        };
        i += 1;
    }
    result_as_strings
}

It seems to work for the basic cases I've tried, but I think it's likely there are cases it fails on.

That's part of why it'd be nice to have wal2json emit valid objects/arrays for jsonb fields -- or at least a valid json string, so callers can parse the data without writing brittle parse functions like the above.

Venryx added a commit to debate-map/app that referenced this issue Jan 30, 2022
…unctionality! (it was more straightforward than I expected)

* Made XLogData message parsing more robust. (old version broke in a case where a single "{" char showed up in the binary portion)
* Added parsing of the weird structure that wal2json uses for representing arrays. (see: eulerto/wal2json#221)
@eulerto
Copy link
Owner

eulerto commented Mar 13, 2022

I tried the following example:

CREATE TABLE xyz (
a integer,
b hstore,
c jsonb,
primary key(a)
);

INSERT INTO xyz (a, b, c) VALUES(1, 'a=>x, b=>y'::hstore, '{"name":"account_no_history","value":"[{"acc":"3597","createdby":"Viztek,Pacs","createddt":"2015-09-24T17:14:30.728Z"}]"}'::jsonb);

This example also covers issue #222 .

In format v1, I get:

{
        "change": [
                {
                        "kind": "insert",
                        "schema": "public",
                        "table": "xyz",
                        "columnnames": ["a", "b", "c"],
                        "columntypes": ["int4", "hstore", "jsonb"],
                        "columnvalues": [1, "\"a\"=>\"x\", \"b\"=>\"y\"", "{\"name\": \"account_no_history\", \"value\": [{\"acc\": \"3597\", \"createdby\": \"Viztek,Pacs\", \"createddt\": \"2015-09-24T17:14:30.728Z\"}]}"]
                }
        ]
}

In format v2, I get:

{"action":"B"}
{"action":"I","schema":"public","table":"xyz","columns":[{"name":"a","type":"integer","value":1},{"name":"b","type":"hstore","value":"\"a\"=>\"x\", \"b\"=>\"y\""},{"name":"c","type":"jsonb","value":"{\"name\": \"account_no_history\", \"value\": [{\"acc\": \"3597\", \"createdby\": \"Viztek,Pacs\", \"createddt\": \"2015-09-24T17:14:30.728Z\"}]}"}]}
{"action":"C"}

What is your version?

@mertezz
Copy link

mertezz commented Aug 29, 2023

We have the same issue with parsing stringified JSON values which makes is a little bit more complex to handle our format before the export.

Is there maybe also a plan to support native native JSON objects for JSON column types?

@fovc9
Copy link

fovc9 commented Jan 15, 2024

Using the v1 format on postgres 15, the output from a jsonb column is correct (though stringified) for me:

create table test(word text, doc jsonb);
commit;
insert into test values ('abc', '[1, 2, null, "abc"]');
commit;

Results in this output

{"change":[{"kind":"insert","schema":"public","table":"test","columnnames":["word","doc"],"columntypes":["text","jsonb"],"columnvalues":["abc","[1, 2, null, \"abc\"]"]}]}

I can run it through jq just fine (first call gets the new jsonb value as string, 2nd call accesses the jsonb value):

$ cat /tmp/wal2json.sample | jq -r '.change[0].columnvalues[1]' | jq
[
  1,
  2,
  null,
  "abc"
]

I used this dockerfile to test:

FROM postgres:15.5-bullseye
RUN apt-get update && apt-get install -y postgresql-15-wal2json
CMD ["postgres", "-c", "wal_level=logical"]

It'd definitely be nice to get jsonb fields "natively" parsed, but I wonder if this ticket should be a feature request instead of a bug?

Edit: Also tested with insert into test values ('abc', '[1, 2, null, "abc", {"x": ["a", "b", 3.1]}]'); to get an object and strings in the json output and it worked correctly again

@dhenson02
Copy link
Author

This issue is talking about JSON columns, not JSONB

@fovc9
Copy link

fovc9 commented Jan 17, 2024

Ah sorry about that. Misread the issue then

edgarrmondragon added a commit to MeltanoLabs/tap-postgres that referenced this issue Aug 28, 2024
Slack thread:
https://meltano.slack.com/archives/C06A1MD6A6L/p1724619858505739

When using log based replication, the wal2json output for columns of
array types returns a string encoded in sql format.
Ex: '{a,b}'

The records produced by the tap in this sutation fails the target schema
validation, since the schema is of type array and the value is a string.

I included a test case for it. Please feel free to modify the PR, I am
by no means a python developer.

Related:

- eulerto/wal2json#221 (comment)

---------

Co-authored-by: Edgar Ramírez Mondragón <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants