Can I assume that `pandoc-api-version` always appears first? #3211

sergiocorreia · 2016-11-03T19:40:17Z

On my system (and on the TravisCI systems), Pandoc outputs a predictable JSON:

>>> echo a | pandoc --to=json
{"pandoc-api-version":[1,17,0,4],"meta":{},"blocks":[{"t":"Para","c":[{"t":"Str","c":"a"}]}]}

However, I've got some reports * that this is not always the case:

>>> cat pftest.md | pandoc -t json
{"blocks":[{"t":"Para","c" ...

In both cases we had the latest Pandoc version:

    pandoc 1.18
    Compiled with pandoc-types 1.17.0.4, texmath 0.8.6.6, highlighting-kate 0.6.3

Is this a bug, or is there no guarantee about the order of the elements in a map? I monkey-patched panflute to work around different orders at the first level, but doing the same for all the "t" and "c" items would be a bit messy..

Thanks,
S

The text was updated successfully, but these errors were encountered:

mb21 · 2016-11-03T21:48:05Z

The JSON format defines objects as (emphasis added):

An object is an unordered collection of zero or more name/value
pairs, where a name is a string and a value is a string, number,
boolean, null, object, or array.

So no, it's not guaranteed that a certain name (like pandoc-api-version) in an object is always first. But this has nothing to do with pandoc, just the way JSON is defined...

sergiocorreia · 2016-11-03T22:09:28Z

Sure, it's unordered, but at the same time Pandoc controls the implementation which is why I'm asking. For instance, I haven't had a single instance of 't' appearing after 'c' , and I relied on that to speed up the JSON decoding significantly. Other advantages are:

It's easier to test if a filter package (pandocfilters, panflute, etc.) is correct if it's idempotent wrt the JSON input (I just see if two files are the same).
It's easier to diff (you can use standard git diff instead of specialized JSON diffs)
It's faster to process. If pandoc-api-version appears early on, I can read that and process the data conditional on the version

I don't know if writing a deterministic JSON is trivial on the Haskell side or not. If it's a big task and would mess up the codebase, then the clear choice would be to rewrite parts of panflute to be order-agnostic, even at the cost of speed. If it's completely trivial, then it might even make it easier for Pandoc when debugging.

On a related note, JSON feels quite verbose compared to native format, perhaps something like messagePack, or taking out the "t"/"c" tags could help in the longer term

jkr · 2016-11-04T08:23:29Z

No, you can't assume it. Pandoc uses the aeson library which doesn't guarantee it. In our tests, we saw the order change under windows, but not under linux/osx. In any case, since it's not part of the JSON spec, we can't expect it of other output.

sergiocorreia · 2016-11-04T13:52:12Z

Sounds reasonable

jgm · 2016-11-06T20:22:04Z

We could use aeson-pretty to make the order deterministic
and also make the JSON human-readable in other ways.

https://hackage.haskell.org/package/aeson-pretty-0.8.2/docs/Data-Aeson-Encode-Pretty.html

If we set indent to 0, we could get the deterministic order
without the extra whitespace.

Not sure it's worth it though. Anyone who consumes
pandoc JSON should use a JSON library rather than writing an
ad hoc parser.

jkr · 2016-11-06T22:52:39Z

Not sure it's worth it though. Anyone who consumes
pandoc JSON should use a JSON library rather than writing an
ad hoc parser.
My thoughts exactly. Plus, piping filters wouldn't necessarily work unless other lib writers guaranteed it too. Which would mean they couldn't just use a JSON lib.

mb21 · 2016-11-07T09:14:59Z

Anyone who consumes pandoc JSON should use a JSON library rather than writing an
ad hoc parser.

Agreed.

If we're discussing the JSON format already however (and apologies if this was discussed somewhere before, maybe along @jkr's rewrite of the JSON handling...?): Has a more self-contained and more self-explanatory JSON serialization of the pandoc AST been considered? For example instead of:

[{
  "t": "Header",
  "c": [1, ["foo-bar", [],
      [
        ["mykey", "myval"]
      ]
    ],
    [{
      "t": "Str",
      "c": "foo"
    }, {
      "t": "Space"
    }, {
      "t": "Str",
      "c": "bar"
    }]
  ]
}]

something like this:

[{
  t: "Header"
, level: 1
, id: "foo-bar"
, class: []
, kvs: {
    mykey: "myval"
  }
, text: [{
    t: "Str"
  , c: "foo bar"
  }]
}]

With this format we could easily add more fields to elements. And filters that don't know those fields would simply ignore them without breaking.

jgm · 2016-11-07T09:34:48Z

+++ Mauro Bieg [Nov 07 16 01:14 ]:

something like this:
[{
t: "Header"
, level: 1
, id: "foo-bar"
, class: []
, kvs: {
mykey: "myval"
}
, text: [{
t: "Str"
, c: "foo bar"
}]
}]

With this format we could easily add more fields to elements. And
filters that don't know those fields would simply ignore them without
breaking.

That's a great idea, and it's a shame we didn't consider
this when making breaking changes in pandoc-types 1.17.x.

@jkr what do you think?

sergiocorreia · 2016-11-07T15:08:14Z

Anyone who consumes
pandoc JSON should use a JSON library rather than writing an
ad hoc parser.

Definitely, and AFAIK everyone is doing that. However, in order to maintain the order of the loaded JSON, python allows objects to be passed as tuples instead of dicts. But I agree that depending on features not in the JSON standard would probably cause headaches down the line.

About Mauro's suggestion; agree that it makes reading and debugging the JSON way easier.

jkr · 2016-11-07T15:28:57Z

@jkr what do you think?

Looks clearer to me. The only issue I have is that it breaks the "t"/"c" model. I'm not sure if that's a big deal, but if we have "text" as a key here, why not have "text" as a key in "Str" too?

But I would like to replace an array with an object wherever possible. I guess this seems more consistent to me:

{t: "Header"
 c: {level: 1,
     attr: {id: "foo-bar"
        , class: []
        , kvs: {
        mykey: "myval"
        }
       },
     text: [{t: "Str",
         c: "foo bar"
        }
       ]
    }
}

mb21 · 2016-11-08T10:17:29Z

But I would like to replace an array with an object wherever possible.

Well, I'd use an array when you have a list of things and an object when you have key-value pairs...

if we have "text" as a key here, why not have "text" as a key in "Str" too?

Yeah, that was not really though through. The idea is to keep the JSON as simple as possible and only nest when you have child elements in the document tree (usually [Block] or [Inline]), i.e. only nest when you would have child elements in the generated HTML. Whether the field that does the nesting is named content, c or children I don't really care (text was not such a good idea I guess, since it can also contain other things than text). I'm not sure whether it should also be named content when you reach the leaves of the document tree (usually the Str element), since there the value is actually a JSON string, not an array of JSON objects, but why not... for a list, it's also an array of arrays.

With that in mind, here a fleshed out version (I'm using type and content here, but t and c is fine with me):

{
  pandoc-api-version: "1.17.1.0"
, meta: {}
, content: [{               //corresponds to [Block]
      type: "Header"
    , level: 1
    , id: "foo-bar"
    , class: []
    , kvs: {
        mykey: "myval"
      }
    , content: [{           //corresponds to [Inline]
        type: "Str"
      , content: "foo bar"  //corresponds to String
      }]
  }, {
      type: "Para"
    , content: [{
        type: "Str"
      , content: "my string"
      }]
  }, {
      type: "OrderedList"
    , startNumber: 1
    , style: "Decimal"
    , delim: "Period"
    , content: [[{          //corresponds to [[Block]]
        type: "Plain"
      , content: [{         //corresponds to [Inline]
          type: "Str"
        , content: "my list item text"
        }]
      }]]
    }
  ]
}

jkr · 2016-11-08T16:04:50Z

Well, I'd use an array when you have a list of things and an object when you have key-value pairs...

Yeah, I wasn't clear. I prefer objects to arrays when you're modeling structure: keys are better than having to figure out what something is by its position in a list. (It used to be that meta was json[0] and blocks was json[1], for example).

I'd be curious to see what sort of slowdown typing out "text" and "content" produces on large files running through filters. On a few of my tests, removing the empty lists ("c":[]) produced not-insignificant speed gains in itself.

sergiocorreia · 2016-11-08T16:16:07Z

I'd be curious to see what sort of slowdown typing out "text" and "content" produces on large files running through filters

A large slowdown would be my guess. Removing the empty lists reduced the size of some of my test files from ~ 1.8mb to 200kb, and extending the names of common fields would do the opposite.

dhimmel · 2019-03-18T16:02:45Z

We could use aeson-pretty to make the order deterministic and also make the JSON human-readable in other ways.
If we set indent to 0, we could get the deterministic order without the extra whitespace.

As of Pandoc 2.7.1 is JSON AST output deterministic? We are thinking of prettifying the JSON and tracking it using git. Therefore, deterministic ordering would be greatly helpful for having clean diffs. We could always sort by keys, but I imagine there is actually some intelligible order to the JSON that is best to preserve if possible.

jgm · 2019-03-18T16:36:43Z

I don't think the output is completely deterministic, because of maps in YAML metadata. (JSON serialization of hash maps is not deterministic.) Daniel Himmelstein <[email protected]> writes:

…

> We could use aeson-pretty to make the order deterministic and also make the JSON human-readable in other ways. If we set indent to 0, we could get the deterministic order without the extra whitespace. As of Pandoc 2.7.1 is JSON AST output deterministic? We are thinking of prettifying the JSON and tracking it using git. Therefore, deterministic ordering would be greatly helpful for having clean diffs. We could always sort by keys, but I imagine there is actually some intelligible order to the JSON that is best to preserve if possible -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: #3211 (comment)

sergiocorreia closed this as completed Nov 4, 2016

dhimmel mentioned this issue Mar 18, 2019

build: common intermediate JSON for all pandoc outputs manubot/rootstock#196

Closed

mb21 mentioned this issue Aug 29, 2019

Markdown footnotes are duplicated #1603

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can I assume that `pandoc-api-version` always appears first? #3211

Can I assume that `pandoc-api-version` always appears first? #3211

sergiocorreia commented Nov 3, 2016

mb21 commented Nov 3, 2016 •

edited

Loading

sergiocorreia commented Nov 3, 2016 •

edited

Loading

jkr commented Nov 4, 2016

sergiocorreia commented Nov 4, 2016

jgm commented Nov 6, 2016

jkr commented Nov 6, 2016

mb21 commented Nov 7, 2016

jgm commented Nov 7, 2016

sergiocorreia commented Nov 7, 2016

jkr commented Nov 7, 2016 •

edited

Loading

mb21 commented Nov 8, 2016

jkr commented Nov 8, 2016

sergiocorreia commented Nov 8, 2016 •

edited

Loading

dhimmel commented Mar 18, 2019 •

edited

Loading

jgm commented Mar 18, 2019 via email

Can I assume that pandoc-api-version always appears first? #3211

Can I assume that pandoc-api-version always appears first? #3211

Comments

sergiocorreia commented Nov 3, 2016

mb21 commented Nov 3, 2016 • edited Loading

sergiocorreia commented Nov 3, 2016 • edited Loading

jkr commented Nov 4, 2016

sergiocorreia commented Nov 4, 2016

jgm commented Nov 6, 2016

jkr commented Nov 6, 2016

mb21 commented Nov 7, 2016

jgm commented Nov 7, 2016

sergiocorreia commented Nov 7, 2016

jkr commented Nov 7, 2016 • edited Loading

mb21 commented Nov 8, 2016

jkr commented Nov 8, 2016

sergiocorreia commented Nov 8, 2016 • edited Loading

dhimmel commented Mar 18, 2019 • edited Loading

jgm commented Mar 18, 2019 via email

Can I assume that `pandoc-api-version` always appears first? #3211

Can I assume that `pandoc-api-version` always appears first? #3211

mb21 commented Nov 3, 2016 •

edited

Loading

sergiocorreia commented Nov 3, 2016 •

edited

Loading

jkr commented Nov 7, 2016 •

edited

Loading

sergiocorreia commented Nov 8, 2016 •

edited

Loading

dhimmel commented Mar 18, 2019 •

edited

Loading