Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can I assume that pandoc-api-version always appears first? #3211

Closed
sergiocorreia opened this issue Nov 3, 2016 · 15 comments
Closed

Can I assume that pandoc-api-version always appears first? #3211

sergiocorreia opened this issue Nov 3, 2016 · 15 comments

Comments

@sergiocorreia
Copy link

On my system (and on the TravisCI systems), Pandoc outputs a predictable JSON:

>>> echo a | pandoc --to=json
{"pandoc-api-version":[1,17,0,4],"meta":{},"blocks":[{"t":"Para","c":[{"t":"Str","c":"a"}]}]}

However, I've got some reports * that this is not always the case:

>>> cat pftest.md | pandoc -t json
{"blocks":[{"t":"Para","c" ...

In both cases we had the latest Pandoc version:

    pandoc 1.18
    Compiled with pandoc-types 1.17.0.4, texmath 0.8.6.6, highlighting-kate 0.6.3

Is this a bug, or is there no guarantee about the order of the elements in a map? I monkey-patched panflute to work around different orders at the first level, but doing the same for all the "t" and "c" items would be a bit messy..

Thanks,
S

@mb21
Copy link
Collaborator

mb21 commented Nov 3, 2016

The JSON format defines objects as (emphasis added):

An object is an unordered collection of zero or more name/value
pairs, where a name is a string and a value is a string, number,
boolean, null, object, or array.

So no, it's not guaranteed that a certain name (like pandoc-api-version) in an object is always first. But this has nothing to do with pandoc, just the way JSON is defined...

@sergiocorreia
Copy link
Author

sergiocorreia commented Nov 3, 2016

Sure, it's unordered, but at the same time Pandoc controls the implementation which is why I'm asking. For instance, I haven't had a single instance of 't' appearing after 'c' , and I relied on that to speed up the JSON decoding significantly. Other advantages are:

  • It's easier to test if a filter package (pandocfilters, panflute, etc.) is correct if it's idempotent wrt the JSON input (I just see if two files are the same).
  • It's easier to diff (you can use standard git diff instead of specialized JSON diffs)
  • It's faster to process. If pandoc-api-version appears early on, I can read that and process the data conditional on the version

I don't know if writing a deterministic JSON is trivial on the Haskell side or not. If it's a big task and would mess up the codebase, then the clear choice would be to rewrite parts of panflute to be order-agnostic, even at the cost of speed. If it's completely trivial, then it might even make it easier for Pandoc when debugging.

On a related note, JSON feels quite verbose compared to native format, perhaps something like messagePack, or taking out the "t"/"c" tags could help in the longer term

@jkr
Copy link
Collaborator

jkr commented Nov 4, 2016

No, you can't assume it. Pandoc uses the aeson library which doesn't guarantee it. In our tests, we saw the order change under windows, but not under linux/osx. In any case, since it's not part of the JSON spec, we can't expect it of other output.

@sergiocorreia
Copy link
Author

Sounds reasonable

@jgm
Copy link
Owner

jgm commented Nov 6, 2016

We could use aeson-pretty to make the order deterministic
and also make the JSON human-readable in other ways.

https://hackage.haskell.org/package/aeson-pretty-0.8.2/docs/Data-Aeson-Encode-Pretty.html

If we set indent to 0, we could get the deterministic order
without the extra whitespace.

Not sure it's worth it though. Anyone who consumes
pandoc JSON should use a JSON library rather than writing an
ad hoc parser.

@jkr
Copy link
Collaborator

jkr commented Nov 6, 2016

Not sure it's worth it though. Anyone who consumes
pandoc JSON should use a JSON library rather than writing an
ad hoc parser.
My thoughts exactly. Plus, piping filters wouldn't necessarily work unless other lib writers guaranteed it too. Which would mean they couldn't just use a JSON lib.

@mb21
Copy link
Collaborator

mb21 commented Nov 7, 2016

Anyone who consumes pandoc JSON should use a JSON library rather than writing an
ad hoc parser.

Agreed.

If we're discussing the JSON format already however (and apologies if this was discussed somewhere before, maybe along @jkr's rewrite of the JSON handling...?): Has a more self-contained and more self-explanatory JSON serialization of the pandoc AST been considered? For example instead of:

[{
  "t": "Header",
  "c": [1, ["foo-bar", [],
      [
        ["mykey", "myval"]
      ]
    ],
    [{
      "t": "Str",
      "c": "foo"
    }, {
      "t": "Space"
    }, {
      "t": "Str",
      "c": "bar"
    }]
  ]
}]

something like this:

[{
  t: "Header"
, level: 1
, id: "foo-bar"
, class: []
, kvs: {
    mykey: "myval"
  }
, text: [{
    t: "Str"
  , c: "foo bar"
  }]
}]

With this format we could easily add more fields to elements. And filters that don't know those fields would simply ignore them without breaking.

@jgm
Copy link
Owner

jgm commented Nov 7, 2016

+++ Mauro Bieg [Nov 07 16 01:14 ]:

something like this:
[{
t: "Header"
, level: 1
, id: "foo-bar"
, class: []
, kvs: {
mykey: "myval"
}
, text: [{
t: "Str"
, c: "foo bar"
}]
}]

With this format we could easily add more fields to elements. And
filters that don't know those fields would simply ignore them without
breaking.

That's a great idea, and it's a shame we didn't consider
this when making breaking changes in pandoc-types 1.17.x.

@jkr what do you think?

@sergiocorreia
Copy link
Author

Anyone who consumes
pandoc JSON should use a JSON library rather than writing an
ad hoc parser.

Definitely, and AFAIK everyone is doing that. However, in order to maintain the order of the loaded JSON, python allows objects to be passed as tuples instead of dicts. But I agree that depending on features not in the JSON standard would probably cause headaches down the line.

About Mauro's suggestion; agree that it makes reading and debugging the JSON way easier.

@jkr
Copy link
Collaborator

jkr commented Nov 7, 2016

@jkr what do you think?

Looks clearer to me. The only issue I have is that it breaks the "t"/"c" model. I'm not sure if that's a big deal, but if we have "text" as a key here, why not have "text" as a key in "Str" too?

But I would like to replace an array with an object wherever possible. I guess this seems more consistent to me:

{t: "Header"
 c: {level: 1,
     attr: {id: "foo-bar"
        , class: []
        , kvs: {
        mykey: "myval"
        }
       },
     text: [{t: "Str",
         c: "foo bar"
        }
       ]
    }
}

@mb21
Copy link
Collaborator

mb21 commented Nov 8, 2016

But I would like to replace an array with an object wherever possible.

Well, I'd use an array when you have a list of things and an object when you have key-value pairs...

if we have "text" as a key here, why not have "text" as a key in "Str" too?

Yeah, that was not really though through. The idea is to keep the JSON as simple as possible and only nest when you have child elements in the document tree (usually [Block] or [Inline]), i.e. only nest when you would have child elements in the generated HTML. Whether the field that does the nesting is named content, c or children I don't really care (text was not such a good idea I guess, since it can also contain other things than text). I'm not sure whether it should also be named content when you reach the leaves of the document tree (usually the Str element), since there the value is actually a JSON string, not an array of JSON objects, but why not... for a list, it's also an array of arrays.

With that in mind, here a fleshed out version (I'm using type and content here, but t and c is fine with me):

{
  pandoc-api-version: "1.17.1.0"
, meta: {}
, content: [{               //corresponds to [Block]
      type: "Header"
    , level: 1
    , id: "foo-bar"
    , class: []
    , kvs: {
        mykey: "myval"
      }
    , content: [{           //corresponds to [Inline]
        type: "Str"
      , content: "foo bar"  //corresponds to String
      }]
  }, {
      type: "Para"
    , content: [{
        type: "Str"
      , content: "my string"
      }]
  }, {
      type: "OrderedList"
    , startNumber: 1
    , style: "Decimal"
    , delim: "Period"
    , content: [[{          //corresponds to [[Block]]
        type: "Plain"
      , content: [{         //corresponds to [Inline]
          type: "Str"
        , content: "my list item text"
        }]
      }]]
    }
  ]
}

@jkr
Copy link
Collaborator

jkr commented Nov 8, 2016

Well, I'd use an array when you have a list of things and an object when you have key-value pairs...

Yeah, I wasn't clear. I prefer objects to arrays when you're modeling structure: keys are better than having to figure out what something is by its position in a list. (It used to be that meta was json[0] and blocks was json[1], for example).

I'd be curious to see what sort of slowdown typing out "text" and "content" produces on large files running through filters. On a few of my tests, removing the empty lists ("c":[]) produced not-insignificant speed gains in itself.

@sergiocorreia
Copy link
Author

sergiocorreia commented Nov 8, 2016

I'd be curious to see what sort of slowdown typing out "text" and "content" produces on large files running through filters

A large slowdown would be my guess. Removing the empty lists reduced the size of some of my test files from ~ 1.8mb to 200kb, and extending the names of common fields would do the opposite.

@dhimmel
Copy link

dhimmel commented Mar 18, 2019

We could use aeson-pretty to make the order deterministic and also make the JSON human-readable in other ways.
If we set indent to 0, we could get the deterministic order without the extra whitespace.

As of Pandoc 2.7.1 is JSON AST output deterministic? We are thinking of prettifying the JSON and tracking it using git. Therefore, deterministic ordering would be greatly helpful for having clean diffs. We could always sort by keys, but I imagine there is actually some intelligible order to the JSON that is best to preserve if possible.

@jgm
Copy link
Owner

jgm commented Mar 18, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants