-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce pretty printing tool. #89
Conversation
I wonder if you couldn't manage to fit something semantically into a json or toml or such format, while still achieving these semantics. perhaps by sticking the extra type information in comments or in a |
I think the answer to "can I cram X into Y" is, generally, "yes". The question is whether it results in ergonomics that are pleasant, and (in this case) whether it communicates the thing I want to communicate. (Could we reserve magic keys? Sure. But it would be broken for some data which conflicts with the magic keys, despite that data being valid, so it would communicate horribly backwards things about how seriously we take isomorphism and completeness in this project.) (Could we double the depth of the whole tree, and cram metadata in ever odd-numbered depth? Definitely. But it would start to suck horribly for human reading in demos and debug printf'ing, which is a primary objective here.) And technicalities aside, a goal I have for a pretty printer is to be very obviously not aimed for being parseable again. It should scream "I'm a debug dump!" and ooze "This is not a serious codec!" from every single line. I dunno, maybe I'll regret this in the light of the morning... but at the moment I'm feeling that using a real codec as the debug printf format would be a mistake even if it wasn't technically problematic. It might convince an incautious user that doing |
I should also add that this currently doesn't implement elision of deep nodes, or elisions of the middle of long lists, long bytes or strings, or big maps, but I think it probably should. This would be useful for when you want a quick check of "wtf is this variable" as you're hacking, but don't want to overwhelm your terminal. I don't know whether we'd want to default any of those elisions to "on" in if we started using this (or something like it) as Stringers for nodes, but... maybe, yes. |
I'm not at all attached to this particular set of concatenations, either. (Saying " |
This is quite nice for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong opinions on "plaintext" vs json, but I do agree that reusing something that already exists might be nicer. I'm especially worried about all the extra code this seems to be adding.
pretty/marshal.go
Outdated
case false: | ||
st.write("bool node: false") | ||
return st.checkErr() | ||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this case is quite literally unreachable though, so you can just remove it
or even better, use an if/else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like to stick to uses switch cases when doing marshal/unmarshal stuff.
But it encounters a problem on bools: the golang compiler doesn't currently seem to recognize that booleans are exhausted after these two cases, and insists on this panic. If the panic isn't present, the whole function is considered to have a branch that doesn't have a return statement!
I suppose I might as well change to if
, though, yes.
pretty/marshal.go
Outdated
_, st.err = st.w.Write(b) | ||
} | ||
} | ||
func (st *encodeState) checkErr() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I get the point of this func
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I should remove this. I must've expected it would do more work at first, and then didn't "gc" when it didn't turn out to be so.
I think that's a valid point, but you're not implementing any |
To add one more idea - perhaps we could export a |
In the light of the morning, I'll admit I'm not entirely a fan of everything I did here. I'm going to try to break down some lessons learned and comment on those here, and what happens to the code, well, we'll see. First off, This is not actually particularly pretty. Nuff said. I'm not attached to the output here. The structure of the code might be fine, but the actual series of glyphs that get output are not very visually appealing. The observation that we might want single-liner outputs for many debug purposes is probably true. That's sort of possible with this format, but not entirely unambiguous. (The output for Bytes in particular would look ridiculous if the Generally speaking, if we keep this code, I'd probably expect to iterate on the output. I do still think the idea of a non-codec is useful -- both for the {it sometimes might be bad to use a codec because that communicates the wrong thing} reason, and also for the {i'd like to have debug output that tells me more than a data model tree would, because it's a debug output} reason. But that's not to say this format is really nailing that. Or rather, it is hitting those two goals, but there's a lot of other outputs that would also hit those goals, and be better on other axis at the same time. I've also put some further thought into "Well, what if we made a codec that just supported comment fields or something ignored when parsing back, and used that for the extra debug info?" and I'm still not feeling warm towards that -- the odds that such a codec would miscommunicate critical concepts about what's datamodel vs what's implementation detail looms big in my concerns. (Crockford's opinions about comments in JSON also loom sizable in my mind, hereabouts -- I think it had some solid points.) I dunno. Despite my misgivings, maybe it's a least-bad sort of thing. I was hoping this would be interesting research for build-a-codec... but it either failed to be useful input, or, the answer is no. (I'm still not entirely clear on which; perhaps a bit of both.) The overall structure of the code certainly resembles the same sort of switch statements that appear in marshaller functions. But extracting some sort of gadget which would be constructed with a bunch of callbacks for all those case bodies... didn't seem at all valuable, once I got going. The amount of additional code that would've added would be substantial for both the dreaming-of-reusability part and for the actual use of it; there's really no win visible there. I'm not confident the lack of desirability of a build-a-codec gadget here is a downvote for the general case, though. This prettyprint function is exceptional in at least a couple of ways that are relevant: it doesn't make use of any tokenization abstractions (whereas codecs typically (if not always) do); and it doesn't have any resource limits attached to it (at least, again, not in the same way codecs do), both because those are less applicable during marshal (and we didn't build an unmarshal!), and also simply because... why would you bother with those for a debug function? So, the opportunity for sharable defaults was limited; and the tricky bits that make for a desire to have reusable core components also weren't a concern. We did end up with a The string encoding function... is actually the one part here that I am extremely enthusiastic about. (Believe it or not.) The docs on that function may seem a little flippant, but I did not make that function flippantly. The key detail of this string escaping function is that when it encounters non-UTF sequences, it escapes the bytes as it finds them. If we compare this to the JSON string encode function found in Golang's standard library, for example, it's consequentially different. The tests relating to that feature are also pretty clear that the behavior is lossy, and this is considered intentional: https://github.com/golang/go/blob/114719e16e9681bd1001326598ededa719c17944/src/encoding/json/decode_test.go#L670-L680 Now, mind, I'm not particularly mad at the golang implementation referenced here. A restriction to '\u' escapes is necessarily[‡] lossy. The I do not think that an escaping function that cannot describe all byte sequences is an acceptable string escaping function; and therefore I don't think what JSON specifies about escaping is good. I realize that's a powerful statement. I am indeed making it, though. It is not 8-bit clean, and I do not like that. This is something I feel fairly strongly about. After staring at the string situation in various guises over the last couple of years -- in the context of filesystems and the unixfs specs, in the context of serialization determinism, in the context of unicode canonicalization schemes, but most importantly and simply in the context and goal of not having lossy functions -- I've come to believe fairly adamantly that a string escaping function that's unable to encode all 8-bit bytes without loss is a catastrophic mistake. The string encoding function here simply encodes non-unicode bytes as [‡] - It may be that the claim " So! That was probably more text than you expected about that string escaping function. Some of this I probably shouldn't leave buried in a comment on this PR, because it's fairly consequential for IPLD specifications as a whole. Expect to hear more about this later and elsewhere; but for now it is at least here. |
I don't really want to make this all about UTF8 but I wouldn't mind hearing more on why you think I was wondering about how much you could extend this debugging stringifier to schema types. Can you see a logical path from what you have here to a format that also embeds extra data gathered through the schema lens? That might get kind of interesting if you can start to hang additional metadata on top of it. As it is now, I can't really see what else you can add that an existing codec (like dag-json) couldn't since it's all just data model. |
We should probably open a discussion over in the specs repo for this string discussion, but in brief: I wrote most of that screed before looking at UTF8-C8 again in detail, just to confess that first of all :) But now that I have looked at it more... I think "\xXX" escaping is still preferable where it's possible, yes.
So it makes sense to use UTF8-C8 if there's no other choice -- as would appear to be the case in JSON, for example, since the JSON spec says |
That's... a really good question. Something like that would take us way out of codec territory, and also probably be very useful. I don't know if I can visualize anything at the moment though. It would get really tricky anytime there's a representation strategy involved that goes from recursive structure on one side to scalar on the other (like structs with stringjoin representations, for example). Just printing the two things side by side might be one of the clearer options. It would be cool to make a debug format for this though, agree. I just don't know if I'm clever enough!
Well, the part of this PR that makes it "not data model" is: It's a tiny thing, but "not data model" is a rubicon line, and encoding that little bit of extra info crosses it. |
Also don't want to make it all about utf8, but this claim doesn't stand on its own
Consider that bytes will show up in pathing selectors. The difference there would be: The top one looks like 😎 perl from the turn of the century |
Are you trying to print any byte sequence string, or only valid unicode (utf-8) strings? If you only care about valid unicode, JSON would be fine for you. Since you seem to want any byte sequence, then go for https://golang.org/pkg/strconv/#Quote. This will just work, because Go strings can contain anything, including null bytes. And, similarly to Clean8, it will only use character escapes when necessary. This would be Go-specific, but you're writing a pretty-printer for Go nodes, so I think that's fine. If you do want it to be cross-language and standard, then I agree with the others that you should pick an existing string encoding standard, like Clean8 or JSON or whatever works for you. Coming up with a new standard seems unnecessary. |
Any. Definitely any. Reading now. Looks valid aka lossless. The most relevant sections of source appear to be approximately:
Interesting. Tries to use I'm... indifferent to this. This may deserve more reflection, but my first impression is: I don't actually care for quoting with To carry on with that "in most cases" thought: There are roughly three scenarios I can see being concerned with:
Situation 1 is most of IPLD. Situation 2 is what happens if something like the JSON spec straightjackets the possible escaping. Situation 3 doesn't really come up in IPLD (or if it does, we'd rather disregard it anyway because can't really entirely predict where it will come up, because it's based on user data and moods, and we're not in the business of guessing on those things because it tends to wreck make things more complicated). I'm not really sure I understand what situations exist where I'd want to escape some parts of unicode out of distrust for the thing rendering it. If there are some examples of these situations, I'd be interested. (I know golang JSON again provides interesting reference here: it specialcases |
I should clarify: I am not intending to push this to merge anytime soon, at this point. This PR has become the fuse-lighter for strings vs unicode discussions for IPLD, and I'm okay with that outcome. But we should move it out of this implementation PR and into our discussions during the community call tonight and in the specs repo in the future. I'll consider this PR roughly blocked until that conversation advances in some direction. |
I'm going to do a merge-ignore on this (more literally, The content explored here (and especially, the discussions generated) have had some merit and are worth keeping. But most of the good ideas have been picked up and carried further by other PRs that were written with the lessons learned here. So, I don't see any reason to try to carry this further. (If anything, it can be rebooted with a fresh take in the future.) |
The original idea of this branch was to explore some reusable components for codecs; maybe to actually get a useful prettyprinter; and... actually turned a lot into being a bit of practical discovery about string encoding and escaping. Some of those goals were partially accomplished, but in general, this seems better to chalk up as a learning experience. #89 (comment) already does a good job of discussing why, and what as learned. A lot of the reusable codec components stuff has also now shaken out, just... in other PRs that were written after the learnings here. Namely, #101 was able to introduce some tree transformers; and then #112 demonstrates how those can compose into a complete codec end to end. There's still work to go on these, too, but they seem to have already grabbed the concept of reusable parts I was hoping for here and gotten farther with it, so. These diffs are interesting enough I want to keep them referencable in history. But I'm merging them with the "-s ours" strategy, so that the diffs don't actually land any impact on master. These commits are for reference only.
f5669aa
to
1d1fc49
Compare
Introduce a pretty printing tool.
I think this, or a variant of it, may be reasonable to rig up as a Stringer on the basicnode types, and recommend for other Node implementations to use as their Stringer too.
It's a fairly verbose output: I'm mostly aiming to use it in examples. There's a lot of information that's redundant.
Bytes in particular are fun: I decided to make them use the hex.Dump format. (Why not?)
I've put this in a codec sub-package, because in some ways it feels like a codec(I've corrected this, it was a huge mistake; it's now its own packge outside of the codec subtree) -- it's something you can apply to any node, and it streams data out to an io.Writer -- but it's also worth noting it's not meant to be a multicodec or generally written with an intention of use anywhere outside of debug printf sorts of uses.The codectools package also introduced here, although it only has this one user, is a reaction to previous scenarios where I've wanted a quick debug method and desperately wanted something that gives me reasonable quoted strings... without reaching for a json package. We'll see if it's worth it over time; I'm betting yes, but not with infinite confidence. (This particular string escaping function it provides also has some particular details that make it distinctive from using json: it has the benefit of encoding even non-utf-8 strings without loss of information -- which is noteworthy, because I've recently noticed JSON does not; and I feel that's a bit yikes.)
This could be a solution to #88 . However, I've not yet hooked anything up to use this as a Stringer.
You can look at the
fluent/reflect_example_test.go
file for what this looks like to use.