Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement decoding of stringref (tag 25 & 256) #71

Closed
wants to merge 13 commits into from

Conversation

BurtHarris
Copy link

@BurtHarris BurtHarris commented Apr 25, 2018

Implement decoding stringref tags (tag 25 and 256) per http://cbor.schmorp.de/stringref

  • Stringref tags can be used to substantially compress CBOR streams containing repeated string values, such as map entry names. This change includes updates to the decoder so that streams encoded using stringref tags will be decoded correctly. I'll submit an encoding change separately.

  • Fixed test in cases.js which accidentally used tag 256 improperly. Changed this case to use the (currently unassigned) tag 1000.

  • Added tests in decoder.ava.js

To operate correctly this must be implemented in the decoder, because special nestable context is established by tag 256 for use by tag 25 during decoding. If no stringref-namespace (tag 256) is encountered in an stream, this change has no effect on decoding.

Ref #70

per <http://cbor.schmorp.de/stringref>
To operate correctly this must be implemented in the decoder, because special nestable context is established by tag 256 for use by tag 25.
Fixed test in cases.js which accidentially used tag 256 improperly.
Added tests in decoder.ava.js
@coveralls
Copy link

coveralls commented Apr 25, 2018

Coverage Status

Coverage increased (+0.005%) to 99.808% when pulling c8d3b58 on BurtHarris:stringref into f35349c on hildjj:master.

@BurtHarris
Copy link
Author

I've now got the coverage % increased rather than decreased.

Burt Harris added 3 commits April 24, 2018 20:09
It was generating errors on some versions of Node.js.
Probably promise related.
Add tests of invalid stringref inputs
@BurtHarris
Copy link
Author

OK, this is ready for a code review. Decode support only (so far.)

Burt Harris added 7 commits April 26, 2018 11:42
per <http://cbor.schmorp.de/stringref>
To operate correctly this must be implemented in the decoder, because special nestable context is established by tag 256 for use by tag 25.
Fixed test in cases.js which accidentially used tag 256 improperly.
Added tests in decoder.ava.js
It was generating errors on some versions of Node.js.
Probably promise related.
Add tests of invalid stringref inputs
@hildjj
Copy link
Owner

hildjj commented Apr 29, 2018

OK, I've read the spec you linked to, and the parts I understand I disagree with almost completely:

  • (Almost) all strings take up encoding space, except for ones that are too short. That approach is fraught with edge case that will lead to interop problems.
  • Even value strings that are unlikely to be repeated take up encoding space.
  • They nest. Let's learn our lesson from XML namespaces that we rarely need that complexity.
  • That spec needs a bunch more examples with more annotations to make it quicker to grasp

My alternate proposal (which I did code up enough to prove that it works, at one point):

  • 25(["foo", 0]) means "foo", but defines atom 0 to mean "foo"
  • 25(0) means "foo", as a reference to the atom defined above

as with your suggested approach, you need two passes through the input to calculate the best compression, but that's straightforward:

  1. keep map of string -> count
  2. drop all strings with count < 4 (ish. whatever the overhead of defining the atom is)
  3. sort by string.length * count
  4. assign each an atom number from the series (0, 1, -1, 2, -2, ...) (note: (1-(2*n+1)*Math.pow(-1, n)) / 4). If the amount of compression doesn't warrant an atom at the point it is assigned, skip it if you like, but it doesn't hurt anything if it's assigned.
  5. (optional) save atom mapping for next time, so you can skip this step with similar data
  6. take real encoding pass. The first time you see a string that has an atom, encode it as an array. Subsequent times, encode it as an integer.

All of the complexity is in the encoder, and that complexity can be amortized over multiple runs if you like.

@hildjj
Copy link
Owner

hildjj commented Apr 29, 2018

However, if multiple people want this (thumbs-up the original PR comment to indicate), I'll merge it anyway.

@BurtHarris
Copy link
Author

BurtHarris commented Apr 30, 2018

@hildjj I'm not the originator of the tag 25/256 spec, but it's been an acknowledged extension of CBOR since RFC 7049 was published. Like you, I could suggest improvements to it, but thought I'd start with an implementation based on that existing tag 25/256 spec, rather than try to overload tag 25 with multiple meanings.

Annother alternate approach, using tag 28/29 on strings does something roughly equivalent to the alternate you propose. Plus it would allow smart one-pass encoding by tagging only those strings which are already interned by JavaScript. (This would cover most object fields since any string appearing in JavaScript code automatically gets interned.) Tag 28/29 is also is also how you would address the non-hierarchal data problems (objects references and cycles), and the decode logic could be shared, but good encoding of references/cycles is a more complex problem than interned strings, shared objects will generally require two passes.

Both tag 25/256 and 28/29 are a bit different from other tags in that they really need to be built into to core encoder/decoder logic. Neither seems perfect, but some experience implementing and using them might lead to a better alternative than either. But all that seems like subject for a separate design after experimenting some with the existing ones that Mark Lehmann already drew up.

Dynamic space-optimized CBOR would probably be best addressed by a gzip postprocessing, but the impact on complexity and speed isn't that attractive to me right now, and even with gzip the value/string sharing would probably still improve compression.

@BurtHarris
Copy link
Author

Till you make up your mind about this, I'm going to postpone implementing the encoder changes. Let me know what you think about the tag 28/29 concept.

Base automatically changed from master to main February 22, 2021 07:11
@hildjj
Copy link
Owner

hildjj commented Sep 13, 2022

If you still want this, the best thing to do is to create a plugin package like cbor-bigdecimal. I'll even take it as a patch to this repo so you don't have to publish and maintain it separately.

I'm going to close this PR, however, since that changeset will be pretty drastically different than this one. Please submit a new PR with the plugin approach if you like.

@hildjj hildjj closed this Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants