Implement decoding of stringref (tag 25 & 256) #71

BurtHarris · 2018-04-25T01:59:48Z

Implement decoding stringref tags (tag 25 and 256) per http://cbor.schmorp.de/stringref

Stringref tags can be used to substantially compress CBOR streams containing repeated string values, such as map entry names. This change includes updates to the decoder so that streams encoded using stringref tags will be decoded correctly. I'll submit an encoding change separately.
Fixed test in cases.js which accidentally used tag 256 improperly. Changed this case to use the (currently unassigned) tag 1000.
Added tests in decoder.ava.js

To operate correctly, this must be implemented in the decoder itself, because tag 256 establishes a special, nestable context that tag 25 uses during decoding. If no stringref-namespace (tag 256) is encountered in a stream, this change has no effect on decoding.
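To illustrate why the context has to live inside the decoder, here is a minimal sketch of the namespace logic. It is not the code in this PR and does not use the node-cbor API: it works on a hypothetical, already-parsed tree where a tagged item is written as `{ tag, value }`, and it only considers text strings. The minimum-length thresholds in `minLengthForIndex` are an assumption based on my reading of the stringref spec and should be checked against it.

```javascript
// Toy model: an item is a string, a number, an array, or a tagged item
// written as { tag, value }. This shape is hypothetical, not node-cbor's.

// Minimum byte length a string needs before it is given an index, based on
// the next free index. ASSUMPTION: thresholds taken from the stringref spec.
function minLengthForIndex(index) {
  if (index <= 23) return 3
  if (index <= 255) return 4
  if (index <= 65535) return 5
  if (index <= 4294967295) return 7
  return 11
}

function resolveStringrefs(item, namespaces = []) {
  if (typeof item === 'string') {
    const table = namespaces[namespaces.length - 1]
    // Outside any tag 256 namespace, strings are never recorded.
    if (table && Buffer.byteLength(item, 'utf8') >= minLengthForIndex(table.length)) {
      table.push(item)
    }
    return item
  }
  if (Array.isArray(item)) {
    return item.map(child => resolveStringrefs(child, namespaces))
  }
  if (item !== null && typeof item === 'object' && 'tag' in item) {
    if (item.tag === 256) {
      // Tag 256 opens a fresh table; the outer one is back in effect
      // once this call returns, which is what makes namespaces nestable.
      return resolveStringrefs(item.value, [...namespaces, []])
    }
    if (item.tag === 25) {
      const table = namespaces[namespaces.length - 1]
      const index = item.value
      if (!table || !Number.isInteger(index) || index < 0 || index >= table.length) {
        throw new Error(`invalid stringref: ${index}`)
      }
      return table[index]
    }
    // Any other tag is passed through with its content resolved.
    return { tag: item.tag, value: resolveStringrefs(item.value, namespaces) }
  }
  return item
}
```

In this model, `resolveStringrefs({ tag: 256, value: ['name', 'ab', { tag: 25, value: 0 }] })` yields `['name', 'ab', 'name']`: `'ab'` is too short to be given an index, so index 0 still refers to `'name'`.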

Ref #70

Per <http://cbor.schmorp.de/stringref>. To operate correctly, this must be implemented in the decoder, because a special nestable context is established by tag 256 for use by tag 25. Fixed a test in cases.js which accidentally used tag 256 improperly. Added tests in decoder.ava.js.

coveralls · 2018-04-25T02:02:47Z

Coverage increased (+0.005%) to 99.808% when pulling c8d3b58 on BurtHarris:stringref into f35349c on hildjj:master.

BurtHarris · 2018-04-25T02:35:54Z

I've now got the coverage % increased rather than decreased.

It was generating errors on some versions of Node.js. Probably promise related.

Add tests of invalid stringref inputs

BurtHarris · 2018-04-25T22:06:04Z

OK, this is ready for a code review. Decode support only (so far.)


hildjj · 2018-04-29T15:56:40Z

OK, I've read the spec you linked to, and the parts I understand I disagree with almost completely:

(Almost) all strings take up encoding space, except for ones that are too short. That approach is fraught with edge cases that will lead to interop problems.
Even value strings that are unlikely to be repeated take up encoding space.
They nest. Let's learn the lesson of XML namespaces: we rarely need that complexity.
That spec needs many more examples, with more annotations, to make it quicker to grasp.

My alternate proposal (which I did code up enough to prove that it works, at one point):

25(["foo", 0]) means "foo", but defines atom 0 to mean "foo"
25(0) means "foo", as a reference to the atom defined above
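A minimal sketch of how a decoder might resolve this alternate form, assuming a hypothetical pre-parsed tree in which a tagged item is written as `{ tag, value }` (this is not the node-cbor API, and not the prototype mentioned above):

```javascript
// Toy model: tagged items are { tag, value }; everything else is plain data.
function resolveAtoms(item, atoms = new Map()) {
  if (Array.isArray(item)) {
    return item.map(child => resolveAtoms(child, atoms))
  }
  if (item !== null && typeof item === 'object' && item.tag === 25) {
    const content = item.value
    if (Array.isArray(content)) {
      // Definition form: 25([string, atom]) yields the string and
      // remembers it under the given atom number.
      const [str, atom] = content
      atoms.set(atom, str)
      return str
    }
    // Reference form: 25(atom) yields the previously defined string.
    if (!atoms.has(content)) {
      throw new Error(`undefined atom: ${content}`)
    }
    return atoms.get(content)
  }
  return item
}
```

With this sketch, `resolveAtoms([{ tag: 25, value: ['foo', 0] }, { tag: 25, value: 0 }])` yields `['foo', 'foo']`. A single flat table is enough here because nothing nests.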

As with your suggested approach, you need two passes through the input to calculate the best compression, but that's straightforward:

keep a map of string -> count
drop all strings with count < 4 (or thereabouts; whatever the overhead of defining the atom is)
sort by string.length * count
assign each an atom number from the series (0, 1, -1, 2, -2, ...) (note: the nth term, counting from n = 0, is (1-(2*n+1)*Math.pow(-1, n)) / 4). If the amount of compression doesn't warrant an atom at the point it is assigned, skip it if you like, but it doesn't hurt anything if it's assigned.
(optional) save the atom mapping for next time, so you can skip this step with similar data
take the real encoding pass. The first time you see a string that has an atom, encode it as an array. Subsequent times, encode it as an integer.
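The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the prototype mentioned in the comment: the input is simplified to a flat list of strings, the output uses a hypothetical `{ tag, value }` shape instead of real CBOR bytes, and the sort is assumed to be descending so that the strings with the largest potential savings get the atom numbers closest to zero (CBOR encodes integers from -24 to 23 in a single byte).

```javascript
// nth atom number in the series 0, 1, -1, 2, -2, ... (n counts from 0).
function atomNumber(n) {
  return (1 - (2 * n + 1) * Math.pow(-1, n)) / 4
}

// First pass: count strings, drop rare ones, rank the rest, assign atoms.
// minCount = 4 follows the "count < 4 (ish)" rule of thumb.
function assignAtoms(strings, minCount = 4) {
  const counts = new Map()
  for (const s of strings) {
    counts.set(s, (counts.get(s) || 0) + 1)
  }
  const ranked = [...counts]
    .filter(([, count]) => count >= minCount)
    // ASSUMPTION: largest string.length * count first.
    .sort(([a, countA], [b, countB]) => b.length * countB - a.length * countA)
  const atoms = new Map()
  ranked.forEach(([s], n) => atoms.set(s, atomNumber(n)))
  return atoms
}

// Second pass: the first use of a string defines its atom, later uses refer to it.
function encodeWithAtoms(strings, atoms) {
  const seen = new Set()
  return strings.map(s => {
    if (!atoms.has(s)) return s
    const atom = atoms.get(s)
    if (seen.has(s)) return { tag: 25, value: atom }
    seen.add(s)
    return { tag: 25, value: [s, atom] }
  })
}
```

The map returned by `assignAtoms` is the piece that could be saved and reused for similar data, which is how the cost of the first pass can be amortized over multiple runs.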

All of the complexity is in the encoder, and that complexity can be amortized over multiple runs if you like.

hildjj · 2018-04-29T15:58:09Z

However, if multiple people want this (thumbs-up the original PR comment to indicate), I'll merge it anyway.

BurtHarris · 2018-04-30T00:25:47Z

@hildjj I'm not the originator of the tag 25/256 spec, but it's been an acknowledged extension of CBOR since RFC 7049 was published. Like you, I could suggest improvements to it, but thought I'd start with an implementation based on that existing tag 25/256 spec, rather than try to overload tag 25 with multiple meanings.

Another alternate approach, using tags 28/29 on strings, does something roughly equivalent to the alternative you propose. It would also allow smart one-pass encoding by tagging only those strings which are already interned by JavaScript. (This would cover most object fields, since any string appearing in JavaScript code automatically gets interned.) Tags 28/29 are also how you would address the non-hierarchical data problems (object references and cycles), and the decode logic could be shared. Good encoding of references and cycles is a more complex problem than interned strings, though: shared objects will generally require two passes.
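As a rough illustration of the tag 28/29 idea applied to strings, here is a sketch over a hypothetical pre-parsed tree where a tagged item is written as `{ tag, value }`. It is not node-cbor code, and it deliberately ignores cycles: a tag 29 reference to a value that is still being decoded would come back as `undefined` in this simplified version.

```javascript
// Toy model: tag 28 marks a value as shareable, tag 29 refers back to a
// previously marked value by its index (0 for the first tag 28 seen).
function resolveShared(item, shared = []) {
  if (Array.isArray(item)) {
    return item.map(child => resolveShared(child, shared))
  }
  if (item !== null && typeof item === 'object' && 'tag' in item) {
    if (item.tag === 28) {
      // Reserve the index before decoding the content, so indices follow
      // the order in which tag 28 appears in the stream.
      const index = shared.length
      shared.push(undefined)
      const value = resolveShared(item.value, shared)
      shared[index] = value
      return value
    }
    if (item.tag === 29) {
      const index = item.value
      if (!Number.isInteger(index) || index < 0 || index >= shared.length) {
        throw new Error(`invalid shared reference: ${index}`)
      }
      return shared[index]
    }
  }
  return item
}
```

In this model, `resolveShared([{ tag: 28, value: 'hello' }, { tag: 29, value: 0 }])` yields `['hello', 'hello']`. Unlike stringref, only values the encoder explicitly tags take up an index.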

Both tags 25/256 and 28/29 are a bit different from other tags in that they really need to be built into the core encoder/decoder logic. Neither seems perfect, but some experience implementing and using them might lead to a better alternative than either. All that seems like a subject for a separate design, after some experimenting with the existing ones that Marc Lehmann already drew up.

Dynamic space-optimized CBOR would probably be best addressed by gzip postprocessing, but the impact on complexity and speed isn't that attractive to me right now, and even with gzip, the value/string sharing would probably still improve compression.

BurtHarris · 2018-04-30T03:53:54Z

Till you make up your mind about this, I'm going to postpone implementing the encoder changes. Let me know what you think about the tag 28/29 concept.

hildjj · 2022-09-13T15:05:40Z

If you still want this, the best thing to do is to create a plugin package like cbor-bigdecimal. I'll even take it as a patch to this repo so you don't have to publish and maintain it separately.

I'm going to close this PR, however, since that changeset will be pretty drastically different than this one. Please submit a new PR with the plugin approach if you like.

Burt Harris added 2 commits April 24, 2018 19:14

Improve testing coverage of utils.minCborLength

c63c774

Improve coverage on invalid stringref case

7e4e38e

Burt Harris added 3 commits April 24, 2018 20:09

Added more handcrafted corner case error tests

5aed2b9

Revert previous test

32a008a

It was generating errors on some versions of Node.js. Probably promise related.

Add myself to contributors list

809f3e7

Add tests of invalid stringref inputs

Burt Harris added 7 commits April 26, 2018 11:42

Improve testing coverage of utils.minCborLength

ace42c2

Improve coverage on invalid stringref case

f5e4ebd

Added more handcrafted corner case error tests

1c132bb

Revert previous test

b608bfa

It was generating errors on some versions of Node.js. Probably promise related.

Add myself to contributors list

39fb21e

Add tests of invalid stringref inputs

Merge branch 'stringref' of https://github.com/BurtHarris/node-cbor i…

c8d3b58

…nto stringref

Base automatically changed from master to main February 22, 2021 07:11

hildjj closed this Sep 13, 2022
