
size of serialized DOM #151

Closed
eoghanmurray opened this issue Nov 25, 2019 · 37 comments

Comments

@eoghanmurray
Contributor

I'm seeing a 10x character size for the serialization of the initial DOM state (EventType.FullSnapshot) compared with a plain HTML representation of the same thing.
Is minimizing the size of this on the agenda as a design goal?

I'm thinking that it could be reduced as follows:

  • simple things like renaming attributes to attrs
  • not storing empty childNodes/attributes lists/objects (making them implicit)
  • removing type: 2 (type: NodeType.Element) and similar, as that can be inferred from the presence of childNodes
  • only setting the isSVG/isStyle boolean attributes if they have an unusual value (i.e. true)

Are there any strong reasons not to do any of the above?
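For illustration, here's roughly what those changes could look like on a single node (the short key names here are hypothetical, just to show the idea). A node currently serialized as:

{ "type": 2, "tagName": "div", "attributes": {}, "childNodes": [], "id": 42 }

could become something like:

{ "n": "div", "id": 42 }

where the element type is implied by the presence of a tag name, and the empty attributes/childNodes are simply omitted.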

@IMFIL
Contributor

IMFIL commented Nov 26, 2019

Reducing the size of the serialized DOM would be great. Did you want to create a PR?

@bingjie3216

+1 on this proposal. Besides that, what we can do is:
Provide the option of using relative paths for the styles and CSS instead of downloading the whole content. I feel the CSS/styles take up too much space.

@IMFIL
Contributor

IMFIL commented Nov 27, 2019

@bingjie3216 There's an option to keep absolute CSS paths within the HTML. Even with this option on, the DOM size is still sizeable.

@Yuyz0112
Member

@eoghanmurray @IMFIL @bingjie3216 Thanks for the feedback!

I believe there is huge potential to reduce the size of the recorded events (I always see a ~90% size reduction when I gzip the events).
But keeping the data structure explicit is also very important.

So the work on reducing size may contain the following parts:

  1. Build a sizer tool, which can show the distribution of event sizes. This helps:
    1.1 Find the size bottleneck for any specific situation.
    1.2 Check how the record options affect the size.
    1.3 Check how the pack strategies affect the size.
  2. Provide some pack/unpack strategies, which should be pluggable because they may introduce overhead.

Some pack/unpack strategies I know of include:

  • The way @eoghanmurray suggested, or similar, which is hand-made and rrweb-specific.
  • MessagePack. It's like JSON, but fast and small.
  • pako. A high-speed zlib port to JavaScript.
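As a rough sketch of what the pako strategy could look like (illustrative only, not a final API):

import { deflate, inflate } from 'pako';

// pack: turn an array of events into a compact deflated byte array
function packEvents(events) {
  return deflate(JSON.stringify(events));
}

// unpack: inflate back to a JSON string, then parse it into event objects again
function unpackEvents(packed) {
  return JSON.parse(inflate(packed, { to: 'string' }));
}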

@Yuyz0112
Member

Yuyz0112 commented Jan 5, 2020

Sorry for the delay.
After finishing a lot of work last month, I've finally got time to start working on rrweb again!

I think this issue is the most important one at the current stage, and I would like to provide a solution in the next major release.

With the ideas that I illustrated above, I have done some POC code in this repo.

Currently, I have implemented an analysis framework and several packers:

  1. simple packer. Following @eoghanmurray's comments, this packer makes the keys shorter and omits some keys which can be inferred from the data structure.
  2. msgpack packer. Uses msgpack-javascript to encode and decode events.
  3. pako packer. Uses pako to deflate and inflate events.

Right now the msgpack packer is not working as intended and I'm still checking my implementation. The other two show some good results when tested on two real-world event logs.

I'm using two real-world event logs to benchmark the packers:

  • e1: An event log with a big full snapshot.
  • e2: An event log with many incremental snapshots created by a table-like UI, which means the DOM structures are similar to each other.

Benchmark results:

packer   log   size         packedSize
simple   e1    2,115,468    1,870,789
simple   e2    10,457,884   6,023,940
pako     e1    2,115,468    1,093,306
pako     e2    10,457,884   1,435,585

@Yuyz0112 Yuyz0112 mentioned this issue Feb 18, 2020
@alexcroox

Pako seems like quite a big dependency (63% of the size of rrweb itself, if npm is to be believed). Are there any lighter-weight alternatives?

I bring it up because the bigger this library becomes, the less attractive it is to bundle for end users.

@Yuyz0112
Member

@ChuckJonas Thanks for the feedback.

I believe there are several important aspects when designing the packer plugin.

Efficiency

From the experiments mentioned above and others, it seems zlib-like compression algorithms are the most efficient at processing rrweb's events.

Data compression is not the only way to reduce the size of events. Users can also do things like the following (a quick sketch follows this list):

  1. Disable inline stylesheets if the original link will be available when replaying.
  2. Use the .rrweb-block class name to block the areas they do not care about.
  3. Disable mouse movement recording if it is not needed.
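For options 1 and 2, the record call could look roughly like this (a sketch; inlineStylesheet and blockClass are existing record options, though exact defaults may vary by version):

import { record } from 'rrweb';

record({
  emit(event) {
    // send the event to your backend
  },
  inlineStylesheet: false,   // option 1: keep stylesheet links instead of inlining their content
  blockClass: 'rrweb-block', // option 2: elements with this class are blocked from detailed recording
});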

But data compression is the simplest and most versatile way to do this. Users can add one line of code and see up to a 90% reduction in event size.

Although there are trade-offs between efficiency and the other aspects, efficiency is still the most important one since it affects MiB-to-TiB-scale data.

Browser runtime size

One of rrweb's advantages is its minimal runtime size (5.9 KiB gzipped for the recorder).

So if we decide to add a packer to rrweb, we will:

  1. Add it as a plugin, which means users can decide whether to load/bundle the packer; it's all on demand.
  2. Provide a minimal bundle for the packer. For example, end-users only need to load pako's inflate code, which is 8.1 KiB gzipped, while the whole of pako gzips to 14.5 KiB.

Besides that, users can still choose to run the packing process on their server instead of in end-users' browsers (think of it as the reverse of edge computing).
The sample code looks like this:

import { pack } from 'rrweb'

server.post('/events', (req, res) => {
  // pack the raw events received from the browser before persisting them
  const packedData = pack(req.body)
  saveDataToDB(packedData)
  res.send('Ok')
})

With this approach, the trade-off is that end-users will not load the pack plugin bundle, but they will still have a relatively high transfer data size, and your server becomes a centralized packing factory.

Simplicity

I prefer to provide an easy-to-use API for rrweb users, which means they can decide whether or not to pack with a simple boolean flag.

A table of several packer plugin choices, along with their trade-offs in bundle size, efficiency, CPU cost, etc., is not something I would like to ship in rrweb.

@alexcroox

Appreciate the detailed response, thank you; the planned plugin nature of it alleviates any concerns.

@jpwiddy

jpwiddy commented Feb 19, 2020

I like the idea of packing server-side to offload the work, but I also like the idea of doing it on the client to speed up the transfer of events and, ideally, slim down the network traffic size. Trade-offs, for sure. I'll probably implement it on the server personally, but would generally love documentation around this concept.

@shmilyoo

I have a question.
The actual front-to-back transfer size should be calculated like this:

const { deflate } = require('pako'); // pako's deflate
const packedString = deflate(JSON.stringify(events), { to: 'string' });
console.log('transfer size is: ', new Uint8Array(Buffer.from(packedString)).length);

Compared to the original unpacked object sent to the backend, the compression ratio is about 0.1 - 0.5.

@eoghanmurray
Contributor Author

the trade-off is that end-users will not load the pack plugin bundle, but will still have a relatively high transfer data size, and your server becomes a centralized packing factory.

Just a reminder that my original proposal related to being a bit more careful/efficient in the JSON format itself. Reducing the repetitive aspects of the original JSON would provide advantages in transmission as well as preempt much of the need for zipping, either client-side or server-side.

But keeping the data structure explicit is also very important.

Then why are numeric codes used instead of strings, e.g. 8 instead of 'TouchMove_Departed'?
(IMO these would actually be easier to work with if they were fully expanded.)

Here's a quick analysis of a sample JSON DOM structure showing how often each key occurs:

{ type: 560, childNodes: 218, name: 1, publicId: 1, systemId: 1, id: 560, tagName: 217, attributes: 217, textContent: 341, isStyle: 1 }

And here are the counts of empty values, e.g. { ... attributes: {}, ... }:
{ attributes: 79, childNodes: 47 }

(Here's the code I executed at the console to come up with these figures:

var counts = {};
var empty_counts = {};
var count_nodes = function(n) {
    for (var k in n) {
        counts[k] = (counts[k] || 0) + 1;
        // count empty objects/arrays such as attributes: {} or childNodes: []
        if (typeof n[k] === 'object' && n[k] !== null && Object.keys(n[k]).length === 0) {
            empty_counts[k] = (empty_counts[k] || 0) + 1;
        }
        if (Array.isArray(n[k])) {
            for (var i = 0; i < n[k].length; i++) {
                count_nodes(n[k][i]);
            }
        }
    }
};
count_nodes(e.data.node);

)

So by, e.g., abbreviating attributes -> a, textContent -> t, tagName -> n, childNodes -> c,
you'd effectively be doing a lot of what I imagine gzip is doing 'for free', and I don't think it would be any less legible to someone browsing the structure, as you'd usually be able to infer the meaning from the context (the value).

This could be done in a backwards-compatible way so that it's still possible to play back non-abbreviated content.
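A sketch of what that backwards compatibility might look like on the replay side (the abbreviated key names are hypothetical):

var keyMap = { a: 'attributes', t: 'textContent', n: 'tagName', c: 'childNodes' };

function expandNode(node) {
  var out = {};
  for (var k in node) {
    var key = keyMap[k] || k; // expand abbreviated keys; pass full keys through unchanged
    out[key] = key === 'childNodes' && Array.isArray(node[k])
      ? node[k].map(expandNode) // recurse into children
      : node[k];
  }
  return out;
}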

@Yuyz0112
Member

Yuyz0112 commented Mar 1, 2020

@eoghanmurray

Yes, shortening the JSON keys will help in some cases. But we are also seeing size issues in cases like:

  1. animations causing attribute changes
  2. long lists being created/destroyed, causing a lot of DOM changes

Consider a situation like this:

  1. visit a page with a complex table, maybe 50 rows, with fixed header columns and rich content cells
  2. go to a page with some SVG-based charts
  3. go back to the first table page

rrweb will collect a lot of data in the process, which can be greatly compressed by gzip (because the data is quite similar, e.g. every row of the table).

So I think introducing pako is a more general solution. But I'm very open to the packer plugin system; anyone can build a compatible ad-hoc packer plugin based on its interface.

@eoghanmurray
Contributor Author

Cool; for my use case I'll be sending events over a websocket connection as they happen, so I was only looking at compression in terms of compressing single events at a time, in particular the event which contains the initial DOM tree. Adding pako or similar wouldn't be a runner, as for my project the size of the .js deliverable is a big factor.
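(For context, a minimal sketch of that setup; the WebSocket endpoint is made up:)

import { record } from 'rrweb';

const socket = new WebSocket('wss://example.com/collect');

record({
  emit(event) {
    // each event is sent individually as it is produced, so any size savings
    // have to come from the encoding of single events rather than batch compression
    socket.send(JSON.stringify(event));
  },
});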

@Yuyz0112
Member

Yuyz0112 commented Apr 5, 2020

For anyone who is interested in this issue, the packer plugin API has finally been stabilized.

The proposed API looks like this:

/**
 * Now you can import the official pack and unpack functions from the rrweb package.
 *
 * The pack and unpack code is implemented in separate modules, so bundlers
 * can tree-shake them if you do not import them, which means there should be
 * no bundle size difference when you are not going to use the packer feature.
 */
import { record, pack, unpack, Replayer } from 'rrweb';

/**
 * When recording, you just need to pass pack as the packFn property to the
 * record function.
 */
record({
  emit(event) {
    // event is the result returned by the pack function
  },
  packFn: pack
})

/**
 * When replaying, you just need to pass unpack as the unpackFn property to
 * the replayer.
 * 
 * The official unpack function can process both non-packed and packed events.
 * Keeping this compatibility is strongly recommended if you are going to
 * implement your own packer.
 */
const player = new Replayer(events, {
  root: document.body,
  unpackFn: unpack
})
player.play()

/**
 * Since we say 'official', it means you can also implement your own pack/unpack functions.
 * For example, you can pack the data by replacing the 'type' property name with a shorter
 * one like 't'.
 *
 * Your unpack function then needs to restore the event to a valid rrweb event schema.
 */
function myPack(event) {
  event.t = event.type
  delete event.type
  return event
}

function myUnpack(event) {
  event.type = event.t
  delete event.t
  return event
}

I plan to merge the packer PR tomorrow; any feedback is welcome.

@MaheshCasiraghi

@Yuyz0112 I believe this API proposal for the pack plugin looks great. Have you committed it somewhere where I can test it out?

@Yuyz0112
Member

Yuyz0112 commented Apr 7, 2020

@Yuyz0112 Yuyz0112 closed this as completed Apr 7, 2020
@Yuyz0112 Yuyz0112 unpinned this issue Apr 12, 2020
@eoghanmurray
Contributor Author

Is there any documentation on how to disable pako?

It's being bundled now in my dist/rrweb.js (122kb before, 384kb after).

@Yuyz0112
Member

@eoghanmurray Are you using a bundler like webpack?
Or are you using a script tag to load the code?

@eoghanmurray
Contributor Author

We are concatenating dist/rrweb.min.js into the deliverable, with no further packing/bundling.
rrweb.min.js is 93K vs. 42K before.

@eoghanmurray
Contributor Author

eoghanmurray commented Apr 14, 2020

I would have thought that if I wanted the packing capabilities I'd use the new dist/packer/rrweb-pack*.js versions?

@Yuyz0112
Member

Since you are using dist/rrweb.min.js, do you need both record and replay features at the same time in your app?

@eoghanmurray
Contributor Author

No; sorry, to clarify: we're concatenating rrweb-record.min.js into the deliverable that goes out to the web, and using rrweb.min.js for playback (which doesn't actually need any recording capabilities, but file size is not so important at playback time).

@eoghanmurray
Contributor Author

I forgot to mention that including the current dist/rrweb.js (when used for playback) gives a ReferenceError: pako_deflate is not defined on the last line of the file: }({},pako_deflate,pako_inflate));
So maybe I'm missing something important, but just wanted to check whether that's the expected compilation output?

@shmilyoo

I have previously used pako for data compression and transferred the data to a Node backend for decompression, also using pako. The compression and decompression functions were pako.deflate / pako.inflate.
But when I rewrote my backend in Golang, there were some issues that forced me to use pako.deflateRaw for compression and to generate byte streams.
So I'm concerned that having rrweb provide the pack function not only increases the bundle size, but also limits the user's flexibility in compressing the data, and may cause unpredictable problems when adapting to backends in different languages.

@Yuyz0112
Member

@eoghanmurray @shmilyoo

I think there are two things that could be explained here.

The bundle size

rrweb provides three kinds of module-system bundles as output: IIFE (dist/), CommonJS (lib/), and ES module (es/).

With the ES-module bundle, modern JS bundlers like webpack and rollup can do an optimization called tree-shaking during bundling. Tree-shaking removes unused code from the final bundle.

So if people use these bundlers with the ES-module build of rrweb and do not use the packer features, the final bundle size will not increase.

A sample code looks like this:

// if you do not import pack and unpack functions, they will not be bundled into your final JS file.
import { record, Replayer } from 'rrweb'

record()

new Replayer()

What about users not using ES modules?

If users are not using ES modules, especially users who just use a script tag to load rrweb (which is also my favorite way), they cannot benefit from tree-shaking.

So I provide some other bundle files for different use cases. For example, there is a bundle file called rrweb-record.js which only has the record-side code, without any replayer code. This file is suitable for loading on an end-user's website for collecting events.

But as features grow, there are too many bundle combinations. So currently I provide these:

  1. rrweb-record.js, which contains only the record code.
  2. rrweb-pack.js, which contains only the pack code.
  3. rrweb.js, which contains all the features.

If you want to bundle rrweb into your website_1 for collecting events and use your website_2 for replaying, you can load rrweb-record.js in website_1 and rrweb.js in website_2.

If you also want to pack events when recording, you can load the additional script rrweb-pack.js in website_1 (see the sketch below).
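A rough sketch of that website_1 setup (the rrwebRecord global is the one exposed by rrweb-record.js; the global exposed by rrweb-pack.js is an assumption here):

// after loading rrweb-record.min.js and rrweb-pack.min.js via script tags:
rrwebRecord({
  emit(event) {
    // send the (packed) event to your collector
  },
  packFn: window.rrwebPack && window.rrwebPack.pack, // assumption: whatever global rrweb-pack.js exposes
});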

So when I saw @eoghanmurray's question, I asked: do you need both the record and replayer code on the same page, and do you care about the bundle size of that page?

If you do not really need the replayer code on the same page, things are easy. You can just use rrweb-record.js, which is small.

If you do, I think there are two options:

  1. rrweb provides a new kind of bundle file, called rrweb-core.js, which includes the record and replayer code but no packer code.
  2. Use a bundler to handle this, which is the most flexible way and will tree-shake no matter how many new features rrweb adds.

Custom Pack function

@shmilyoo First, I suggest reading the comments above about bundle size, so we are on the same page that it is possible not to increase the bundle size when you are not loading the official packer plugin.

Furthermore, rrweb's plugin system provides the flexibility to use your own pack/unpack functions like this:

function myPack(event) {
  event.t = event.type
  delete event.type
  return event
}

function myUnpack(event) {
  event.type = event.t
  delete event.t
  return event
}

record({
  emit(event) {
    // event is the result returned by the pack function
  },
  packFn: myPack
})

const player = new Replayer(events, {
  root: document.body,
  unpackFn: myUnpack
})
player.play()

Do you think that solves your problem?

@Yuyz0112
Member

@eoghanmurray

I'm not seeing the reference error. My demo: https://codepen.io/yuyz0112/pen/BaojoMd

@eoghanmurray
Contributor Author

Ah sorry, my bad on that one — I did git fetch but neglected to do git fetch upstream so was compiling from 685951d instead of 0.7.33 :(

@eoghanmurray
Contributor Author

  1. rrweb provides a new kind of bundle file, called rrweb-core.js, which includes the record and replayer code but no packer code.

This would be great for my use case anyhow. How about the idea from 6efc6d9 #158 where different ENV variables can be provided to turn features on/off for custom builds?
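For illustration, an ENV-driven switch in rollup could look roughly like this (purely a sketch, not rrweb's actual config; the entry file names are hypothetical and plugins are omitted):

// rollup.config.js
const withPacker = process.env.PACKER !== 'false';

export default {
  // build a full bundle or a packer-free core bundle depending on the env switch
  input: withPacker ? 'src/index.ts' : 'src/index-core.ts',
  output: {
    file: withPacker ? 'dist/rrweb.js' : 'dist/rrweb-core.js',
    format: 'iife',
    name: 'rrweb',
  },
};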

@eoghanmurray
Contributor Author

(I'd also note that the term pack can mean a few different things, e.g. source file minification; would dist/packer/rrweb-pack.js be better if it lived in dist/record/ and was called, e.g., rrweb-record-and-compress.js or similar?)

@eoghanmurray
Contributor Author

Also, for someone using rrweb.js for the replayer but needing the decompression, would it be cleaner if they were required to separately include e.g. https://cdn.jsdelivr.net/npm/[email protected]/dist/pako.js?

@Yuyz0112
Member

@eoghanmurray

I have done some similar things in the rollup config (https://github.com/rrweb-io/rrweb/blob/master/rollup.config.js#L30-L46). Adding another bundle mode is now much easier.

Also, for someone using rrweb.js for the replayer but needing the decompression, would it be cleaner if they were required to separately include e.g. https://cdn.jsdelivr.net/npm/[email protected]/dist/pako.js?

I don't think exposing pako to rrweb users is a great idea, because we may not stick with pako forever, and different usages of pako may cause problems that are hard to debug. So a simple wrapper over the pako interface may be more stable for rrweb users.

@eoghanmurray
Contributor Author

we may not stick to pako forever

hmm, now that you are including pako in the core, I imagine that recordings will begin to be created with it.

  1. rrweb.js, contains all the features.

If this is the goal, then the main rrweb.js dist file will always need to include pako if it is to be able to replay all types of recordings!

@eoghanmurray
Contributor Author

eoghanmurray commented Apr 15, 2020

I have done some similar things to the rollup config

I tried yesterday to create a dist/rrweb.js output that omitted pako, but was not able to do it after a few hours (I was getting compilation errors, e.g. typings/index.d.ts trying to import things from src/packer/*, which I was trying to expunge). Is this something that is possible to do by only changing package.json & rollup.config.js?

@Yuyz0112
Member

we may not stick to pako forever

hmm, now that you are including pako in the core, I imagine that recordings will begin to be created with it.

  1. rrweb.js, contains all the features.

If this is the goal, then the main rrweb.js dist file will always need to include pako if it is to be able to replay all types of recordings!

There are two choices:

  1. rrweb.js with all the features, and rrweb-core.js with only record and replay.
  2. rrweb.js with only record and replay, and rrweb-packer.js with the packer plugin.

So the second one sounds better?

@shmilyoo

shmilyoo commented Apr 15, 2020

@Yuyz0112 Using a custom pack function is absolutely OK.
My project already uses an external pako library, so I don't need to change the code.
Good job!

@eoghanmurray
Contributor Author

eoghanmurray commented Apr 15, 2020

So the second one sounds better?

Yes, for me anyhow; but I'd also be happy with an ENV switch which could remove pako from the dist output.

(An ENV variable which could omit all recording functionality, for a purely 'replayer' version, would also be super, now that I think of it!)

@Yuyz0112 Yuyz0112 mentioned this issue Apr 15, 2020
@Yuyz0112
Member

@eoghanmurray

I've made a new PR for the bundle changes: #199

I think we can move further discussion there.
