Consider new format for passing data in #6696

benmccann · 2019-11-05T19:37:05Z

Splitting off this discussion from #6576 where it started.

A few goals:

Be able to skip parsing if the user passes in data in the desired format. E.g. if the user passes in data as numbers (not string or moment) then we should be able to use that data directly. This currently takes about 7% of the time on the uPlot benchmark, but should be able to be skipped
Make determineDataLimits (and getAllTimestamps for the time scale) faster on the x axis. We should just be able to look at the first and last data point. Right now we have to spend substantial time going through each data point of each data series looking for the min and max. This currently takes about 20% of the time on the uPlot benchmark, but should be instantaneous

We could use an array of objects. This option is a bit more self-describing and better handles sparse inputs.

data: [
 {x: 1572981786, y1: 12, y2: 33, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
 {x: 1572591743, y1: 33, y2: 22, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
 {x: 1572161732, y1: 11, y2: 11, o: 23.04, h: 24.01, l: 22.08, c: 23.01}
],
series: [
  {
    type: 'line',
    label: 'foo',
    data: { y: 'y1' },
    order: 10
  }, 
  {
    type: 'bar',
    label: 'bar',
    data: { y: 'y2' },
    order: 5
  },
  {
    type: 'financial', // not specifying data here because using the controller default names
    label: 'voo',
    order: 3
  }
],
scales: {
  x: {
    type: 'time',
    parser: 'YYYY-MM-DD'
  },
  y: {
    type: 'linear'
  }
}

Or an array of arrays. This option would be smaller data transfer over the wire. It also would avoid the user having to care about names (e.g. keeping track of an incrementer to create 'y1', 'y2', etc. and possibly parsing them back out)

data: [
 [1572981786, 12, 33, 23.04, 24.01, 22.08, 23.01],
 [1572591743, 33, 22, 23.04, 24.01, 22.08, 23.01],
 [1572161732, 11, 11, 23.04, 24.01, 22.08, 23.01]
],
series: [
  {
    type: 'line', // not specifying data here because defaulting to x: 0 and y: 1
    label: 'foo',
    order: 10
  }, 
  {
    type: 'bar',
    label: 'bar',
    data: { y: 2 },
    order: 5
  }
  financial: {
    type: 'financial',
    label: 'voo',
    data: { o: 3, h: 4, l: 5, c:6 },
    order: 3
  }
],
scales: {
  x: {
    type: 'time',
    parser: 'YYYY-MM-DD'
  },
  y: {
    type: 'linear'
  }
}

I was checking out a couple other libraries to see what they do. HighCharts, GoogleCharts, and uPlot (the names that were at the top of my head) all appear from my cursory glance to want data as an array of arrays and have parsing completely separated from controllers and scales as a step that happens before chart creation. HighCharts and GoogleCharts both give tools to manage parsing and other data manipulations.

The text was updated successfully, but these errors were encountered:

leeoniya · 2019-11-05T23:34:44Z

all appear from my cursory glance to want data as an array of arrays

it's worth noting that uPlot's data format is columnar rather than record/row-based so that it does not require allocating 1e6 tiny arrays for 1e6 datapoints, and avoids duplicating x values for each series by requiring them all to be x-coalesced. it can support a dense format like this because it's strictly a line plotter. if it had to support something like scatter plots, i'm not sure the format could retain its current density/efficiency.

EDIT: i'd probably use a similarly dense array for scatter, with a max of 5 arrays per series:

[
  [
    [x1,x2,x3,x4],
    [y1,y2,y3,y4],
    [v1,v2,v3,v4],
    [label1,label2,label3,label4],
  ],
  [
    [x5,x6,x7],
    [y5,y6,y7],
    [v5,v6,v7],
    [label5,label6,label7],
  ]
]

kurkle · 2019-11-10T09:56:29Z

It might be more intuitive to use map (or something else) instead if data in series:

data: [
 {x: 1572981786, y1: 12, y2: 33, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
 {x: 1572591743, y1: 33, y2: 22, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
 {x: 1572161732, y1: 11, y2: 11, o: 23.04, h: 24.01, l: 22.08, c: 23.01}
],
series: [
  {
    type: 'line',
    label: 'foo',
    map: { y: 'y1' },
    order: 10
  }, 
  {
    type: 'bar',
    label: 'bar',
    map: { y: 'y2' },
    order: 5
  }
  financial: {
    type: 'financial', // not specifying data here because using the controller default mapping
    label: 'voo',
    order: 3
  }
],
scales: {
  x: {
    type: 'time',
    parser: 'YYYY-MM-DD'
  },
  y: {
    type: 'linear'
  }
}

benmccann · 2019-11-12T22:07:06Z

map is a bit generic. Maybe we could find a slightly more descriptive term?

Passing data in column-wise is a pretty interesting idea. I imagine that would also make better use of memory locality.

Passing in data at the controller / series level probably better supports the sparse scatter chart case. It would also better support the case where there are multiple x-axes (it's unclear to me if that's an actual usecase that would occur, but I suppose it's slightly more flexible)

const data = [
 [1572981786, 1572591743, 1572161732],
 [12, 33, 11],
 [33, 22, 11],
 [23.04, 23.04, 23.04],
 [22.08, 22.08, 22.08],
 [23.01, 23.01, 23.01],
];

new Chart({series: [
  {
    type: 'line',
    label: 'foo',
    data: [
      [1572947586, 1572594965],
      [42, 31],
    ],
    order: 10
  }, 
  {
    type: 'bar',
    label: 'bar',
    data: data,
    input: { y: 2 },
    order: 5
  },
  {
    type: 'financial',
    label: 'voo',
    data: data,
    input: { o: 3, h: 4, l: 5, c:6 },
    order: 3
  }
],
scales: {
  x: {
    type: 'time',
    parser: 'YYYY-MM-DD'
  },
  y: {
    type: 'linear'
  }
}});

benmccann · 2019-11-13T00:20:21Z

Another benefit of passing data in column-wise is that scales are much better able to utilize it. E.g. the time scale needs an array of the index values. This would be trivial if data were passed in column-wise

benmccann · 2019-11-14T21:51:35Z

Another thought that I had on this is that, now that parsing is confined solely to the controller, each controller could specify their own format. I think for the core library we could probably use one format exclusively, but if someone wanted to build a custom controller I think it would probably be possible for them to specify their own data format

kurkle · 2019-11-14T22:27:11Z

In many use cases, data comes from db, where is is usually in rows. Filling missing values can be annoying.
Plotting two series, one with 1 ms precision and other 1 month precision to same chart with columnar data would be a quite silly (assuming both are configured in same array of arrays).

So I don't quite like the columnar data spec for this kind of configuration diversity Chart.js already offers.

leeoniya · 2019-11-14T22:46:25Z

you don't technically need to go whole-hog-columnar like uPlot does. there's still a good amount of savings by doing columnar-per-series (as the scatter variant i pondered above: #6696 (comment)).

In many use cases, data comes from db, where is is usually in rows.

this is definitely true. but even if you have to spend a bit of time and mem to convert the data up-front, you could potentially release the original structure and have a significantly smaller retained mem footprint. however, this mainly has an out-sized effect on data-heavy line & time-series charts, with scatter a far second, and the rest a very distant third.

benmccann · 2019-11-14T23:05:03Z

I tend to think that keeping datasets at the series level would be easier for the user, independent of whether we choose to go columnar or object/rows. E.g. if you have a dataset with month precision and a dataset with year precision and want to plot them both, they're not necessarily in the same dataset. So to have a single dataset at the chart level you'd have to do some type of join operation (Google Charts actually provides a utility for doing joins).

If we do decide to go with per-controller datasets then you don't have to worry about filling missing values, etc. so there's not much downside to columnar datasets

kurkle · 2019-11-15T04:32:38Z

Actually we are kind of moving away from columnar already, but still support it:

labels: ['a','b','c'],
datasets: [{ data: [1,2,3]}, {data: [4,5,6]}]

is equivalent to:

data: [
['a','b','c']
[1,2,3]
[4,5,6]
]

@leeoniya consider a regression line on a typical uPlot. its not efficiently stored.

benmccann · 2019-11-15T04:52:45Z

The second example you just gave I think is much nicer from a user perspective because you can send a single data structure back from the server and don't need to deconstruct it and pass the input to the right places

I think the reason you're suggesting uPlot doesn't store regression lines efficiently is that it must specify data for every index, but really you need only two points to store a line. However, if the dataset is passed in at the series-level, so that you can still have multiple datasets then I think columnar formats do not have this drawback. For the regression you could just pass a single 2x2 array: [[x1,x2],[y1,y2]]

leeoniya · 2019-11-15T05:18:53Z

in my view, regression lines are neither data nor a series; they're more akin to annotations, so i would not personally store them alongside the actual data anyways - they'd be drawn separately if e.g. r2: "linear" was specified in the series' config.

uPlot actually bounced back and forth between storing the data inside vs outside the series config before settling on keeping it separate basically for the reason that @benmccann mentions - you can stream data from the server and pass it directly to some .setData() in the same format as provided initially without having to wrap it in a series config struct.

kurkle · 2019-11-15T07:16:28Z

you can stream data from the server and pass it directly to some .setData() in the same format as provided initially without having to wrap it in a series config struct.

I kind of agree.

I think @benmccann is getting in good direction in #6696 (comment).

I'd extend a bit more (you can still use array and indices instead of object and properties):

const data = {
 time: [1572981786, 1572591743, 1572161732],
 time2: [1572981786, 1572161732],
 bar: [12, 33, 11],
 line: [1, 2],
 o: [33, 22, 11],
 h: [23.04, 23.04, 23.04],
 l: [22.08, 22.08, 22.08],
 c: [23.01, 23.01, 23.01],
 scatter: [{x: 10, y: 20}, {x: 20, y: -2}]
};

new Chart({
  data: data,
  series: [
  {
    type: 'line',
    data: [
      [1572947586, 1572594965],
      [42, 31],
    ],
    input: { x: 0, y: 1}
  }, 
  {
    type: 'line',
    data: [
      x: [1572947586, 1572594965],
      y: [42, 31],
    ],
  }, 
  {
    type: 'line',
    data: [{x: 1572947586, y: 42}, { x: 1572594965, y: 31}],
  }, 
  {
    type: 'bar',
    input: { x: 'time', y: 'bar' },
  },
  {
    type: 'bar',
    input: { x: 'time2', y: 'line' },
  },
  {
    type: 'bar',
    data: [1, 2, 3],
    axes: { x: 'cat' }
  },
  {
    type: 'bar',
    data: { x: ['before', 'in between', after'], y: [ 5, 4, 3]},
    axes: { x: 'cat' }
  },
  {
    type: 'scatter',
    input: 'scatter'
  }
  {
    type: 'financial',
    data: data, 
    input: { x: 'time' }, // o, h, c and l would be defaults
  }
],
scales: {
  x: {
    type: 'time',
  },
  y: {
    type: 'linear'
  },
  cat: {
    type: 'category',
    labels: ['a', 'b', 'c'],
    position: 'top'
  },
  ord: {
    type: 'ordinal',
    position: 'bottom'
  }
}});

benmccann · 2019-11-15T17:28:15Z

@kurkle I like your suggestion. I'm not sure I've thought through all the edge cases yet, but my initial reaction is that it should work well.

If we accept input in multiple formats, we'll still have to store it internally in a single format after parsing. What do you think about using columnar storage internally?

kurkle · 2019-11-15T17:59:02Z

It really does not make sense to me, but I might be biased. Would have to make a draft to work out the pros and cons.

etimberg · 2019-11-15T18:26:49Z

One thing I like about the current format is that it is very easy to understand. Perhaps we can do an internal transformation at the start of create / update. We could parse out min, max, etc and keep it as meta data. That would make the internal uses easier.

As an aside, I don't entirely follow what the o, l, h, and c properties are in the examples above

kurkle · 2019-11-15T18:33:29Z

open, high, low and close (ohlc)

benmccann · 2019-11-18T05:59:14Z

After playing around a bit and thinking about this a little more, I tend to think that storing the data column-wise internally wouldn't be much of an advantage.

benmccann · 2019-11-18T06:07:31Z

I do wonder though whether we should store each row as an array or an object

Arrays would be smaller to transfer over the wire. I expect they'd also take less memory

Surprisingly when I did a small benchmark for access only it seemed like an objects won out: https://jsperf.com/chartjs-internal-storage I wouldn't have expected that to be the case

benmccann · 2019-12-14T18:55:09Z

I'm closing this since we seem to be in a pretty good place now in terms of performance

benmccann added type: enhancement breaking change labels Nov 5, 2019

benmccann added this to the Version 3.0 milestone Nov 5, 2019

etimberg mentioned this issue Nov 5, 2019

State of Chart.js & V3.0 #6598

Closed

benmccann mentioned this issue Nov 7, 2019

determineDataLimits optimizations #6695

Merged

This was referenced Nov 18, 2019

Option to specify dataset has been normalized #6765

Closed

Store parsed data more similarly to provided data #6766

Closed

benmccann closed this as completed Dec 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider new format for passing data in #6696

Consider new format for passing data in #6696

benmccann commented Nov 5, 2019 •

edited

Loading

leeoniya commented Nov 5, 2019 •

edited

Loading

kurkle commented Nov 10, 2019

benmccann commented Nov 12, 2019 •

edited

Loading

benmccann commented Nov 13, 2019

benmccann commented Nov 14, 2019

kurkle commented Nov 14, 2019

leeoniya commented Nov 14, 2019

benmccann commented Nov 14, 2019 •

edited

Loading

kurkle commented Nov 15, 2019

benmccann commented Nov 15, 2019

leeoniya commented Nov 15, 2019 •

edited

Loading

kurkle commented Nov 15, 2019

benmccann commented Nov 15, 2019

kurkle commented Nov 15, 2019

etimberg commented Nov 15, 2019

kurkle commented Nov 15, 2019

benmccann commented Nov 18, 2019

benmccann commented Nov 18, 2019

benmccann commented Dec 14, 2019

Consider new format for passing data in #6696

Consider new format for passing data in #6696

Comments

benmccann commented Nov 5, 2019 • edited Loading

leeoniya commented Nov 5, 2019 • edited Loading

kurkle commented Nov 10, 2019

benmccann commented Nov 12, 2019 • edited Loading

benmccann commented Nov 13, 2019

benmccann commented Nov 14, 2019

kurkle commented Nov 14, 2019

leeoniya commented Nov 14, 2019

benmccann commented Nov 14, 2019 • edited Loading

kurkle commented Nov 15, 2019

benmccann commented Nov 15, 2019

leeoniya commented Nov 15, 2019 • edited Loading

kurkle commented Nov 15, 2019

benmccann commented Nov 15, 2019

kurkle commented Nov 15, 2019

etimberg commented Nov 15, 2019

kurkle commented Nov 15, 2019

benmccann commented Nov 18, 2019

benmccann commented Nov 18, 2019

benmccann commented Dec 14, 2019

benmccann commented Nov 5, 2019 •

edited

Loading

leeoniya commented Nov 5, 2019 •

edited

Loading

benmccann commented Nov 12, 2019 •

edited

Loading

benmccann commented Nov 14, 2019 •

edited

Loading

leeoniya commented Nov 15, 2019 •

edited

Loading