Consider new format for passing data in #6696
it's worth noting that uPlot's data format is columnar rather than record/row-based so that it does not require allocating 1e6 tiny arrays for 1e6 datapoints, and avoids duplicating x values for each series by requiring them all to be x-coalesced. it can support a dense format like this because it's strictly a line plotter. if it had to support something like scatter plots, i'm not sure the format could retain its current density/efficiency.
EDIT: i'd probably use a similarly dense array for scatter, with a max of 5 arrays per series:
[
[
[x1,x2,x3,x4],
[y1,y2,y3,y4],
[v1,v2,v3,v4],
[label1,label2,label3,label4],
],
[
[x5,x6,x7],
[y5,y6,y7],
[v5,v6,v7],
[label5,label6,label7],
]
] |
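To make the row-versus-column trade-off above concrete, here is a minimal sketch of converting row records into one dense array per field. The helper name `toColumnar` and the field names are hypothetical; this is not uPlot's or Chart.js's actual code, only an illustration of the layout being described.

```javascript
// Hypothetical helper (not part of uPlot or Chart.js): turn row-based points
// into one dense array per field, so N points cost a handful of arrays
// instead of N small objects.
function toColumnar(rows, fields) {
  // One pre-sized array per requested field.
  const columns = fields.map(() => new Array(rows.length));
  for (let i = 0; i < rows.length; i++) {
    for (let f = 0; f < fields.length; f++) {
      columns[f][i] = rows[i][fields[f]];
    }
  }
  return columns;
}

const rows = [
  { x: 1, y: 10, v: 5, label: 'a' },
  { x: 2, y: 20, v: 6, label: 'b' },
  { x: 3, y: 30, v: 7, label: 'c' },
];

// One series in the per-series columnar shape: [xs, ys, vs, labels]
const series = toColumnar(rows, ['x', 'y', 'v', 'label']);
console.log(series);
```

Once converted, the original `rows` array can be released, which is where the retained-memory saving would come from.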
It might be more intuitive to use
data: [
{x: 1572981786, y1: 12, y2: 33, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
{x: 1572591743, y1: 33, y2: 22, o: 23.04, h: 24.01, l: 22.08, c: 23.01},
{x: 1572161732, y1: 11, y2: 11, o: 23.04, h: 24.01, l: 22.08, c: 23.01}
],
series: [
{
type: 'line',
label: 'foo',
map: { y: 'y1' },
order: 10
},
{
type: 'bar',
label: 'bar',
map: { y: 'y2' },
order: 5
},
{
type: 'financial', // not specifying data here because using the controller default mapping
label: 'voo',
order: 3
}
],
scales: {
x: {
type: 'time',
parser: 'YYYY-MM-DD'
},
y: {
type: 'linear'
}
} |
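A rough sketch of how the proposed `map` option could be resolved against shared row data. `resolveSeries` and its default mapping are assumptions made up for illustration; nothing like this exists in Chart.js under that name.

```javascript
// Hypothetical resolver for the proposed `map` option. Each series names the
// row keys it wants; anything not mapped falls back to the defaults.
function resolveSeries(rows, seriesConfig, defaults = { x: 'x', y: 'y' }) {
  const map = { ...defaults, ...seriesConfig.map };
  return rows.map((row) => {
    const point = {};
    for (const [prop, key] of Object.entries(map)) {
      point[prop] = row[key];
    }
    return point;
  });
}

const sharedRows = [
  { x: 1572981786, y1: 12, y2: 33 },
  { x: 1572591743, y1: 33, y2: 22 },
];

// Two series reading different columns of the same rows.
const linePoints = resolveSeries(sharedRows, { type: 'line', map: { y: 'y1' } });
const barPoints = resolveSeries(sharedRows, { type: 'bar', map: { y: 'y2' } });
console.log(linePoints, barPoints);
```

A controller with its own default mapping (like the financial example) would pass different `defaults` rather than requiring `map` from the user.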
Passing data in column-wise is a pretty interesting idea. I imagine that would also make better use of memory locality. Passing in data at the controller / series level probably better supports the sparse scatter chart case. It would also better support the case where there are multiple x-axes (it's unclear to me if that's an actual use case that would occur, but I suppose it's slightly more flexible). |
Another benefit of passing data in column-wise is that scales are much better able to utilize it. E.g. the time scale needs an array of the index values. This would be trivial if data were passed in column-wise |
Another thought that I had on this is that, now that parsing is confined solely to the controller, each controller could specify their own format. I think for the core library we could probably use one format exclusively, but if someone wanted to build a custom controller I think it would probably be possible for them to specify their own data format |
In many use cases, data comes from a db, where it is usually in rows. Filling missing values can be annoying. So I don't quite like the columnar data spec, given the kind of configuration diversity Chart.js already offers. |
you don't technically need to go whole-hog-columnar like uPlot does. there's still a good amount of savings by doing columnar-per-series (as the scatter variant i pondered above: #6696 (comment)).
this is definitely true. but even if you have to spend a bit of time and mem to convert the data up-front, you could potentially release the original structure and have a significantly smaller retained mem footprint. however, this mainly has an out-sized effect on data-heavy line & time-series charts, with scatter a far second, and the rest a very distant third. |
I tend to think that keeping datasets at the series level would be easier for the user, independent of whether we choose to go columnar or object/rows. E.g. if you have a dataset with month precision and a dataset with year precision and want to plot them both, they're not necessarily in the same dataset. So to have a single dataset at the chart level you'd have to do some type of join operation (Google Charts actually provides a utility for doing joins). If we do decide to go with per-controller datasets then you don't have to worry about filling missing values, etc. so there's not much downside to columnar datasets |
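For reference, the join that a single chart-level dataset would force looks roughly like this. `joinOnX` is a hypothetical helper, and filling gaps with `null` is an assumption, not something the thread settles on.

```javascript
// Hypothetical join: merge several [xs, ys] series onto one shared, sorted
// x column, filling values a series does not have with null.
function joinOnX(seriesList) {
  const xs = [...new Set(seriesList.flatMap(([sx]) => sx))].sort((a, b) => a - b);
  const indexOf = new Map(xs.map((x, i) => [x, i]));
  const columns = seriesList.map(([sx, sy]) => {
    const col = new Array(xs.length).fill(null);
    sx.forEach((x, i) => {
      col[indexOf.get(x)] = sy[i];
    });
    return col;
  });
  return [xs, ...columns];
}

// A finer-grained series and a coarser one covering the same range.
const monthly = [[1, 2, 3, 4], [10, 11, 12, 13]];
const yearly = [[1, 4], [100, 400]];

const joined = joinOnX([monthly, yearly]);
console.log(joined);
```

With per-series datasets this step, and the `null` padding it introduces, is not needed at all.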
Actually we are kind of moving away from columnar already, but still support it:
labels: ['a','b','c'],
datasets: [{ data: [1,2,3]}, {data: [4,5,6]}]
is equivalent to:
data: [
['a','b','c'],
[1,2,3],
[4,5,6]
]
@leeoniya consider a regression line on a typical uPlot. it's not efficiently stored. |
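The equivalence between the two shapes can be written out as a pair of conversions. Both function names are made up for this sketch; only the `labels` / `datasets[].data` shape comes from the existing Chart.js config.

```javascript
// Hypothetical converters between the labels/datasets config and the
// single columnar array described above.
function toColumnarData(config) {
  return [config.labels, ...config.datasets.map((ds) => ds.data)];
}

function fromColumnarData(data) {
  const [labels, ...columns] = data;
  return { labels, datasets: columns.map((col) => ({ data: col })) };
}

const config = {
  labels: ['a', 'b', 'c'],
  datasets: [{ data: [1, 2, 3] }, { data: [4, 5, 6] }],
};

const columnar = toColumnarData(config);
const roundTripped = fromColumnarData(columnar);
console.log(columnar, roundTripped);
```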
The second example you just gave I think is much nicer from a user perspective because you can send a single data structure back from the server and don't need to deconstruct it and pass the input to the right places.
I think the reason you're suggesting uPlot doesn't store regression lines efficiently is that it must specify data for every index, but really you need only two points to store a line. However, if the dataset is passed in at the series level, so that you can still have multiple datasets, then I think columnar formats do not have this drawback. For the regression you could just pass a single 2x2 array: |
in my view, regression lines are neither data nor a series; they're more akin to annotations, so i would not personally store them alongside the actual data anyways - they'd be drawn separately if e.g. uPlot actually bounced back and forth between storing the data inside vs outside the series config before settling on keeping it separate basically for the reason that @benmccann mentions - you can stream data from the server and pass it directly to some |
I kind of agree. I think @benmccann is heading in a good direction in #6696 (comment). I'd extend a bit more (you can still use array and indices instead of object and properties):
const data = {
time: [1572981786, 1572591743, 1572161732],
time2: [1572981786, 1572161732],
bar: [12, 33, 11],
line: [1, 2],
o: [33, 22, 11],
h: [23.04, 23.04, 23.04],
l: [22.08, 22.08, 22.08],
c: [23.01, 23.01, 23.01],
scatter: [{x: 10, y: 20}, {x: 20, y: -2}]
};
new Chart({
data: data,
series: [
{
type: 'line',
data: [
[1572947586, 1572594965],
[42, 31],
],
input: { x: 0, y: 1}
},
{
type: 'line',
data: {
x: [1572947586, 1572594965],
y: [42, 31],
},
},
{
type: 'line',
data: [{x: 1572947586, y: 42}, { x: 1572594965, y: 31}],
},
{
type: 'bar',
input: { x: 'time', y: 'bar' },
},
{
type: 'bar',
input: { x: 'time2', y: 'line' },
},
{
type: 'bar',
data: [1, 2, 3],
axes: { x: 'cat' }
},
{
type: 'bar',
data: { x: ['before', 'in between', 'after'], y: [5, 4, 3] },
axes: { x: 'cat' }
},
{
type: 'scatter',
input: 'scatter'
},
{
type: 'financial',
data: data,
input: { x: 'time' }, // o, h, c and l would be defaults
}
],
scales: {
x: {
type: 'time',
},
y: {
type: 'linear'
},
cat: {
type: 'category',
labels: ['a', 'b', 'c'],
position: 'top'
},
ord: {
type: 'ordinal',
position: 'bottom'
}
}}); |
@kurkle I like your suggestion. I'm not sure I've thought through all the edge cases yet, but my initial reaction is that it should work well. If we accept input in multiple formats, we'll still have to store it internally in a single format after parsing. What do you think about using columnar storage internally? |
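As a sketch of what "parse multiple formats, store one" could mean, the following normalizes the three per-series shapes from the proposal (array of columns, object of columns, array of point objects) into a single internal form. The function name is hypothetical, and an array of `{x, y}` points is picked as the target purely for illustration; the thread does not settle on an internal format.

```javascript
// Hypothetical normalizer. Handles only the three x/y shapes shown in the
// proposal above; plain number arrays and named chart-level columns are
// out of scope for this sketch.
function normalizeSeriesData(data, input = { x: 0, y: 1 }) {
  if (Array.isArray(data)) {
    if (data.length > 0 && Array.isArray(data[0])) {
      // Array of columns: `input` holds the column indices for x and y.
      const xs = data[input.x];
      const ys = data[input.y];
      return xs.map((x, i) => ({ x, y: ys[i] }));
    }
    // Array of point objects: already the target shape, so copy the fields.
    return data.map(({ x, y }) => ({ x, y }));
  }
  // Object of columns: { x: [...], y: [...] }
  return data.x.map((x, i) => ({ x, y: data.y[i] }));
}

const fromArrays = normalizeSeriesData([[1572947586, 1572594965], [42, 31]]);
const fromColumns = normalizeSeriesData({ x: [1572947586, 1572594965], y: [42, 31] });
const fromPoints = normalizeSeriesData([{ x: 1572947586, y: 42 }, { x: 1572594965, y: 31 }]);
console.log(fromArrays, fromColumns, fromPoints);
```

All three calls produce the same internal points, so everything downstream of parsing only has to understand one shape.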
It really does not make sense to me, but I might be biased. Would have to make a draft to work out the pros and cons. |
One thing I like about the current format is that it is very easy to understand. Perhaps we can do an internal transformation at the start of create / update. We could parse out min, max, etc and keep it as meta data. That would make the internal uses easier. As an aside, I don't entirely follow what the |
open, high, low and close (ohlc) |
After playing around a bit and thinking about this a little more, I tend to think that storing the data column-wise internally wouldn't be much of an advantage. |
I do wonder though whether we should store each row as an array or an object. Arrays would be smaller to transfer over the wire. I expect they'd also take less memory. Surprisingly, when I did a small benchmark for access only, it seemed like objects won out: https://jsperf.com/chartjs-internal-storage I wouldn't have expected that to be the case. |
I'm closing this since we seem to be in a pretty good place now in terms of performance |
Splitting off this discussion from #6576 where it started.
A few goals:
string or moment) then we should be able to use that data directly. This currently takes about 7% of the time on the uPlot benchmark, but should be able to be skipped
determineDataLimits (and getAllTimestamps for the time scale) faster on the x axis. We should just be able to look at the first and last data point. Right now we have to spend substantial time going through each data point of each data series looking for the min and max. This currently takes about 20% of the time on the uPlot benchmark, but should be instantaneous

We could use an array of objects. This option is a bit more self-describing and better handles sparse inputs.

Or an array of arrays. This option would be smaller data transfer over the wire. It also would avoid the user having to care about names (e.g. keeping track of an incrementer to create 'y1', 'y2', etc. and possibly parsing them back out)

I was checking out a couple other libraries to see what they do. HighCharts, GoogleCharts, and uPlot (the names that were at the top of my head) all appear from my cursory glance to want data as an array of arrays and have parsing completely separated from controllers and scales as a step that happens before chart creation. HighCharts and GoogleCharts both give tools to manage parsing and other data manipulations.
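The x-axis limits goal above can be sketched as follows, assuming the x column is already sorted ascending and non-empty. Both helpers are hypothetical and are not the actual determineDataLimits implementation.

```javascript
// Hypothetical helpers contrasting the two ways of finding axis limits.

// Full scan: visits every value, which is what is needed when nothing
// is known about ordering.
function limitsByScan(values) {
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  return { min, max };
}

// Assumption: xs is sorted ascending and non-empty, so the limits are
// simply the first and last entries.
function limitsFromSorted(xs) {
  return { min: xs[0], max: xs[xs.length - 1] };
}

const timestamps = [1572161732, 1572591743, 1572981786];
const scanned = limitsByScan(timestamps);
const direct = limitsFromSorted(timestamps);
console.log(scanned, direct);
```

The shortcut is only valid if the input really is sorted, so it would need to be either a documented requirement or something checked once at parse time.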