Data source size #36

Closed
projecthate opened this issue Sep 14, 2013 · 33 comments

@projecthate

Hi, I'm having trouble implementing a rather large data source through the pivot table. The source is currently about 300,000 rows with 12 columns.

As I begin to increase the number of rows I'm selecting, it gets slower and slower, eventually not loading at all. I mean, the data is returned from the server, but the page is just blank. Here are some numbers:
Records    Response size    Response time
1,000      1.12MB           581ms
10,000     2.4MB            939ms
15,000     3.6MB            1.46s
50,000     12.0MB           4.76s

(Measured in Chrome x64.)

Anything after 15,000 does not render a chart or table at all.

I see someone posted a similar issue in #29 and said he resolved it by breaking his ajax requests apart. But I'm not really sure how to do this. I can't find anything in the documentation that lets data be loaded piecemeal after the initial instantiation of the table. Is this possible?

@nicolaskruchten
Owner

I've definitely used PivotTable.js with up to 200k rows of input data, and it does slow down at that point, but it's certainly quite snappy in the 25k-50k row range. The time it takes to generate a table varies with the number of records and with the size of the table (i.e. the cardinality of the attributes you use as rows and columns): if you are making a table larger than around 200x200, things will definitely start to take a while.

The issue in #29 was with actually loading the data up into memory before passing it into the pivotUI() call, I believe, and was fixed by batching the data across the wire. PivotTable.js doesn't support incrementally inputting data, as that wouldn't help much with performance under the current architecture.

What does your data look like and what sort of table are you trying to generate?

@projecthate
Author

[{"ID": "1","Geography": "United States","Disease Area": "","Valu Version": "","Product": "","Type": "A. Population","Variable": "Population","Column": "Country","Value": "USA","Formula": "","For Calc": "","Is Protected": "Y","Format": ""},{"ID": "2","Geography": "United States","Disease Area": "","Valu Version": "","Product": "","Type": "A. Population","Variable": "Population","Column": "YoY Growth (%)","Value": "0.00975","Formula": "","For Calc": "","Is Protected": "N","Format": "0.000%"},{"ID": "3","Geography": "United States","Disease Area": "","Valu Version": "","Product": "","Type": "A. Population","Variable": "Population","Column": "2010","Value": "317640000","Formula": "","For Calc": "","Is Protected": "N","Format": "#,##0"},{"ID": "4","Geography": "United States","Disease Area": "","Valu Version": "","Product": "","Type": "A. Population","Variable": "Population","Column": "2011","Value": "320736990","Formula": "C15+(C15_B15)","For Calc": "","Is Protected": "Y","Format": "#,##0"},{"ID": "5","Geography": "United States","Disease Area": "","Valu Version": "","Product": "","Type": "A. Population","Variable": "Population","Column": "2012","Value": "323864175.6525","Formula": "D15+(D15_B15)","For Calc": "","Is Protected": "Y","Format": "#,##0"}]

All kinds of tables. When it works, the code I'm using lets me pick the renderer from a dropdown with these values: Table, Table Barchart, Heatmap, Row Heatmap, Col Heatmap, Line Chart, Bar Chart, Area Chart.

I understand it will take longer to render more data, but I've walked away for over five minutes on 20k and nothing has displayed. Yet the data was transferred within seconds...

@nicolaskruchten
Owner

That's very strange, I can't replicate this behaviour on my end: it's working fine right now with 25k of data in my browser.

What OS/browser versions are you running?

If you look in the javascript debug console are you seeing any helpful output or error messages?

Is your data proprietary or can you send me a to-scale sample of what you're running into problems with?

@projecthate
Author

Very proprietary.

Windows 7 x64 in the latest versions of Firefox and Chrome, tested on three different machines. No errors; the network panel shows the data loads in reasonable time, the pivot table just isn't rendered.

@nicolaskruchten
Owner

Hi again,

OK, so I've generated a random file to test with, it's here: https://www.dropbox.com/s/05yifl75chcws5y/bigdata.csv

It's 20k lines of 26 columns of random letters: each attribute has a cardinality of 26. If I browse to the example which uses in-browser CSV parsing here http://nicolaskruchten.github.io/pivottable/examples/local.html I'm able to load it up and make some fairly large 26x26 tables and charts very quickly and 26x26x26 tables and charts in around 1.5 seconds. Admittedly when I make tables that have two attributes in the columns and two in the rows (i.e. 26x26x26x26 cells) things slow down a LOT but it does eventually render a table.

Are you seeing different behaviour on your end? Does your data have much higher cardinality than 25 or so? How many cells are you expecting to see in the final table which is not generating correctly?

@nicolaskruchten
Owner

Just following up... Did you get a chance to try with my sample data?

@projecthate
Author

Hi Nicolas,

I was able to test this dataset this morning, and although it loaded smoothly, I do not think it can be considered a valid test case, since you're changing multiple variables: content and data size. Your data takes up less than a quarter of the space of mine, so I would assume this has an impact on how the pivot table is able to load the data into memory before rendering.

I'm not really sure what you're asking: "Are you seeing different behaviour on your end? Does your data have much higher cardinality than 25 or so? How many cells are you expecting to see in the final table which is not generating correctly?"

  1. No, I see that same behavior on my end, with your data, just not with mine.
  2. I don't know the data that well, but I would have to think the cardinality is pretty low.
  3. Well since the final table is never even visible, I would say 100% of the cells are not generating correctly...

@nicolaskruchten
Owner

Hi again,

OK, I've regenerated the CSV file, it's at the same URL as before but now it's 7.5MB instead of 1MB and it's 50k lines instead of 20k lines. Still works well on my end... You too?

My third question wasn't about how many cells were generating correctly, I get that you're seeing 0 :) My question was more: if you saw what you were expecting, how many cells would that have?

Let me know, I'd love to help find a solution!

Cheers,
Nicolas

@projecthate
Author

It still works for me too, despite sometimes giving a javascript error when the table gets too big.

I'm not sure how to answer that. When I am able to load the pivot table I'm not using any default parameters, the pivot table is instantiated with no columns or rows set, so I guess I expect to see 0?

@nicolaskruchten
Owner

Hmm OK, in that case I'm sort of stumped.

Is the sample data file comparable to yours in terms of size now (number of rows, number of attributes, cardinality of attributes, total size in megabytes)?

@projecthate
Author

Sure, it has more rows than I've been able to load so far, it has more columns than I'm trying to use, I think my cardinality must be lower, and it's smaller in megabytes. The only differences I can see are content (you have a single letter per record... obviously this is easier for the JavaScript to parse) and formatting (JSON vs CSV).

@nicolaskruchten
Owner

So the attribute values aren't parsed unless some aggregator requires parsing. The default aggregator is count which doesn't parse so I doubt that's the issue. And right now I have 5 letters per attribute, just to make things bigger.

Perhaps it's just a JSON parsing thing but I've worked with larger JSON objects in the past. I'll try to regenerate my data in JSON and we'll see what's what :)

Is the code you're using similar to the code in one of the examples or is there a lot of other stuff going on as well?

@projecthate
Author

The configuration of the pivot table is pretty basic:

google.load("visualization", "1", {packages: ["corechart", "charteditor"]});

$(function()
{
    var derivers = $.pivotUtilities.derivers;
    // merge the Google Charts renderers into the standard set
    var renderers = $.extend($.pivotUtilities.renderers, $.pivotUtilities.gchart_renderers);

    $.getJSON("mpsAssum.php", function(mps)
    {
        $("#output").pivotUI(mps,
        {
            renderers: renderers,
            cols: ["Product Name"],
            rows: ["Dossier"],
            rendererName: "Table"
        });
    });
});

@nicolaskruchten
Owner

OK. Have you ruled out the possibility that the browser is choking on parsing the incoming JSON? I.e. if you put an alert(mps.length) right before the pivotUI() call, does it work?
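
Something like this, for instance (a minimal sketch based on your snippet above, with the other options omitted):

$.getJSON("mpsAssum.php", function(mps)
{
    // If this alert never appears (or only shows after a very long wait), the time
    // is going into fetching/parsing the JSON, not into the pivotUI() call itself.
    alert(mps.length);
    $("#output").pivotUI(mps, {rendererName: "Table"});
});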

@nicolaskruchten
Owner

Hi, is there any update on this issue?

@projecthate
Author

Sorry Nicolas, we've had to focus on other areas of development, so we are limiting the data. I may revisit a larger data set later and will certainly update you on anything I discover.

Thanks for developing such a great plugin.

@projecthate
Author

Hi Nicolas,

We're taking another pass at getting the PivotTable to accommodate our data volume rather than restricting the input.

I was wondering if you could provide a copy of the dataset used to support the following "Why is it good?" statement: "works acceptably fast in Chrome on commodity hardware up to around a hundred thousand records, depending on the cardinality of the attributes."?

As I began re-testing this, I started where you advised, including an alert before the pivotUI call. In lower-volume sets (below roughly 20,000 rows) the alert appears in a timely fashion; however, when increasing the number of rows toward 50,000, I never get an alert (or only after many, many minutes of staring at a blank page).

I believe this confirms your initial suspicion, that the browser is choking when it tries to parse my JSON, right?

If so, I'm not sure how to proceed. I guess I can accept the fact that I just can't load that many records due to the complexity of my data in the JSON; however, that would be a shame.

A colleague suggested that this problem may be alleviated, at least in part, by moving some of the data processing (especially the aggregation functions) to PHP rather than JavaScript. This will require many more AJAX calls when the user changes the chart type, plotted columns, etc., but I think the performance increase would justify the development time. Did you ever consider or put any time into creating a version that supports some level of server-side processing?

Do aggregated values get calculated when the PivotTable is initialized or when that particular aggregation type is selected?

@nicolaskruchten
Owner

Hi again,

Thanks for getting back to me! It seems from your tests with the alert() calls that the problem is indeed happening outside the pivot library itself, in the data parsing/preparation step...

If the issue is strictly with JSON parsing (rather than with manipulating a very large array in memory), then perhaps you could try the same solution that worked for issue #29: make multiple requests to the server for smaller chunks of JSON-formatted data, which are then parsed separately and merged before being passed to the pivot table. You could also switch to a simpler JSON format: right now you seem to be using the array-of-objects format (based on one of your much earlier answers), but the JSON would be far, far smaller if you used the array-of-arrays format. (Examples here: https://github.com/nicolaskruchten/pivottable/wiki/Input-Formats)
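
For illustration, here is roughly what the difference looks like, using a few of your field names (a sketch, not your full schema):

// array of objects: attribute names are repeated on every record
[
  {"Geography": "United States", "Type": "A. Population", "Value": "317640000"},
  {"Geography": "United States", "Type": "A. Population", "Value": "0.00975"}
]

// array of arrays: attribute names appear only once, in the first row
[
  ["Geography", "Type", "Value"],
  ["United States", "A. Population", "317640000"],
  ["United States", "A. Population", "0.00975"]
]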

Here are my answers to your other questions:

  • The "why is it good?" statements are based on the types of datasets which I've provided to you above.
  • I have not put any thought into something which interacts with a server in any particular way as I want to keep this library strictly client-side.
  • In terms of aggregated values, they are computed during the rendering phase, depending on which aggregation type is selected.

@projecthate
Author

Wow thanks for the fast reply.

Thanks for the suggestion on JSON data formatting; I am going to try converting my data source from array of objects to array of arrays later today to see if that affects performance at all. However, as to your suggestion of chunking the JSON, I will quote my original post: "But I'm not really sure how to do this. I can't find anything in the documentation that lets data be loaded piecemeal after the initial instantiation of the table. Is this possible?" Your reply seemed to indicate that there was no way to load data into the PivotTable in segments, so I did not continue thinking about this. Can you provide any pointers as to how the author of #29 was able to modify the plugin's core to allow for this?

I revisited the cardinality discussion from prior posts and researched the actual meaning a little more. Unfortunately, I think our data has essentially unlimited cardinality, if such a thing is possible: I don't believe any two values in our set are the same, and if they are, it is pure coincidence, as the values are very precise (e.g. they have many decimal places).

@nicolaskruchten
Owner

Since we've narrowed the issue down to JSON parsing, let's leave cardinality issues aside for the moment...

My chunking suggestion, and the solution to issue #29, is not to load the data into the pivot table piecemeal, but rather to parse it piecemeal, merge it, and load it in one go. Compare the following pseudocode:

var input = JSON.parse(string_that_is_too_big) // this takes way too long
//now call x.pivot(input...)

with this:

// step 1: parse
var input_chunk1 = JSON.parse(chunk1_string); // chunk1_string is small enough to work
var input_chunk2 = JSON.parse(chunk2_string); // chunk2_string is small enough to work
//step 2: merge
var input = [].concat(input_chunk1, input_chunk2);
//now call x.pivot(input...)
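
In practice, with jQuery, that might look something like the sketch below (the page parameter on mpsAssum.php is hypothetical; use whatever paging mechanism your server actually supports):

// fetch several smaller JSON chunks in parallel, merge them, then pivot once
var requests = [];
for (var page = 1; page <= 5; page++) {
    requests.push($.getJSON("mpsAssum.php?page=" + page)); // hypothetical paged endpoint
}
$.when.apply($, requests).done(function() {
    var input = [];
    // with more than one request, each argument here is a [data, status, jqXHR] triple
    $.each(arguments, function(i, result) {
        input = input.concat(result[0]);
    });
    $("#output").pivotUI(input, {rendererName: "Table"});
});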

Does this make sense?

@projecthate
Author

Yes, that makes a lot of sense, thanks for the walkthrough. I'm going to try this out later tonight. Do you recommend I try this before changing from array of objects to array of arrays?

I'm going to include some more info I uncovered from my testing:
Rows     JSON size   Fetch time   Render time
1k       149kb       2.2s         2.748s
10k      1.4MB       2.5s         5.828s
15k      2.1MB       3.3s         8.193s
15.1k    2.1MB       2.8s         8.452s
15.2k    2.1MB       2.6s         7.586s
15.5k    2.2MB       2.5s         -----
17.5k    2.4MB       2.6s         -----
20k      2.8MB       2.8s         -----
25k      3.5MB       3.0s         -----
50k      6.9MB       3.8s         -----

(Columns are number of rows, size of the JSON response, time to retrieve the JSON from the server, and time to render the PivotTable after $(document).ready(); ----- indicates the PivotTable never became visible.)

As you can see, as in my OP, some sort of threshold is being reached after 15.2k records. This led me to think that there was some kind of data problem, so I changed some things and found:

20k      2.8MB    2.7s     9.547s
50k      6.9MB    3.8s     29.25s
55.5k    7.6MB    3.93s    34.551s
60k      8.2MB    4.44s    Uncaught RangeError: Maximum call stack size exceeded
70k      9.6MB    4.49s    Uncaught RangeError: Maximum call stack size exceeded
80k      11.0MB   7.12s    Uncaught RangeError: Maximum call stack size exceeded
100k     13.7MB   5.88s    Uncaught RangeError: Maximum call stack size exceeded

(For these runs I stripped " and ' from the value and column data in every column. The RangeError comes with this stack trace:)

m
c.querySelectorAll.m
f.fn.extend.find
(anonymous function)
(anonymous function)
$.fn.pivotUI
(anonymous function)
o
p.fireWith
w
d

As you can see, when I use str_replace in PHP to remove quotes from the values before feeding the JSON to the PivotTable, I am able to load far more records (up to 55.5k); however, the table takes upwards of thirty seconds to render (after jQuery's ready event). Is there anything I can do to optimize this time?

And I've just been informed that our prior target of handling 300,000 rows is no longer accurate. We now have 1.7 million records, and that is potentially just the tip of the iceberg. Do you think there's a chance that, by chunking, using arrays of arrays, and doing a lot of the math on the server side, it could handle this much data?

@projecthate
Author

Another issue we're having is that the PivotTable is not rendering within IE7 or 8.

I mean, I see the skeleton of a table which only says "Total"; no configuration options or data are visible.

Have you encountered this at all? I know from browsing to the demo that the PivotTable is supported, and just that the Google Charts portion is excluded, but I'm not sure what could be different between your demo page and my test environment that's breaking IE7/8. Any insights?

@nicolaskruchten
Owner

OK, lots going on here!

  1. clearly there are JSON-parsing issues at play here, which is a little outside of PivotTable-related support unfortunately

  2. 1.7M or even 300k rows is going to be a lot for a browser to handle, so I would recommend sending either a subset or a pre-aggregated set of rows from the server. This is generally the approach I take: do as much processing server-side as I can.

  3. I've no idea what you've changed in your environment that would be breaking IE, sorry :)

@gibito

gibito commented Nov 9, 2015

Hi @nicolaskruchten
After extensive testing, cardinality is definitely the culprit here. You can have 1,000,000 rows with recurrent items and all is fine. But with only 500 rows where each row has a different item, the browser hangs and crashes. Could you show us the points where we can insert defensive code to prevent this? Thanks.

@nicolaskruchten
Owner

I would be very interested in seeing a 500-row input which will cause this to hang! Can you share it please?

@gibito

gibito commented Nov 15, 2015

Simply construct an array like:

[
{color: "color1", shape: "shape1", owner: "owner1", ....},
....
{color: "color500", shape: "shape500", owner: "owner500", ....}
]

This does not allow any grouping, so table build-up and DOM node creation become the bottleneck.

@nicolaskruchten
Owner

Ah OK if you actually try to build a 10M-cell table, yeah, rendering will be a problem. But the initial parse of the data and default render (i.e. one cell containing the counts) will work etc.

@gibito

gibito commented Nov 16, 2015

Yes, parsing is not the issue. HTML table build-up and rendering are.

Have you considered using an alternative to the HTML table that uses virtual rendering of the viewport (e.g., SlickGrid)?

Or any other architectural strategy you can recommend to prevent the browser from hanging?

@nicolaskruchten
Owner

Certainly one could write a new Renderer which uses a different strategy than DOM operations to build a table. That said, I'm not sure how useful it is to be able to render tables that big and sparse, from a UI point of view, so I have not put any effort into solving this problem... :)

@gibito

gibito commented Nov 16, 2015

Fully agree that it is not useful. It is important, though, to prevent the user from hanging the browser by mistake (e.g., via automatic projections of rows in a spreadsheet). I am talking about defensive coding.

It would be great if you could point us to a place in the code where we can stop DOM operations in case of high cardinality. Thanks.

@nicolaskruchten
Owner

Sorry for the long delay in answering here.... You could stop DOM operations in the TableRenderer based on the number of rows/cols.
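
For example, a minimal sketch of such a guard, wrapping the built-in Table renderer (this assumes the PivotData object passed to renderers exposes getRowKeys() and getColKeys(); the 10,000-cell threshold is arbitrary):

var guardedRenderers = $.extend({}, $.pivotUtilities.renderers, {
    Table: function(pivotData, opts) {
        var numRows = pivotData.getRowKeys().length;
        var numCols = pivotData.getColKeys().length;
        // bail out before any DOM work if the table would be too big/sparse
        if (numRows * numCols > 10000) {
            return $("<div>").text("Refusing to render a " + numRows + " x " + numCols + " table.");
        }
        return $.pivotUtilities.renderers["Table"](pivotData, opts);
    }
});

// then pass renderers: guardedRenderers in the pivotUI() options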

@yolnda

yolnda commented Sep 26, 2016

Hi Nicolas,

How about getting the data from a URL, using pivotjs?

@nicolaskruchten
Owner

@yolnda please don't comment on old, closed issues. Please create a new issue with a clearer description of your question.
