Hi,
The serialization of dataframes from Python to Vega as JSON is very inefficient, even for smallish datasets.
The https://github.com/vidartf/ipydatawidgets library provides a mechanism to improve the serialization of numpy arrays, which is already a step forward.
For our project ProgressiVis, we are considering serializing a dataframe as a dictionary of columns (a column-wise representation), where each column can be compressed on the Python side and decompressed in JS according to its type.
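A minimal sketch of that dictionary-of-columns idea, using only numpy and the stdlib gzip module (the function names are hypothetical, and a real implementation would presumably pick a codec per column type rather than gzip everywhere):

```python
import gzip
import numpy as np

def serialize_columns(columns):
    """Serialize a dict of numpy columns: each column becomes gzip-compressed
    raw bytes plus a small header recording its dtype and length."""
    payload = {}
    for name, arr in columns.items():
        arr = np.ascontiguousarray(arr)  # ensure a flat, contiguous buffer
        payload[name] = {
            "dtype": str(arr.dtype),
            "length": len(arr),
            "data": gzip.compress(arr.tobytes()),
        }
    return payload

def deserialize_columns(payload):
    """Inverse operation: decompress each column back into a numpy array."""
    return {
        name: np.frombuffer(gzip.decompress(col["data"]), dtype=col["dtype"])
        for name, col in payload.items()
    }

# Example: a small two-column "dataframe"
cols = {"x": np.arange(1000, dtype="float64"),
        "y": np.arange(1000, dtype="int32")}
roundtrip = deserialize_columns(serialize_columns(cols))
```

On the JS side the same header would drive the choice of typed array (`Float64Array`, `Int32Array`, ...) after decompression.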
At the Vega level, converting a column-wise format into Vega's internal format has already been done for the "arrow" format in https://github.com/vega/vega-loader-arrow, so it would not be hard to do the same for a dictionary of columns.
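The conversion such a loader performs is essentially a transpose: the dictionary of columns becomes the list of row objects that Vega datasets consume. In Python terms (a hypothetical helper, purely for illustration):

```python
def columns_to_rows(columns):
    """Transpose a dict-of-columns into the row-oriented list of objects
    that a Vega dataset ultimately consumes."""
    names = list(columns)
    length = len(next(iter(columns.values())))
    return [{n: columns[n][i] for n in names} for i in range(length)]

rows = columns_to_rows({"a": [1, 2], "b": ["x", "y"]})
# rows == [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
```

In practice a loader can avoid materializing the rows and expose column accessors lazily, which is what vega-loader-arrow does for Arrow tables.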
As for compression, ipydatawidgets uses gzip, but there are other trade-offs to consider, such as lz4.
The implementation is not hard but could take a couple of weeks, and it would be great to be able to reuse it for other dataframe formats if possible (e.g. our progressive tables would use the same serialization format).
How important would that kind of optimization be for ipyvega/Altair: low priority? high priority? Is anyone else interested in improving the data serialization for other dataframe formats?
Best,
Jean-Danel
Adding to vega/altair#2471 (comment), I would say better serialization would be a great improvement and I am very supportive. I would suggest using Arrow as there is good support in Python and JS and more backends are adopting it as their internal representation.