Improve serialization of Pandas DataFrames to ipyvega #345

Closed
jdfekete opened this issue Jun 4, 2021 · 3 comments

@jdfekete
Collaborator

jdfekete commented Jun 4, 2021

Hi,
Serializing DataFrames from Python to Vega as JSON is very inefficient, even for smallish datasets.
The ipydatawidgets project (https://github.com/vidartf/ipydatawidgets) provides a mechanism to improve the serialization of NumPy arrays, which is already a step forward.
For our project ProgressiVis, we are considering serializing a DataFrame as a dictionary of columns (a column-wise representation), where each column can be compressed in Python and decompressed in JS according to its type.
At the Vega level, converting a column-wise format to Vega's internal format has already been done for the Arrow format in https://github.com/vega/vega-loader-arrow, so it would not be hard to do the same for a dictionary of columns.
In between, ipydatawidgets uses gzip compression, but there are alternatives with other trade-offs, such as lz4.
The implementation is not hard but could take a couple of weeks, and it would be great to be able to reuse it to send other dataframe formats if possible (e.g. our progressive tables would use the same serialization format).
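
To make the column-wise idea concrete, here is a minimal sketch of what the Python side might look like. It is only an illustration under stated assumptions: the function names `encode_columns` and `decode_columns` and the payload layout are hypothetical, `zlib` stands in for whichever codec is eventually chosen (gzip, lz4, ...), and base64-in-JSON is used for simplicity where a real widget would more likely send binary buffers.

```python
import base64
import json
import zlib

import numpy as np
import pandas as pd


def encode_columns(df: pd.DataFrame, level: int = 6) -> str:
    """Encode a DataFrame as a JSON dict of columns (hypothetical format)."""
    payload = {}
    for name, col in df.items():
        values = col.to_numpy()
        if values.dtype.kind in "iuf":
            # Fixed-width numeric column: compress the raw bytes.
            raw = np.ascontiguousarray(values).tobytes()
            packed = zlib.compress(raw, level)
            payload[str(name)] = {
                "dtype": values.dtype.str,  # e.g. "<f8", needed to decode
                "data": base64.b64encode(packed).decode("ascii"),
            }
        else:
            # Anything else (strings, objects): fall back to a plain JSON list.
            payload[str(name)] = {"dtype": "json", "data": col.tolist()}
    return json.dumps(payload)


def decode_columns(text: str) -> pd.DataFrame:
    """Inverse of encode_columns; the JS side would mirror this logic."""
    columns = {}
    for name, spec in json.loads(text).items():
        if spec["dtype"] == "json":
            columns[name] = spec["data"]
        else:
            raw = zlib.decompress(base64.b64decode(spec["data"]))
            columns[name] = np.frombuffer(raw, dtype=spec["dtype"])
    return pd.DataFrame(columns)


df = pd.DataFrame({"x": np.arange(1000), "label": ["a"] * 1000})
restored = decode_columns(encode_columns(df))
```

Choosing the encoding per column is what lets each type use the representation that suits it; the sketch only distinguishes numeric columns from everything else.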

How important would that kind of optimization be for ipyvega/Altair: low priority or high priority? Is anyone else interested in improving the data serialization for other dataframe formats?

Best,
Jean-Daniel

@domoritz
Member

domoritz commented Jun 4, 2021

Adding to vega/altair#2471 (comment), I would say better serialization would be a great improvement and I am very supportive. I would suggest using Arrow as there is good support in Python and JS and more backends are adopting it as their internal representation.

@domoritz
Member

Done in #346 🎉

@domoritz
Member

Version 4.0 with this feature is released. Thanks for all your work on this.
