Improve serialization of Pandas DataFrames to ipyvega #2471
More efficient data serialization would be useful, but such changes would first have to be supported in Vega-Lite.
Thanks @jdfekete for raising the issue, and also flagging @domoritz. Scale is a recurring issue for Altair users, at least as evidenced in my visualization courses at UW. (Some students benefit from the altair data server package, but that is not a one-size-fits-all solution.) Right now the scalability experience in Observable notebooks (where the data is already in JS) is often much better than with Altair due to this serialization overhead.

While I agree with @jakevdp that more might be done in Vega itself, perhaps there is also space for handling data serialization in the generated HTML/JS prior to invoking Vega/Vega-Lite. For example, one could imagine serializing a data table to an Apache Arrow byte array in Python and then passing that instead (even if only as a base64-encoded string) to be deserialized using the Arrow JS or Arquero libraries. If so, it seems to me the costs involved would largely be (1) having to load additional JS libraries client side, and (2) format-contingent HTML/JS code generation for deserializing data before passing it to Vega. How feasible might it be to have some kind of small plug-in system in Altair and/or ipyvega that allowed customized code for (a) serializing data on the Python side, and (b) adding library imports and deserialization code on the client side?
I absolutely agree that improving data serialization would be a huge improvement. The way I see it, Altair is a Python API to generate Vega-Lite specs, and these specs can be rendered on different platforms. Therefore, we may need to look at each of the platforms and improve serialization there. When I was working with Streamlit, I added some code to separate the data from the chart specification so that the data can be sent as an Arrow table. You can see how I did it at https://github.com/streamlit/streamlit/blob/9714e3e6f852c26e3f8a155d39c2d5028dff1d71/lib/streamlit/elements/altair.py#L305. We could do something similar in ipyvega (vega/ipyvega#345). I think sending the data as Arrow makes the most sense since it's columnar and binary, so e.g. floating-point numbers are much more compact than as strings. I don't think the overhead of Arrow JS in ipyvega is too large, so I think we could always add it. We should measure the impact of serialization/deserialization compared to JSON to determine whether we want a flag to control whether the data is transferred as Arrow or JSON.
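The "separate the data from the chart specification" step mentioned above can be illustrated with a small sketch. This is not the Streamlit implementation; the function name and dataset name are hypothetical, and the spec is a toy Vega-Lite dict.

```python
# Hypothetical sketch: split inline rows out of a Vega-Lite spec dict,
# replacing them with a named data reference so the rows can travel
# separately (e.g. as an Arrow table) and be attached client-side.
def split_data_from_spec(spec, name="chart-data"):
    """Return (spec_without_rows, rows) without mutating the input."""
    lean = dict(spec)  # shallow copy; only the "data" key is replaced
    rows = lean.get("data", {}).get("values", [])
    lean["data"] = {"name": name}  # named dataset, resolved at render time
    return lean, rows

spec = {
    "mark": "point",
    "encoding": {"x": {"field": "a", "type": "quantitative"}},
    "data": {"values": [{"a": 1}, {"a": 2}, {"a": 3}]},
}
lean_spec, rows = split_data_from_spec(spec)
# `lean_spec` now references the data by name only; `rows` can be
# serialized once, in a compact binary format, instead of inline JSON.
```

Vega-Lite's named-data mechanism is what makes this split possible: the renderer can be handed the dataset through its API rather than through the spec itself.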
Closing this as there is nothing to do on the Altair side of things. See vega/ipyvega#346 for the current progress on this feature.
I also want to point to https://vegafusion.io, which not only has efficient transport but also offloads computation to the backend, making charts much more responsive.
Hi,
Thanks for Altair. I have created a feature request issue for ipyvega that could also impact Altair:
vega/ipyvega#345
It boils down to creating a serializer to efficiently send a Pandas DataFrame to Vega. Currently, the communication in notebooks between Python and JS is very inefficient, especially with the verbose row-wise JSON format. It limits the amount of data that can reasonably be sent to JS, and limits the perceived performance of Altair.
I am interested in finding out whether this point is critical, important, or merely secondary to Altair's adoption. I think the limitation on data size is an issue, but I may be biased. Please comment on my feature request so I can decide how to address it.
Thanks in advance,
Jean-Daniel