Improve serialization of Pandas DataFrames to ipyvega #346
Conversation
…ting the console. Add python doc.
This pull request fixes 6 alerts when merging a814023 into 7cda844 - view on LGTM.com.
Thank you for the pull request. I think this is a good start but I would like to discuss a few options with you.
Can you explain where these copies are happening with Arrow? When we use JSON, we already have to make a copy from pandas, no?
What do you mean you don't support multiple dataframes? I think it would be good if we could send data separately from the spec and support multiple datasets.
Better support specifically for Altair is great. Have you adopted the idea of separating data from specs I implemented in https://github.com/streamlit/streamlit/blob/96f17464a13969ecdfe31043ab8e875718eb0d10/lib/streamlit/elements/altair.py#L315?
Does this have much benefit over transparent gzip compression over HTTP? How big is the overhead for compression/decompression and the additional copies we make when we compress data?
No. For int and float columns, there is no copy:
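As a rough illustration (not the actual code in this PR), a plain numeric column can be exposed through its underlying NumPy buffer and serialized via the buffer protocol without any copy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000_000, dtype="int64"),
                   "y": np.random.rand(1_000_000)})

col = df["x"].to_numpy()          # a view on pandas' internal block, no copy for plain numeric dtypes
assert np.shares_memory(col, df["x"].values)

buf = memoryview(col)             # buffer protocol: zero-copy view over the raw bytes
print(buf.nbytes, col.dtype.str)  # 8000000 '<i8' -- enough metadata to rebuild the column client-side
```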
Currently, our proof of concept is based on the streaming API and we only send one dataframe at a time with the `update_dataframe` call.
No. Thanks for pointing to that mechanism; I will see how we can combine it with ours. Our examples currently use a similar but less flexible approach.
Yes it does, see: https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/ The compression "codecs" (such as lz4 or zlib) should be part of the library and pre-selected for casual users. If you know the distribution profile of your data column, a specific codec can really make a huge difference (see e.g. https://github.com/lemire/FastPFor). Standard gzip compression used in HTTP is not efficient and flexible enough to accommodate data characteristics.
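As a rough illustration of why the codec matters, compressing a very regular int64 column gives very different sizes depending on the codec (this assumes the third-party `lz4` package is installed; a delta/PFor-style encoding as in FastPFor would do even better on such data):

```python
import zlib
import numpy as np
import lz4.frame   # third-party 'lz4' package, assumed to be available

col = np.arange(1_000_000, dtype="int64")   # very regular data, like an index column
raw = col.tobytes()

print("raw  bytes:", len(raw))
print("zlib bytes:", len(zlib.compress(raw, 6)))
print("lz4  bytes:", len(lz4.frame.compress(raw)))
```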
This pull request fixes 6 alerts when merging 283fe75 into 7cda844 - view on LGTM.com.
Thanks for the notes. I think I would personally still prefer Arrow since it encodes pretty efficiently and is well supported. It will make it a lot easier to maintain ipyvega in the future if we don't roll our own serialization.
Can you merge master to make this pull request ready? In particular, we should not be updating Vega-Lite in this pull request as Altair is still on Vega-Lite 4 for now and I want to coordinate the update with them.
Thank you for the pull request. I would love to get this in but it's not quite ready yet.
- What is `.gitmodules`?
- Address all the comments in this pull request.
Thank you for making the updates @xtianpoli! Let me know when you are done and want me to make another review.
All the comments have been addressed.
I want to review this but somehow my python setup is botched and now I run into #418 (comment). Stay tuned.
Thank you! I also added you as maintainers of this repo so you can triage issues in the issue tracker.
With @xtianpoli, we have implemented an improved serialization of Pandas DataFrames for ipyvega.
It is not complete yet: we still need to follow Altair's rules for column type conversions, but we already get a noticeable speedup compared to the current version, which sends verbose JSON.
On the Python side, for the VegaWidget, we have implemented a new traitlet type for a "Table". It is a dictionary of columns (a columnar representation) of a Pandas DataFrame (and potentially other tables) in which we try to avoid copying anything: each entry simply points to the low-level NumPy array managed by Pandas, which can be serialized without a copy using the buffer protocol.
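A minimal sketch of the idea, using hypothetical names rather than the actual traitlet implementation: each numeric column is described by its dtype, its length, and a zero-copy view of its buffer.

```python
import numpy as np
import pandas as pd

def dataframe_to_table(df: pd.DataFrame) -> dict:
    """Build a columnar, copy-free description of a DataFrame (illustrative only)."""
    columns = {}
    for name in df.columns:
        arr = df[name].to_numpy()        # view, no copy for int/float columns
        # string/object columns would need a dedicated encoding; only numeric columns are handled here
        columns[name] = {
            "dtype": arr.dtype.str,      # e.g. '<i8', '<f8'
            "length": len(arr),
            "buffer": memoryview(arr),   # handed to the widget's binary channel
        }
    return {"columns": columns}
```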
Additionally, each column can be compressed with either zlib or lz4, which speeds up the transfer of large columns.
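A hypothetical sketch of this per-column compression step (the codec tags and helper names are illustrative, not the actual implementation):

```python
import zlib

def compress_column(column: dict, codec: str = "zlib") -> dict:
    """Compress one column payload and tag it with its codec (illustrative only)."""
    data = bytes(column["buffer"])       # materialize the bytes for the compressor
    if codec == "zlib":
        payload = zlib.compress(data)
    elif codec == "lz4":
        import lz4.frame                 # optional third-party dependency
        payload = lz4.frame.compress(data)
    else:
        codec, payload = "raw", data     # fall back to sending the raw buffer
    return {**column, "codec": codec, "buffer": payload}
```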
On the JavaScript side, we translate the columnar table format into Vega's internal tuples, again avoiding copies when possible.
Note that this serialization is only used by the streaming API since it requires our traitlet in the VegaWidget; it cannot work inside a Vega dataset.
Let us know if you disagree with our design choices.
There are a few open issues, such as sending multiple DataFrames, which is not supported (yet). If you see a clean way to do it, let us know.
We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair mechanism for sending data to the browser with ours. It would boil down to this: when an Altair-generated JSON spec is detected by the notebook, wrap it in a VegaWidget and call update_dataframe on the Pandas DataFrame immediately after, as sketched below. If that can be done, Altair would be transparently accelerated and able to support much larger datasets.
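A hedged sketch of what that could look like; the `VegaWidget` constructor and `update_dataframe` signature here are assumptions based on this PR's streaming API, not a confirmed interface.

```python
import altair as alt
import pandas as pd
from IPython.display import display
from vega.widget import VegaWidget   # import path assumed

df = pd.DataFrame({"x": range(1000), "y": range(1000)})
chart = alt.Chart(df).mark_line().encode(x="x", y="y")

spec = chart.to_dict()               # Altair-generated Vega-Lite spec (inline data would be stripped in practice)
widget = VegaWidget(spec)            # constructor signature assumed
widget.update_dataframe(df)          # stream the DataFrame separately from the spec
display(widget)
```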
There are new notebooks showcasing the new capabilities and their performance.
We did not use Apache Arrow as an intermediary format since it would always make a copy; because we want to handle large datasets, we want to avoid copying them in the first place.
Looking forward to your comments, questions, and thoughts.
Best,
Jean-Daniel & Christian