FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466
Something that hasn't received enough attention/discussion, at least in my mind, is this piece of the Drawbacks section of the PDEP (bolding added by me):
I honestly don't understand how mandating a 170% increase in the effective size of a pandas installation (70MB to 190MB, from the numbers in the quoted text) can be considered okay. For that kind of increase, I would expect/want the tradeoff to be major improvements across the board. Instead, this change comes with limited benefit but massive bloat for anyone who doesn't need the features PyArrow enables, e.g. for those who don't have issues with the current functionality of pandas.
Debian/Ubuntu have system packages for pandas but not pyarrow, which would no longer be possible. (System packages are not allowed to depend on non-system packages.) I don't know whether creating a system package of pyarrow is possible with reasonable effort, or whether this would make the system pandas packages impossible to update (and eventually require their removal when old pandas was no longer compatible with current Python/numpy).
Yeah unfortunately this is where the subjective tradeoff comes into effect. pytz and dateutil as required dependencies have a similar issue for users who do not need timezone or date parsing support respectively. The hope with pyarrow is that the tradeoff improves the current functionality for common "object" types in pandas such as text, binary, decimal, and nested data.
AFAIK most pydata projects don't actually publish/manage Linux system packages for their respective libraries. Do you know how these are packaged today?
The pytz and dateutil wheels are only ~500kb. Drawing a comparison between them and PyArrow seems like a stretch, to put it lightly.
By whoever offers to do it, currently me for pandas. Of the pydata projects, Debian currently has pydata-sphinx-theme, sparse, patsy, xarray and numexpr. An old discussion thread (anyone can post there, but be warned that doing so will expose your non-spam-protected email address) suggests that there is existing work on a pyarrow Debian package, but I don't yet know whether it ever got far enough to work.
Hi, Thanks for welcoming feedback from the community. While I respect your decision, I am afraid that making
Package size
Have you considered those two observations as drawbacks before taking the decision?
This is discussed a bit in https://github.com/pandas-dev/pandas/pull/52711/files#diff-3fc3ce7b7d119c90be473d5d03d08d221571c67b4f3a9473c2363342328535b2R179-R193 While currently the build size for pyarrow is pretty large, it doesn't "have" to be that big. I think by pandas 3.0 (cc @jorisvandenbossche for more info on this) I'm not an Arrow dev myself, but if it is something that just needs someone to look at, I'm happy to put in some time to help give Arrow a nudge in the right direction. Finally, for clarity purposes, is the reason for concern also AWS lambda/pyodide/Alpine, or something else? (IMO, outside of stuff like lambda funcs, pyarrow isn't too egregious in terms of package size compared to pytorch/tensorflow, but it's definitely something that can be improved)
If
Edit: See conda-forge/arrow-cpp-feedstock#1035
Hi, Thanks for welcoming feedback from the community. With
There is another way - use virtual environments in user space instead of system python. The Python Software Foundation recommends users create virtual environments; and Debian/Ubuntu want users to leave the system python untouched to avoid breaking it. Perhaps Pandas could add some warnings or error messages on install to steer people to virtualenv. This approach might avoid, or at least defer, the work of adding pyarrow to APT as well as the risks of users breaking system python. Also, when I'm building projects I might want a much later version of pandas/pyarrow than would ever ship on Debian given the release strategy/timing delay. On the other hand, the arrow backend has significant advantages, and with the rise of other important packages in the data space that also use pyarrow (polars, dask, modin), perhaps there is sufficient reason to add pyarrow to APT sources. A good summary that might be worth checking out is Externally managed environments. The original PEP 668 is found here.
I think it's the right path for performance in WASM.
This is a good idea!
Regarding concat: This should already be zero copy:
This creates a new dataframe that has 2 pyarrow chunks. Can you open a separate issue if this is not what you are looking for? |
@phofl
If this happens, would
We're currently thinking about coercing strings in our library, but hesitating because of the unclear future here.
Arrow is a beast to build, and even harder to fit into a wheel properly (so you get less features, and things like using the slimmed-down libarrow will be harder to pull off). Conda-forge builds for py312 have been available for a month already though, and are ready in principle to ship pyarrow with a minimal libarrow. That still needs some usability improvements, but it's getting there.
Without weighing in on whether this is a good idea or a bad one, Fedora Linux already has a
I'm not saying that Pandas is easy to keep packaged, up to date, and coordinated with its dependencies and reverse dependencies! Just that a hard dependency on PyArrow wouldn't necessarily make the situation worse for us.
@h-vetinari Almost there? :-) |
There is still a lot of work to be done on the wheels side, but for conda, after the work we did to divide the CPP library, I created this PR which is currently under discussion in order to provide both a
Thanks for requesting feedback. I'm not well versed on the technicalities, but I strongly prefer to not require pyarrow as a dependency. It's better imo to let users choose to use PyArrow if they desire. I prefer to use the default NumPy object type or pandas' StringDtype without the added complexity of PyArrow.
Have you tried the layer in the link above? It is not going to be a 120 MB increase because AWS is not building a pyarrow wheel with all of the same options - looks like they remove Gandiva and Flight support
@WillAyd 179 MB is the layer's size. |
Very helpful thanks. And the size of your current pandas + numpy + botocore + fastparquet images are significantly smaller than that? |
I don't think that's a proper comparison, as AWS Data Wrangler will also have support to read parquet files, for which, for now, I resort to fastparquet for its smaller size.
Also, to fetch files from S3 while avoiding downloading the file and then loading it, s3fs is required, which I guess won't be required when using the AWS SDK (not sure though).
Yea, ultimately what I'm trying to gauge is how big of a difference it is. I don't have access to any lambda environments, but locally if I install your stack of pandas + numpy + fastparquet + botocore I get the following installation sizes in my site-packages folder:

```
75M  pandas
39M  numpy
37M  numpy.libs
25M  botocore
16M  pip
7.9M fastparquet
```

Adding up to almost 200 MB just from those packages alone. If AWS is already distributing an image with pyarrow that is smaller than this, then I'm unsure about the apprehension to this proposal on account of lambda environments. Is there a significant use case why users cannot use the already distributed AWS environment that includes pandas + pyarrow, and if so, why should that be something that holds pandas developers back from requiring pyarrow?
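Numbers like the ones above can be reproduced in-process with the standard library. A rough sketch (the function name is my own; it only counts files recorded in each distribution's RECORD, so bytecode caches and separately packaged pieces such as `numpy.libs` are not included):

```python
import importlib.metadata as md
from pathlib import Path


def installed_size_mb(dist_name: str) -> float:
    """Approximate on-disk size of one installed distribution, in MB."""
    dist = md.distribution(dist_name)
    total = 0
    # dist.files lists every file installed by the wheel (from RECORD).
    for f in dist.files or []:
        path = Path(dist.locate_file(f))
        if path.is_file():
            total += path.stat().st_size
    return total / 1_000_000
```

Usage would look like `print(f"{installed_size_mb('pandas'):.0f}M pandas")` for each package in the stack being compared.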
As of a few hours ago, there's a
The split of the cloud provider bindings out of core hasn't happened yet, but will further reduce the footprint once it happens.
I think the PDEP text wasn't precise here - pandas and numpy each require about 70MB (in fact, a bit more now, I just checked). So the percentage of the increase is more like 82% - not 170%. Still quite a lot, I don't mean to minimise it, but a lot less than has been stated here. It's good to see that on the conda-forge side, things have become smaller. For the PyPI package, however, my understanding is that this is unlikely to happen any time soon
I just tried this, and indeed, it works - pandas 2.2.2 and pyarrow 14.0.1 are included. I don't think it's as flexible as being able to install whichever versions you want, but it does seem like there is a workable way to use pandas in Lambda |
**warning:** Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466

```py
import pandas as pd
```

Signed-off-by: Avelino <[email protected]>
I would ask the pandas developers to consider the impact of this decision on PyScript/Pyodide. The ability to develop statistical tools that can be deployed as a web app (where it is using their CPU and not a server) is a game changer, but it does mean the web browser is downloading all the packages the site needs. I'd also note that many packages (e.g., Scipy) require numpy, so the likely result is that both packages will end up being downloaded. I'd also ask the developers to consider numba (outside the WASM environment). A lot of scientific code is accelerated by numba, which implements parts of numpy (among other things). My point is that it is unlikely this code can just be replaced with pyarrow code. Again, both will end up being installed.
I think more people will comment on this in the form of backlash when they realize it has been done without them being aware. While we understand the value of PyArrow, it is not an absolute necessity for pandas, as demonstrated by historical performance and adoption. PyArrow is already available for those that need/want it. Pandas should have pyarrow integration, but not as a requirement for Pandas to function. As a pyodide/wasm developer, I can attest that payload size is paramount. Pyarrow is just too big. Make the PyArrow integration easy, but not mandatory. Think about more than the big data use case.
Updating to
uninstalling
The point is that an extra dependency (especially such a huge one) increases fragility.
Hi,
Not to beat a dead horse, but.... I use Pandas in multiple projects, and each project has a virtual environment. Every new major version of Python gets a virtual environment for testing the new version too. The size of these projects is not huge, but now they have all increased massively, and the storage requirement for projects has increased almost exponentially. Just something to keep in mind. I know there is talk of pyarrow being reduced in size too, which would be great. I admit, I have not read the full discussion, so this may have been covered already, and I apologize if it has been.
Hi all – not to segue into the discussion about the increase in bandwidth usage and download sizes since many others have put out their thoughts about that already, but PyArrow in Pyodide has been merged and will be available in the next release: pyodide/pyodide#4950 |
I find this error in the lab of module 2, course 3 of the data science course: `:1: DeprecationWarning` when running `import pandas as pd  # import library to read data into dataframe`
It's a bit unfortunate that with |
Reading this thread, it appears that after more than 12 months of collecting feedback, most comments are not in favor of pyarrow being a dependency, or at least voice some concern. I haven't done a formal

Concerns
Suggested paths forward

a. Make it easy to use pandas with pyarrow, yet keep it an optional dependency

(I may be biased in summarizing this; anyone feel free to correct this if you find your analysis is different)

Since this is a solicited feedback channel established for the community to share their thoughts regarding PDEP-10, (how) will the decision be reconsidered @phofl? Thank you for all your efforts.
There is an open PDEP under consideration to reject PDEP-10: #58623 If (when?) it gets finalized, it'll get put to a vote.
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at pandas-dev/pandas#54466

```py
import pandas as pd
```
This is an issue to collect feedback on the decision to make PyArrow a required dependency and to infer strings as PyArrow backed strings by default.
The background for this decision can be found here: https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
If you would like to filter this warning without installing pyarrow at this time, please view this comment: #54466 (comment)
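For reference, filtering the warning comes down to registering a `warnings` filter before pandas is imported (the warning is raised at import time). A minimal stdlib-only sketch; the message pattern is copied from the warning text quoted above:

```python
import warnings

# Suppress the pandas 3.0 pyarrow DeprecationWarning without installing
# pyarrow. This must run before `import pandas`, since the warning is
# emitted when the package is first imported.
warnings.filterwarnings(
    "ignore",
    message=r".*Pyarrow will become a required dependency of pandas.*",
    category=DeprecationWarning,
)

# import pandas as pd  # imported afterwards, with the warning suppressed
```

Other unrelated DeprecationWarnings are unaffected, since the filter only matches this message pattern.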