DISC: nanoarrow-backed ArrowStringArray #58552

Open
3 tasks done
WillAyd opened this issue May 3, 2024 · 9 comments
Labels
Arrow pyarrow functionality Enhancement

Comments

@WillAyd
Member

WillAyd commented May 3, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wanted to open a formal issue about the possibility of using nanoarrow to back the ArrowStringArray class we have today. This could also help us move forward with pandas 3.x if we decide to drop the pyarrow requirement.

What is nanoarrow?
nanoarrow is a small, lightweight library used to generate data that follows the Arrow format specification. It can be used by libraries that want to work with Arrow but do not want to take on the dependency of the larger Arrow code base. The Arrow ADBC library is an example where this has already been used.
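
To make "data that follows the Arrow format specification" concrete, here is a minimal pure-Python sketch of the layout Arrow uses for a string array: a validity buffer, an offsets buffer with one more entry than there are elements, and a single buffer of concatenated UTF-8 bytes. The function names `encode_strings` and `get_item` are hypothetical and exist only for illustration. This is not nanoarrow's API, and real Arrow packs validity into a bitmap rather than a list of booleans.

```python
def encode_strings(values):
    """Pack Python strings into Arrow-style (validity, offsets, data) buffers."""
    validity = []       # one flag per element; Arrow stores this as a packed bitmap
    offsets = [0]       # n + 1 entries; 64-bit integers for the large_string type
    data = bytearray()  # UTF-8 bytes of every non-null string, back to back
    for value in values:
        validity.append(value is not None)
        if value is not None:
            data += value.encode("utf-8")
        # A null element repeats the previous offset, so it occupies zero bytes.
        offsets.append(len(data))
    return validity, offsets, bytes(data)


def get_item(buffers, i):
    """Read element i back out of the buffers (None for a null)."""
    validity, offsets, data = buffers
    if not validity[i]:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


buffers = encode_strings(["x", "aaa", None, "fooooo"])
```

With this layout, the byte length of element `i` is just `offsets[i + 1] - offsets[i]`, so some operations can be computed without creating any Python string objects.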

How could we leverage nanoarrow within pandas?
Sparing implementation details, a reasonable usage directly within our code base would be to change our existing ArrowStringArray class. Today the constructor looks something like:

def __init__(self, values) -> None:
    _chk_pyarrow_available()
    if isinstance(values, (pa.Array, pa.ChunkedArray)) and pa.types.is_string(
        values.type
    ):
        values = pc.cast(values, pa.large_string())

    ...

In theory we could do something like:

def __init__(self, values) -> None:
    # Remember which backend is in use so that other methods can dispatch on it
    self._uses_pyarrow = pa_installed()
    if self._uses_pyarrow:
        if isinstance(values, (pa.Array, pa.ChunkedArray)) and pa.types.is_string(
            values.type
        ):
            values = pc.cast(values, pa.large_string())
    else:
        values = NanoStringArray(values)

    ...

In each method, our internal ArrowStringArray would prioritize pyarrow algorithms if installed, but could fall back to our own functions implemented using nanoarrow (or raise if building such a function is impractical).

def _str_isalnum(self):
    if self._uses_pyarrow:
        result = pc.utf8_is_alnum(self._pa_array)
        return self._result_converter(result)

    # nanoarrow fallback
    return self._pa_array.isalnum()

This repurposes the internal self._pa_array to actually refer to an Arrow array and not necessarily an Arrow array created by pyarrow. That can definitely be a point of confusion if a developer is not aware.
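
The dispatch pattern described above can be tried in isolation with a small, self-contained sketch. The classes `FallbackStringArray` and `SketchStringArray` are hypothetical stand-ins, not pandas or nanopandas code, and the fallback here is plain Python rather than nanoarrow. The only real library calls are `pa.array`, `pa.large_string` and `pc.utf8_is_alnum`, which are used only when pyarrow is importable.

```python
try:
    import pyarrow as pa
    import pyarrow.compute as pc

    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


class FallbackStringArray:
    """Hypothetical stand-in for a nanoarrow-backed string array."""

    def __init__(self, values):
        self._values = list(values)

    def isalnum(self):
        # Nulls propagate as None, mirroring Arrow's null handling.
        return [None if v is None else v.isalnum() for v in self._values]


class SketchStringArray:
    """Illustrative only: prefer pyarrow kernels, fall back when it is missing."""

    def __init__(self, values):
        self._uses_pyarrow = HAS_PYARROW
        if self._uses_pyarrow:
            self._pa_array = pa.array(values, type=pa.large_string())
        else:
            # Same attribute name, but no longer a pyarrow object.
            self._pa_array = FallbackStringArray(values)

    def _str_isalnum(self):
        if self._uses_pyarrow:
            return pc.utf8_is_alnum(self._pa_array).to_pylist()
        return self._pa_array.isalnum()


arr = SketchStringArray(["abc", "a b", None])
result = arr._str_isalnum()
```

Either branch yields the same values for this input, which is the property the proposal relies on: callers see one array type regardless of what is installed.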

Would this be a new dtype?
No, which is what makes this very distinct from other solutions to the pyarrow installation problem. Whether pyarrow is installed or we use nanoarrow behind the scenes, the theory is that we produce Arrow arrays and operate against them. Alternate solutions to the pyarrow installation problem start to direct users towards using different data types; this solution is merely an implementation detail.

It may be confusing that we named our data types "string[pyarrow]" when they could be produced without pyarrow. If we had named them "string[arrow]" it would have abstracted this fully; with that said, I don't think it is worth changing.

Would we need to vendor nanoarrow?
No, assuming we drop setuptools support. Historically, when pandas has taken on third party C/C++ libraries, we have copied the files into our code base and maintained them from there. With Meson, we can leverage the Meson wrap system instead.

Does this require any new tooling within pandas?
Not really. The expectation is that the algorithms we need would be implemented in C++. pandas already requires a C++ compiler, and the libraries we would need to produce C++ extensions should be installable via Meson.

How do we know this could work for pandas?
I wrote a proof of concept for this a few months back - https://github.com/WillAyd/nanopandas

Getting that to work with pandas directly was challenging because pandas currently requires EAs to subclass a Python class, which the C++ extension could not do. I do not expect this would be a problem if we decided to use nanoarrow directly within our existing ArrowStringArray class instead of trying to register as an EA.

How fast will it be?
My expectation is that performance will fall somewhere in between where we are today and where pyarrow gets us. Pyarrow offers a lot of optimizations and the goal of this is not to try and match those. Users are still encouraged to install pyarrow; this would only be a fallback for cases where pyarrow installation is not feasible.

Is this a long term solution?
I don't think so. I really want us to align on leveraging all the great work that Arrow/pyarrow has to offer. I only consider this a stepping stone to get past our current 3.x bottleneck and move to a more Arrow-centric future, assuming:

  1. We continually encourage users to install pyarrow with pandas
  2. Pyarrow installation becomes less of a concern for users over time (either by pyarrow getting smaller, container environments getting bigger, and/or legacy platforms dying off)

However, even if/when this nanostring code goes away, I do think there are not-yet-known future capabilities that can be implemented using the nanoarrow library utilized here.

How much larger would this make the pandas installation?

From the nanopandas POC project listed above, release artifacts show the following sizes for me locally:

  • nanobind static library - 376K
  • nanoarrow static library - 96K
  • utf8proc static library - 340K
  • nanopandas shared library - 748K

So overall I would expect ~1.5 MB increase. (for users that care - utf8proc is the UTF8 library Arrow uses. nanobind is a C++ extension binding library)
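
As a quick check of that estimate, the listed sizes can simply be summed. This sketch assumes the "K" figures are KiB and that the artifact sizes add up the way the post treats them; it does not measure anything itself.

```python
# Artifact sizes as reported above, assumed to be in KiB.
sizes_kib = {
    "nanobind static library": 376,
    "nanoarrow static library": 96,
    "utf8proc static library": 340,
    "nanopandas shared library": 748,
}

total_kib = sum(sizes_kib.values())
total_mib = total_kib / 1024

print(f"{total_kib} KiB is about {total_mib:.2f} MiB")
```

The total is 1560 KiB, or roughly 1.52 MiB, which matches the ~1.5 MB figure.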

What are the downsides?
As a team we have not historically created our own C++ extensions; adding a new language to the mix is not something that should be taken lightly. I think the flip side to this is that we already have C extensions with the same issue around maintenance, so I am not really sure how to measure this issue.

The library used to bridge Python/C++, nanobind, does not offer first class support for Meson. I believe Meson can still handle this robustly given its wrap system, and it's something that has been discussed upstream in nanobind, but it is still worth calling out as a risk.

Feature Description

See above

Alternative Solutions

#58503

#58551

Additional Context

No response

@WillAyd WillAyd added Enhancement Arrow pyarrow functionality labels May 3, 2024
@WillAyd
Member Author

WillAyd commented May 3, 2024

Yet another possible solution to the 3.0 string problems that @simonjayhawkins @jorisvandenbossche @phofl @MarcoGorelli @lithomas1 have been actively discussing. I think the main point of this discussion is that there would be no new string dtypes - we just manage pyarrow installation as an implementation detail.

@WillAyd
Member Author

WillAyd commented May 6, 2024

I did not open a PR with everything else going on but you can see an initial diff of this here:

https://github.com/pandas-dev/pandas/compare/pandas-dev:pandas:d765547...WillAyd:pandas:nanopd-integration?expand=1

If you check out that branch, some cherry-picked methods / accessors "work" (the return types are native nanopandas types and need effort to wrap properly):

>>> import pandas as pd
>>> ser = pd.Series(["x", "aaa", "fooooo"], dtype="string[pyarrow]")
>>> ser.str.upper()
StringArray
["X", "AAA", "FOOOOO"]
>>> ser.str.len()
Int64Array
[1, 3, 6]
>>> ser.iloc[0]
'x'
>>> ser.iloc[1]
'aaa'
>>> ser.iloc[2]
'fooooo'
>>> ser.iloc[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/willayd/clones/pandas/pandas/core/indexing.py", line 1193, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/willayd/clones/pandas/pandas/core/indexing.py", line 1754, in _getitem_axis
    self._validate_integer(key, axis)
  File "/home/willayd/clones/pandas/pandas/core/indexing.py", line 1687, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

I have not put any effort into trying to optimize nanopandas, but out of the box performance is better than our status quo:

In [1]: import pandas as pd

In [2]: ser = pd.Series(["a", "bbbbb", "cc"] * 100_000)

In [3]: %timeit ser.str.len()
82.4 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit ser.str.upper()
52.8 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: ser = pd.Series(["a", "bbbbb", "cc"] * 100_000, dtype="string[pyarrow]")

In [6]: %timeit ser.str.len()
3.95 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit ser.str.upper()
26.7 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: import pyarrow
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[6], line 1
----> 1 import pyarrow

ModuleNotFoundError: No module named 'pyarrow'
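
To put the timings above side by side, the ratios can be computed directly from the reported means. The numbers are copied from the session above (default object dtype versus dtype="string[pyarrow]" on the branch, with pyarrow not installed); nothing here re-runs the benchmark.

```python
# (status quo ms, nanopandas-backed ms), mean values from the %timeit output above.
timings_ms = {
    "len": (82.4, 3.95),
    "upper": (52.8, 26.7),
}

speedups = {
    op: round(baseline / nano, 1) for op, (baseline, nano) in timings_ms.items()
}

for op, factor in speedups.items():
    print(f"str.{op}: about {factor}x faster")
```

On these figures `str.len` is about 20.9x faster and `str.upper` about 2.0x faster. These are single-machine numbers on one small benchmark, so they are indicative only.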

@WillAyd
Member Author

WillAyd commented May 6, 2024

...and I just now fixed things so that a Series can be built from nanopandas arrays. The bool / integer results are inefficient because they go nanopandas -> Python -> NumPy, but that could be implemented upstream in nanopandas without a lot of effort:

In [1]: import pandas as pd

In [2]: ser = pd.Series(["a", "bbbbb", "cc", None], dtype="string[pyarrow]")

In [3]: ser.str.len()
Out[3]: 
0       1
1       5
2       2
3    <NA>
dtype: Int64

In [4]: ser.str.isalnum()
Out[4]: 
0    True
1    True
2    True
3    <NA>
dtype: boolean

In [5]: import pyarrow as pa
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[5], line 1
----> 1 import pyarrow as pa

ModuleNotFoundError: No module named 'pyarrow'

@jorisvandenbossche
Member

> Yet another possible solution to the 3.0 string problems

I personally agree with you this is an interesting option to explore, but I would put it as "yet another possible solution for after 3.0" (that's how I also described it in the current PDEP text). At least for the current timeline of 3.0 (even if this is "a couple of months"), I don't think such a big change is realistic so quickly.

> I think the main point of this discussion is that there would be no new string dtypes - we just manage pyarrow installation as an implementation detail.

FWIW, this is not necessarily unique to this solution. Also for the object-dtype vs pyarrow backends, we could do this without having two dtype variants, but by just making this choice behind the scenes automatically. It is our choice how we implement this.
(Of course, for object dtype vs pyarrow the difference is bigger, because the stored memory is also different and not only the functions that are called on it, but nothing technically prevents us from following the same approach.)

@WillAyd
Member Author

WillAyd commented May 7, 2024

Cool, thanks for clarifying. Just to be clear on my expectations, if we were interested in this I would propose that we just spend however much time we feel we need on it before releasing 3.0. One of the major points is to avoid yet another string dtype, but if we've already released 3.0 with one then I don't know that this would be worth it.

> Also for the object-dtype vs pyarrow backends, we could do this without having two dtype variants, but by just making this choice behind the scenes automatically

As PDEP 14 is written now, I don't think this is true, or it is at least arguably misleading. Yes, we do have "string" today (and have for quite some time), so we can avoid adding a new data type by repurposing the old one and changing the NA sentinel, but I think that can easily yield more problems.

Maybe it is more accurate to say no new string dtypes and no repurposing / breakage of existing types

@jorisvandenbossche
Member

> but if we've already released 3.0 with one then I don't know this would be worth it

AFAIU the main benefit of using nanoarrow would be to avoid falling back to an object-dtype based implementation (and so also give some performance and memory improvements in case pyarrow is not installed). That benefit is equally true before or after 3.0?

> Maybe it is more accurate to say no new string dtypes and no repurposing / breakage of existing types

That is only true if it would use NA semantics, which I assume you are assuming here? But that is one of the main points being proposed by the PDEP: if we introduce a string dtype for 3.0, it will use NaN. And so if we were to introduce a nanoarrow-backed version for 3.0, which you propose here, it would IMO also have to use NaN. And at that point you have all the same issues / discussions about dtype variants.

@WillAyd
Member Author

WillAyd commented May 7, 2024

> AFAIU the main benefit of using nanoarrow would be to avoid falling back to an object-dtype based implementation (and so also give some performance and memory improvements in case pyarrow is not installed). That benefit is equally true before or after 3.0?

Yea for sure. I think it's just the issue of there being so many possible string implementations, each with their own merits. I don't think it's worth just adding one because it offers some incremental value after 3.0; I think it being a solution to the 3.0 problem and not requiring any other changes is the main draw.

> That is only true if it would use NA semantics, which I assume you are assuming here?

Yea that's correct. This is totally separate from any NA / np.nan discussions.

> if we introduce a string dtype for 3.0, it will use NaN.

That is a breaking change for dtype="string", so with that proposal we either just do that or go through a deprecation cycle. This has the advantage of requiring neither.

@jorisvandenbossche
Member

> I think it being a solution to the 3.0 problem ...
> ..
> This is totally separate from any NA / np.nan discussions.

In that sense, for me it's entirely not a solution for the 3.0 problem, because in my mind the 3.0 problem is that we need to live with NaN being the default sentinel ;)

@WillAyd
Member Author

WillAyd commented May 7, 2024

Ah OK. Well, let's continue that conversation on the PDEP itself so we don't get too fragmented. But I appreciate the responses here so far.
