Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Statically link perspective-python and libarrow #1290

Merged
merged 7 commits into from
Jan 17, 2021
Merged

Conversation

sc1f
Copy link
Contributor

@sc1f sc1f commented Jan 12, 2021

This PR builds Apache Arrow as a static library and reconfigures perspective-python to no longer link against a version of PyArrow present on the system/installed as part of perspective-python dependencies.

Instead, Perspective's C++ library will link against the pre-compiled, minimal arrow static library, and the user is free to use any version of PyArrow they choose to install as long as the Arrow binary format is compatible with the binaries output by Perspective's version of Arrow, which is v1.0.1 at this time.

At present, Perspective does not import pyarrow or use any of Arrow's Python-specific C++ code. If we were to use PyArrow in the Python runtime, however (like in the test suite), Perspective's C++ version of Arrow will continue to work as expected—perspective-python will use its own version of Arrow, and pyarrow will use the libarrow.so that shipped with the PyArrow install. This means further integration with PyArrow is not blocked/degraded, and if anything should be accelerated as we now have a consistent Arrow version across all Perspective runtimes.

In effect, we have taken a "pinned" version of PyArrow that used to depend on external assets we could not fully/consistently control, and turned it into a pinned version of PyArrow that we have full management and control over.

Benefits

With this change, the complicated and admittedly brittle code we have in place to deal with detecting PyArrow's install location, managing PyArrow versions, copying Arrow DLLs (on Windows), etc. can be removed, as Perspective's build no longer depends on the existence and correct versioning of an asset we have little control over.

Additionally, users can install the newest version of PyArrow in order to gain access to features/fixes without worrying about Perspective compatibility beyond the binary format, which is expected to remain stable and compatible between versions. Finally, the developer experience should be improved for both source builds and configuring the conda-forge build, which can unpin its PyArrow dependency.

Testing

Perspective's WASM build has built and used the minimal version of Arrow since late 2019, and perspective-python's test suite contains a litany of tests that utilize PyArrow to generate/ingest arrow binaries to and from Perspective. This PR updates the Manylinux Dockerfiles and Python test suite to use PyArrow v2.0.0 (except for Python 2, which continues to use v0.16.0). This in effect gives us full coverage over forwards and backwards compatibility between our static Arrow pinned to v.1.0.1 and the versions of PyArrow used by the tests.

TODO:

  • Remove PyArrow maintenance code/requirements from setup.py, pyproject.toml etc.
  • Test that Perspective works with the latest version of PyArrow, and update the test suite to use latest Arrow
  • Remove PyArrow custom installs from all Dockerfiles, and regenerate Docker images
  • Clean up CMake code around PyArrow

@sc1f sc1f added breaking internal Internal refactoring and code quality improvement Python and removed cla-present labels Jan 12, 2021
@sc1f sc1f self-assigned this Jan 12, 2021
@texodus texodus changed the title Unlink PyArrow from perspective-python Statically link perspective-python and libarrow Jan 13, 2021
@sc1f sc1f marked this pull request as draft January 13, 2021 19:20
@sc1f sc1f force-pushed the unlink-pyarrow branch 5 times, most recently from 74e74b8 to 938421e Compare January 15, 2021 16:44
@sc1f sc1f removed the breaking label Jan 15, 2021
WIP windows flatbuffers

Install flatbuffers using vcpkg

link against rt for python2 manylinux

use vcpkg toolchain file

WIP

why not try chocolatey
pull in flatbuffers as an external dep

wip

fix windows build

nocopy arrow dlls
@sc1f
Copy link
Contributor Author

sc1f commented Jan 15, 2021

@texodus this PR now has the changes from #1215 in it, so once this is ready to merge we can close #1215.

@sc1f sc1f marked this pull request as ready for review January 15, 2021 23:10
@sc1f
Copy link
Contributor Author

sc1f commented Jan 15, 2021

@texodus tests now run against PyArrow 2.0.0, checked locally and on Azure Mac/Windows jobs. Just need to regen the docker images and should be all clear.

@timkpaine would you be able to pull down this branch and take a look if you have time? I don't have a Windows machine to verify, but the build should work as long as you have flatbuffers installed before build.

@timkpaine
Copy link
Member

timkpaine commented Jan 15, 2021

Need to check that an arrow passed in from python will be binary compatible with the version we built against, so expose the arrow api version into python and check against whatever attribute is there on the python side, otherwise it will likely just seg fault when the arrow versions are incompatible

right now if arrow versions are incompatible, you get the error message that disables the c++ code, so something similar but just for arrow

@texodus
Copy link
Member

texodus commented Jan 17, 2021

@timkpaine Thanks for the feedback! I don't think this is necessarily an issue, since

  • None of the logic of how we use pyarrow internally has changed at all, and this PR does not modify any .py files.
  • The arrow::Table load logic used in C++ does version validation (though these don't yet work well in js, see Async Table and View Constructor #1289).
  • The format is backwards-compatible (?) from 0.16
  • Adding pyarrow to check the version from the Python-side, re-introduces the exact dependency on pyarrow that this PR seeks to remove.

I dug up a very old superstore.arrow from Perspective 0.1.0 on unpkg here, and it throws an exception with a not super helpful error message but does not segfault:

Failed to open RecordBatchStreamReader: Invalid: Expected to read 1329865020 metadata bytes, but only read 36639

I concur this needs work, lets aim to improve the quality of our error reporting in a PR which targets this specifically.

# Chocolatey for our Azure Windows job.
if (WIN32)
psp_build_dep("flatbuffers" "${PSP_CMAKE_MODULE_PATH}/flatbuffers.txt.in")
endif()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do the flatbuffers headers need to be available only on Windows?

you have installed. To disable this, pass the `--no-build-isolation` flag to
pip.

- Flatbuffers not installed prior to installing Perspective
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a newline, markdown is picky sometimes

@texodus
Copy link
Member

texodus commented Jan 17, 2021

Looks good! Thanks for the PR!

@texodus texodus merged commit 1725f48 into master Jan 17, 2021
@texodus texodus deleted the unlink-pyarrow branch January 17, 2021 13:04
@timkpaine
Copy link
Member

I dug up a very old superstore.arrow from Perspective 0.1.0 on unpkg here, and it throws an exception with a not super helpful error message but does not segfault:

Failed to open RecordBatchStreamReader: Invalid: Expected to read 1329865020 metadata bytes, but only read 36639

I concur this needs work, lets aim to improve the quality of our error reporting in a PR which targets this specifically.

So long as it fails in python and not in c++ (e.g. no segfault, user can handle in python) its "fine".

Adding pyarrow to check the version from the Python-side, re-introduces the exact dependency on pyarrow that this PR seeks to remove.

Note we won't need to re-add libarrow_python, we just need to expose the version of arrow that perspective statically linked against into python and we can do the check purely in python at load time. If you're passing in a pyarrow.Table (e.g. pyarrow in sys.modules) we can do a version check. Right now we treat any bytes as arrow which should probably have these types of checks anyway. I assume the version is encoded in the table binary in some form as well, so we can even use that

timkpaine added a commit that referenced this pull request Jan 17, 2021
@sc1f
Copy link
Contributor Author

sc1f commented Jan 17, 2021

@sc1f this also seems to break Mac or linux when lib binding looks for libpsp

I think I might have removed the $ORIGIN from the rpath setting when I was cutting out the old pyarrow - will take a look and fix.

@timkpaine
Copy link
Member

I've fixed on my jlab 3 PR

@texodus
Copy link
Member

texodus commented Jan 18, 2021

@timkpaine pyarrow is not imported by perspective-python, and perspective.Table() doesn't support pyarrow.Table as a constructor argument (unless there is some Python coercion coincidence I am unaware of ..). As I stated above, arrow::Table does validation on blobs, as it should since we don't even know what ranges to validate without reading the Arrow docs like savages.

Just to re-emphasize - this PR changes no python, it only removes the dependency shared libarrow which came packaged with pyarrow (e.g. the non-Python part). Or as stated in the 3rd paragraph of the PR:

At present, Perspective does not import pyarrow or use any of Arrow's Python-specific C++ code. If we were to use PyArrow in the Python runtime, however (like in the test suite), Perspective's C++ version of Arrow will continue to work as expected—perspective-python will use its own version of Arrow, and pyarrow will use the libarrow.so that shipped with the PyArrow install. This means further integration with PyArrow is not blocked/degraded, and if anything should be accelerated as we now have a consistent Arrow version across all Perspective runtimes.

@timkpaine
Copy link
Member

So long as the arrow validation doesn't trigger un catchable errors in python, it should be a straight improvement.

If you're shoving an arrow into perspective in python, you're probably using pyarrow. Right now we go pyarrow table -> byte string copy -> perspective (2 copies total). We can no longer do pyarrow table -> no copy arrow table -> perspective (1 copy total), but I think it's ok given the better install experience (and we never implemented it anyway).

So the only negative is that if you're shoving an arrow from your version of pyarrow into our version of arrow and it's incompatible, you get an obscure error. We should make it a PerspectiveCPPException or whatever we called it.

@texodus
Copy link
Member

texodus commented Jan 18, 2021

#1157

@timkpaine
Copy link
Member

you mean this comment? #1157 (comment)

@sc1f
Copy link
Contributor Author

sc1f commented Jan 19, 2021

So the only negative is that if you're shoving an arrow from your version of pyarrow into our version of arrow and it's incompatible, you get an obscure error. We should make it a PerspectiveCPPException or whatever we called it.

IIRC it should be a PerspectiveCppException already, although I agree the "Expected to read..." error is a little obscure. We have more detailed error messages when we check the status of our arrow reads during the load, so it reports more verbose messages to tell you which columns failed, whether it could read the recordbatch, etc. It seems like this "Expected to read..." error is coming from the Arrow C++ lib itself, so it would be harder to transform that message.

I saw the fix for rpath on #1294 - I definitely was heavy-handed in removing PyArrow-related code from our CMakeLists, and cut out the "set rpath to $ORIGIN" block as well. Thanks for the fix!

@timkpaine
Copy link
Member

Yep because we expose PerspectiveCppException into python you can try: except it, whereas an error out of Arrow C++ is not going to be exceptable. Worse case we just try:catch in C++ and in the catch handler bubble it up into python

@sc1f
Copy link
Contributor Author

sc1f commented Jan 19, 2021

My recollection of how we set up PyBind is that all exceptions from C++ get raised as PerspectiveCppException in Python - let me get the old version of superstore.arrow from @texodus. I tried the oldest version I could find from May 2019 and it worked fine.

zepaz pushed a commit to zepaz/perspective that referenced this pull request Jan 22, 2021
zepaz added a commit to zepaz/perspective that referenced this pull request Jan 22, 2021
zepaz added a commit to zepaz/perspective that referenced this pull request Jan 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal Internal refactoring and code quality improvement Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants