Pyarrow Segfault Fix #7568

Merged
merged 13 commits into from
Mar 29, 2020

ijrsvt
Contributor

@ijrsvt ijrsvt commented Mar 11, 2020

Why are these changes needed?

Ray will segfault if pyarrow is imported before it because exported symbols collide. This fixes that bug and adds a test to ensure that future changes do not reintroduce it.

Related issue number

Fixes Issue #7393

Checks

@AmplabJenkins

Can one of the admins verify this patch?

@ijrsvt ijrsvt requested a review from pcmoritz March 11, 2020 22:33
@pcmoritz
Contributor

pcmoritz commented Mar 11, 2020

The test script might be a little bit too clever (calling itself). Have you considered doing python -c "import ray; import pyarrow" and vice versa?

We should also change the comments that contain RTLD_GLOBAL so they are in sync with the code.

Other than that it looks good to me!

@raulchen Can you check if there is any impact on the streaming system?

Contributor

@edoakes edoakes left a comment

Nice!

@@ -37,7 +37,7 @@
 if os.path.exists(so_path):
     import ctypes
     from ctypes import CDLL
-    CDLL(so_path, ctypes.RTLD_GLOBAL)
+    CDLL(so_path, ctypes.RTLD_LOCAL)
Contributor

Should probably leave a comment here pointing to the issue this fixed

Contributor Author

Definitely will add!

Member

@chaokunyang chaokunyang Mar 13, 2020

@ijrsvt Could you give an example of which symbols conflict? We use the linker scripts ray_exported_symbols.lds and ray_version_script.lds to limit the exported symbols, and we use RTLD_GLOBAL on purpose so that _streaming.so can use symbols in _raylet.so.
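
For readers following the RTLD_GLOBAL versus RTLD_LOCAL discussion, here is a minimal sketch of how the two modes are passed to ctypes. The helper names `dlopen_mode` and `load_library` are hypothetical and not part of Ray; only `ctypes.CDLL`, `ctypes.RTLD_GLOBAL`, and `ctypes.RTLD_LOCAL` are real standard-library names.

```python
import ctypes


def dlopen_mode(share_symbols):
    """Pick the mode flag that ctypes.CDLL should receive.

    RTLD_GLOBAL makes the library's symbols available for resolving
    symbols in libraries loaded later; RTLD_LOCAL keeps them private
    to the returned handle.
    """
    return ctypes.RTLD_GLOBAL if share_symbols else ctypes.RTLD_LOCAL


def load_library(so_path, share_symbols):
    # Hypothetical helper (an assumption for illustration, not Ray's code):
    # load a shared object with the chosen symbol visibility.
    return ctypes.CDLL(so_path, mode=dlopen_mode(share_symbols))
```

Under this sketch, a library that must expose symbols to a later-loaded extension would be loaded with `share_symbols=True`, at the cost of those symbols also being visible to unrelated libraries loaded afterwards.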

Contributor Author

@ijrsvt ijrsvt Mar 13, 2020

The conflicting symbols were in the gRPC library. The problem only showed up on Linux, and only when pyarrow was imported before ray. The specific symbol that was segfaulting was google::protobuf::internal::AssignDescriptors.

Member

Can you give a more complete, detailed log for this symbol segfault? I think we didn't expose protobuf symbols in ray_exported_symbols.lds/ray_version_script.lds. Also, this import-order issue existed before RTLD_GLOBAL.

Contributor Author

@ijrsvt ijrsvt Mar 16, 2020

It is here. It appears to be an issue with a conflicting symbol from gRPC. A similar issue is here.

Contributor

@chaokunyang Can you look more into this and why the symbols are conflicting despite the version linker script? This is blocking the current release I believe.

Member

@pcmoritz I'm looking into it. Some symbols, such as templates, may be exported unexpectedly despite the version linker script. If that's the reason, maybe we can use __attribute__((visibility(....))). If not, we'll need more time for this.

ci/travis/check_symbol_collisions.py (outdated, resolved)
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23049/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23050/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23064/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23106/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23111/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23117/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23119/
Test FAILed.

@chaokunyang
Member

Hi @ijrsvt, I used Python 3.7.5, ray 0.8.2, and pyarrow 0.16.0 in an Ubuntu 19.10 Docker container, and I can't reproduce your issue.
[image: 20200325120420]
Could you check again that the problem isn't caused by another library?

I also searched the symbols in _raylet.so using nm -g _raylet.so | grep AssignDescriptors, but didn't find any symbols related to AssignDescriptors.

I tested ray 0.8.2 with pyarrow 0.16.0 and 0.15.0, and both work well. But with pyarrow 0.14.0 there is a segmentation fault. I think it's an issue in pyarrow?
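
The symbol search described above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the sample `nm` output is made up for illustration (including the mangled name), and unlike a plain grep it skips undefined (`U`) entries so that only symbols the object itself defines are reported.

```python
import subprocess


def parse_nm_symbols(nm_output):
    """Return (type_letter, name) pairs from the text output of `nm`."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # blank lines or "file:" headers
        # Undefined symbols have no address column, so index from the end.
        symbols.append((parts[-2], parts[-1]))
    return symbols


def defined_symbols_matching(nm_output, needle):
    """Names containing `needle` that the object defines (type is not 'U')."""
    return [name for kind, name in parse_nm_symbols(nm_output)
            if kind != "U" and needle in name]


def nm_global_symbols(so_path):
    # Same command as in the comment above: nm -g <file>.
    return subprocess.check_output(["nm", "-g", so_path],
                                   universal_newlines=True)


# Hypothetical nm output, invented for this example only.
SAMPLE = """\
0000000000001130 T PyInit__raylet
                 U _ZN6google8protobuf8internal17AssignDescriptorsEv
0000000000004010 B some_global
"""
```

With a real library, `defined_symbols_matching(nm_global_symbols("_raylet.so"), "AssignDescriptors")` would return an empty list if the symbol is not exported, which matches what the comment reports.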

@ijrsvt
Contributor Author

ijrsvt commented Mar 27, 2020

@chaokunyang I can't reproduce it either (and neither can the original bug reporter). I reverted the RTLD_LOCAL change.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23816/
Test PASSed.

@pcmoritz pcmoritz merged commit 57599f0 into ray-project:master Mar 29, 2020
@pcmoritz
Contributor

@ijrsvt After merging the PR it is causing an error, can you look at this?

pcmoritz added a commit that referenced this pull request Mar 30, 2020
"""
import subprocess

TESTED_LIBRARIES = ["pyarrow"]
Collaborator

I'm mildly surprised that this test passes in our CI. Is pyarrow installed in our CI?

Contributor Author

This doesn't fail if pyarrow is installed, only if there is some strange symbol collision. There was initially a fix to limit the scope of exported symbols, but that was removed after the problem became unreproducible.

Collaborator

Right, I agree it doesn't fail if pyarrow is installed. I just thought that pyarrow isn't installed in our Travis tests.

Contributor Author

Ohh okay! Let me check that.

Contributor Author

I added pyarrow to the dependencies. It looks like the error is a segfault again :/
https://travis-ci.com/github/ray-project/ray/jobs/308220125

Member

Is this pyarrow version 0.14?

Contributor Author

No, it is 0.16. I added the version to the debug info, so it shows up in this test.

robertnishihara pushed a commit that referenced this pull request Mar 30, 2020


def test_imports():
def try_imports(library1, library2):
Collaborator

@ijrsvt A slightly simpler way to do this (and one that I think will give a better error message):

subprocess.check_output(["python", "-c", "import {}; import {}".format(library1, library2)]) 
subprocess.check_output(["python", "-c", "import {}; import {}".format(library2, library1)]) 

Contributor Author

Ahh okay. I'll switch to this!
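
A fuller sketch of the import-order check suggested above might look like the following. It is an illustration under assumptions, not the code that was merged: the function names are hypothetical, and it uses `sys.executable` rather than a bare `python` so the child runs under the same interpreter. It relies on the documented POSIX behavior that `subprocess` reports a child killed by signal N as return code -N, which is how a segfault would show up.

```python
import signal
import subprocess
import sys


def try_import_order(first, second):
    """Import two modules in a fresh interpreter; return the exit status."""
    code = "import {}; import {}".format(first, second)
    return subprocess.call([sys.executable, "-c", code])


def describe_status(returncode):
    # On POSIX, death by signal N is reported as returncode -N.
    if returncode == 0:
        return "ok"
    if returncode < 0:
        return "killed by {}".format(signal.Signals(-returncode).name)
    return "exited with status {}".format(returncode)


def check_both_orders(library1, library2):
    """Run both import orders and map each order to a readable result."""
    orders = [(library1, library2), (library2, library1)]
    return {order: describe_status(try_import_order(*order))
            for order in orders}
```

For example, `check_both_orders("ray", "pyarrow")` would return one entry per import order, so a crash in only one order is visible at a glance.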

ijrsvt added a commit to ijrsvt/ray that referenced this pull request Mar 30, 2020
6 participants