[CI] DebMetadataTests test05CheckLintian failing #85691

Closed
albertzaharovits opened this issue Apr 5, 2022 · 7 comments · Fixed by elastic/ml-cpp#2255
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@albertzaharovits
Contributor

albertzaharovits commented Apr 5, 2022

Build scan:
https://gradle-enterprise.elastic.co/s/hx2aguhh5kmfi/tests/:qa:os:destructiveDistroTest.default-deb/org.elasticsearch.packaging.test.DebMetadataTests/test05CheckLintian

Reproduction line:
null

Applicable branches:
ONLY main

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.packaging.test.DebMetadataTests&tests.test=test05CheckLintian

Failure excerpt:

org.elasticsearch.packaging.util.Shell$ShellException: Command was not successful: [bash -c lintian --fail-on-warnings /var/lib/jenkins/workspace/elastic+elasticsearch+main+multijob+packaging-tests-unix/os/ubuntu-18.04-packaging/distribution/packages/deb/build/distributions/elasticsearch-8.3.0-SNAPSHOT-amd64.deb]
   result: exitCode = [1] stdout = [W: elasticsearch: shlib-with-executable-stack usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/lib/libtorch_cpu.so
N: 403 tags overridden (189 errors, 171 warnings, 43 info)] stderr = [warning: the authors of lintian do not recommend running it with root privileges!
warning: --fail-on-warnings is deprecated]

  at __randomizedtesting.SeedInfo.seed([E4AA662FB0B3F8ED:2454FAC30AC79D56]:0)
  at org.elasticsearch.packaging.util.Shell.runScript(Shell.java:143)
  at org.elasticsearch.packaging.util.Shell.run(Shell.java:73)
  at org.elasticsearch.packaging.test.DebMetadataTests.test05CheckLintian(DebMetadataTests.java:38)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
  at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)
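The stdout in the excerpt above carries one real tag line (`W: elasticsearch: shlib-with-executable-stack ...`) and one summary note (`N: ...`). As a rough sketch of how those could be separated when triaging a failure like this, the snippet below parses lintian output into tuples. The `severity: package: tag info` line shape is inferred from the excerpt, and `parse_lintian_tags` is a hypothetical helper name, not part of the Elasticsearch packaging tests.

```python
import re

# Assumed line shape, taken from the excerpt: "W: <package>: <tag> <extra info>".
# Summary notes such as "N: 403 tags overridden (...)" have no second
# "name: " field and therefore do not match.
LINTIAN_TAG = re.compile(
    r"^(?P<severity>[A-Z]): (?P<package>[^:\s]+): (?P<tag>\S+)(?: (?P<info>.*))?$"
)


def parse_lintian_tags(stdout: str):
    """Return (severity, package, tag, info) tuples for each tag line."""
    tags = []
    for line in stdout.splitlines():
        match = LINTIAN_TAG.match(line.strip())
        if match:
            tags.append(
                (match["severity"], match["package"], match["tag"], match["info"] or "")
            )
    return tags
```

Feeding it the stdout from this failure would yield a single entry for `shlib-with-executable-stack`, pointing at `libtorch_cpu.so`.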

@albertzaharovits albertzaharovits added :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts >test-failure Triaged test failures from CI labels Apr 5, 2022
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Apr 5, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@albertzaharovits
Contributor Author

FWIW: happens on ubuntu-18.04, debian10, and debian11

elasticsearchmachine pushed a commit that referenced this issue Apr 5, 2022
@droberts195 droberts195 added :ml Machine learning and removed :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts labels Apr 7, 2022
@elasticmachine elasticmachine added Team:ML Meta label for the ML team and removed Team:Delivery Meta label for Delivery team labels Apr 7, 2022
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor

This is happening as a result of upgrading PyTorch from version 1.9 to version 1.11 in elastic/ml-cpp#2238.

On x86_64 the library libtorch_cpu.so really does have an executable stack now:

$ readelf -a libtorch_cpu.so | grep -C 2 GNU_STACK
  GNU_EH_FRAME   0x0000000006ff0434 0x0000000006ff0434 0x0000000006ff0434
                 0x0000000000183bcc 0x0000000000183bcc  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RWE    0x10
  GNU_RELRO      0x0000000007db4400 0x0000000007db5400 0x0000000007db5400

(It's the E in RWE.)

This is not the case on aarch64:

$ readelf -a libtorch_cpu.so | grep -C 2 GNU_STACK
  GNU_EH_FRAME   0x0000000003ddb188 0x0000000003ddb188 0x0000000003ddb188
                 0x00000000000fa60c 0x00000000000fa60c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000004722370 0x0000000004732370 0x0000000004732370
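The check in the two readelf listings above can also be scripted. The following is a minimal sketch, assuming a 64-bit little-endian ELF file (which covers both x86_64 and aarch64 Linux builds). It reads the program headers directly and reports whether the GNU_STACK segment has the execute bit set. The function names `gnu_stack_flags` and `has_executable_stack` are made up for illustration.

```python
import struct

PT_GNU_STACK = 0x6474E551  # program header type of the GNU_STACK segment
PF_X = 0x1                 # execute bit in p_flags (the "E" in "RWE")


def gnu_stack_flags(data: bytes):
    """Return p_flags of the GNU_STACK program header, or None if absent.

    Sketch only: handles 64-bit little-endian ELF images.
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("only 64-bit little-endian ELF is handled here")
    # ELF64 header: e_phoff at byte 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    for index in range(e_phnum):
        # Each ELF64 program header starts with p_type then p_flags.
        p_type, p_flags = struct.unpack_from("<II", data, e_phoff + index * e_phentsize)
        if p_type == PT_GNU_STACK:
            return p_flags
    return None


def has_executable_stack(data: bytes) -> bool:
    """True if a GNU_STACK header is present and marked executable.

    A file with no GNU_STACK header at all returns False here; how the
    kernel treats that case is outside the scope of this sketch.
    """
    flags = gnu_stack_flags(data)
    return flags is not None and bool(flags & PF_X)
```

It could be called as `has_executable_stack(open("libtorch_cpu.so", "rb").read())`. For a library of this size, reading only the header region instead of the whole file would be kinder on memory.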

There have been other cases in the past where PyTorch has needed to make changes to avoid an executable stack: pytorch/pytorch@639d1c7 and pytorch/pytorch@0e9613c

We need to find out which source file within PyTorch 1.11 is causing this, and whether it's a mistake or deliberate this time.

PyTorch 1.11 is only in the main branch, so we have until the 8.3.0 feature freeze to figure this out, or else revert to PyTorch 1.9.

@droberts195
Contributor

droberts195 commented Apr 7, 2022

The same problem exists in the pre-built binary that you can download from https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.11.0%2Bcpu.zip.

$ readelf -a libtorch_cpu.so | grep -C 2 GNU_STACK
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000165f8000 0x00000000165f8000  R E    0x1000
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RWE    0x10
  GNU_EH_FRAME   0x00000000163f3e3c 0x00000000163f3e3c 0x00000000163f3e3c

So it's not that we made a silly mistake when building it from source. The official download for 1.11 also has an executable stack.

I also downloaded https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu.zip and https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.9.0%2Bcpu.zip and found that 1.10 also has an executable stack but 1.9 doesn't. So it was a change between 1.9 and 1.10 that caused this.

@droberts195 droberts195 self-assigned this Apr 11, 2022
droberts195 added a commit to droberts195/ml-cpp that referenced this issue Apr 11, 2022
Breakpad causes the libtorch_cpu.so library to have an
executable stack, which is undesirable.

Fixes elastic/elasticsearch#85691
@droberts195
Contributor

A bit of detective work using readelf shows which PyTorch object file is responsible: build/third_party/breakpad/CMakeFiles/breakpad_common.dir/src/common/linux/breakpad_getcontext.S.o doesn't contain a .note.GNU-stack section, and that is what results in an executable stack.
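For anyone repeating this kind of search, here is one way it could be automated. This is a sketch under assumptions, not the exact procedure used above. It walks a build directory, runs `readelf -S -W` on every `.o` file, and lists those whose section table has no `.note.GNU-stack` entry. The helper names and the `build` directory argument are hypothetical.

```python
import subprocess
from pathlib import Path


def lacks_gnu_stack_note(section_listing: str) -> bool:
    """True if a `readelf -S` section listing has no .note.GNU-stack entry."""
    return ".note.GNU-stack" not in section_listing


def objects_without_stack_note(build_dir: str):
    """Yield object files under build_dir that lack a .note.GNU-stack section."""
    for obj in sorted(Path(build_dir).rglob("*.o")):
        # -S prints the section headers; -W keeps each header on one line.
        listing = subprocess.run(
            ["readelf", "-S", "-W", str(obj)],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
        if lacks_gnu_stack_note(listing):
            yield obj
```

Something like `for obj in objects_without_stack_note("build"): print(obj)` would then print the candidates. Hand-written assembly files are the usual suspects, since compilers normally emit the section for C and C++ sources on their own.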

Breakpad was added in pytorch/pytorch#63186, which explains why this problem is in 1.10 but not 1.9.

Also, the fact that Breakpad doesn't compile on AArch64 explains why we don't get the executable stack problem on that architecture. I had to exclude it from the build there - see pytorch/pytorch#67083 and https://github.com/elastic/ml-cpp/pull/2238/files#diff-88435d0cdd0bfef440a7483f80bc6e75cba43032104f6cd5332049817e837c10R328.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Apr 11, 2022
The test was muted due to elastic#85691.

The underlying problem should be fixed by
elastic/ml-cpp#2255.

This PR needs to be merged after the new artifacts
created from merging elastic/ml-cpp#2255 have been
uploaded to S3.
droberts195 added a commit to elastic/ml-cpp that referenced this issue Apr 11, 2022
Breakpad causes the libtorch_cpu.so library to have an
executable stack, which is undesirable.

Fixes elastic/elasticsearch#85691
droberts195 added a commit that referenced this issue Apr 12, 2022
The test was muted due to #85691.

The underlying problem should be fixed by
elastic/ml-cpp#2255.

This PR needs to be merged after the new artifacts
created from merging elastic/ml-cpp#2255 have been
uploaded to S3.