ARM64 CentOS7 compatibility issues with djl/pytorch due to glibc requirements #2563

Open
peterzhuamazon opened this issue Jun 17, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@peterzhuamazon
Member

peterzhuamazon commented Jun 17, 2024

We are seeing failures in ml-commons on arm64, where a PyTorch-related library requires glibc >= 2.18:

/opt/java/openjdk-21/bin/java: relocation error: /tmp/tmpfv4oghde/1/local-test-cluster/opensearch-2.15.0/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-aarch64/libstdc++.so.6: symbol __cxa_thread_atexit_impl, version GLIBC_2.18 not defined in file libc.so.6 with link time reference

https://ci.opensearch.org/ci/dbc/integ-test/2.15.0/9970/linux/arm64/tar/test-results/8297/integ-test/neural-search/without-security/local-cluster-logs/id-1/stderr.txt
https://ci.opensearch.org/ci/dbc/integ-test/2.15.0/9970/linux/arm64/tar/test-results/8297/integ-test/neural-search/without-security/stderr.txt

Note that we use CentOS7 to build and test OpenSearch plugins, and CentOS7 ships glibc 2.17.
This issue causes the cluster to crash, which leaves integTest stuck midway with a connection reset.
As a result, the tests for ml-commons and ML-related plugins such as neural-search and flow-framework fail.
Tracing the logs back, this has been an issue on the arm64 TAR distribution since 2.12.
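
The mismatch described here comes down to a version comparison: the bundled libstdc++.so.6 references a symbol versioned GLIBC_2.18, while the host provides glibc 2.17. A minimal shell sketch of that check follows. The function name `glibc_satisfies` is hypothetical, and the sketch assumes GNU `sort` with `-V` (version sort) is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the host glibc version ($1) is at least
# the version a library requires ($2). Relies on GNU sort's -V (version sort).
glibc_satisfies() {
    local host="$1" required="$2"
    # The lower of the two versions sorts first; the requirement is met
    # when that lower version is the required one.
    [ "$(printf '%s\n%s\n' "$host" "$required" | sort -V | head -n 1)" = "$required" ]
}

# Values taken from this issue: CentOS7 ships glibc 2.17, the library needs 2.18.
if glibc_satisfies "2.17" "2.18"; then
    echo "compatible"
else
    echo "incompatible"
fi
```

On a real host, the first argument could be taken from `getconf GNU_LIBC_VERSION`, which prints a line such as `glibc 2.17`.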

CentOS7 will be deprecated on 06/30, and this should not be a problem on AL2, as AL2 has glibc 2.28.

We will switch to AL2 on 2.16 anyway due to k-NN. opensearch-project/opensearch-build#4379

Note: This has affected ML, Flow-Framework, Neural-Search.

Thanks.

@peterzhuamazon peterzhuamazon added the v2.16.0 Issues targeting release v2.16.0 label Jun 17, 2024
@peterzhuamazon
Member Author

http://djl.ai/engines/pytorch/pytorch-engine/#for-pre-cxx11-build
The DJL documentation suggests that CentOS7 is supported, but the library requires glibc 2.18, which is not available on CentOS7.

Thanks.

@peterzhuamazon
Member Author

peterzhuamazon commented Jun 18, 2024

Hi @ylwu-amzn, could you help check whether this is a test-time-only issue, or also a runtime issue when deploying actual models?

Thanks!

@peterzhuamazon
Member Author

peterzhuamazon commented Jun 18, 2024

After extensive debugging with the DJL team and the ML team, we have found a potential workaround:

DJL 0.28 allows overriding the libstdc++ path with LIBSTDCXX_LIBRARY_PATH, which is not available in 0.21. However, bumping everything to 0.28 causes another glibc issue, because tokenizers 0.28 does not support glibc 2.17:

Caused by: java.lang.UnsatisfiedLinkError: /home/ci-runner/opensearch-2.15.0/data/ml_cache/tokenizers/0.19.1-0.28.0-linux-aarch64/libtokenizers.so: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by /home/ci-runner/opensearch-2.15.0/data/ml_cache/tokenizers/0.19.1-0.28.0-linux-aarch64/libtokenizers.so)
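
To see in advance which glibc version a native library needs, one option is to scan its dynamic symbol table for the highest `GLIBC_x.y` version it references. The sketch below only does the text-processing part. `max_glibc_required` is a hypothetical helper, the sample input is illustrative rather than real output from libtokenizers.so, and GNU `grep`, `sed`, and `sort -V` are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads text on stdin (for example the output of
# `objdump -T some-library.so`) and prints the highest GLIBC_x.y version seen.
max_glibc_required() {
    grep -oE 'GLIBC_[0-9]+(\.[0-9]+)+' \
        | sed 's/^GLIBC_//' \
        | sort -u -V \
        | tail -n 1
}

# Illustrative input only, not real output from libtokenizers.so.
sample='0000000000000000      DF *UND*  0000000000000000  GLIBC_2.17  memcpy
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.25  getrandom
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.18  __cxa_thread_atexit_impl'

printf '%s\n' "$sample" | max_glibc_required   # prints 2.25
```

The same function also works on a loader error message, since it only looks for `GLIBC_x.y` tokens in whatever text it is given.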

The solution is to pin DJL to 0.28 while pinning the tokenizers library to 0.21:


    implementation platform("ai.djl:bom:0.28.0")
    implementation group: 'ai.djl.pytorch', name: 'pytorch-model-zoo', version: '0.28.0'
    implementation("ai.djl:api:0.28.0!!")
    implementation("ai.djl.huggingface:tokenizers:0.21.0!!")

Running the integration test with the libstdc++ override then succeeds:

    env LIBSTDCXX_LIBRARY_PATH=/usr/lib64/libstdc++.so.6 ./gradlew integTest --tests "org.opensearch.ml.rest.RestMLDeployModelActionIT.testReDeployModel" -Dtests.seed=9B03622482185229 -Dopensearch.version=2.15.0 -Dbuild.snapshot=false -Dtests.rest.cluster="localhost:9200" -Dtests.cluster="localhost:9200" -Dtests.clustername="opensearch" -Dhttps=true -Duser=admin -Dpassword=myStrongPassword123! --console=plain


> Task :opensearch-ml-plugin:integTest
Jun 18, 2024 1:53:34 AM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

org.opensearch.ml.rest.RestMLDeployModelActionIT > testReDeployModel STANDARD_OUT
    [2024-06-17T18:53:37,076][INFO ][o.o.m.r.RestMLDeployModelActionIT] [testReDeployModel] before test
    [2024-06-17T18:53:37,305][INFO ][o.o.m.r.RestMLDeployModelActionIT] [testReDeployModel] initializing REST clients against [https://localhost:9200]
    [2024-06-17T18:53:47,815][INFO ][o.o.m.r.RestMLDeployModelActionIT] [testReDeployModel] Re-Deploy model {model_id=3rIMKZABJulLC2KPvlPA, task_type=DEPLOY_MODEL, function_name=TEXT_EMBEDDING, state=CREATED, worker_node=[jeC7OP6XSwm2iLz1xGGALQ], create_time=1.718675627779E12, last_update_time=1.718675627779E12, is_async=true}
    [2024-06-17T18:53:48,101][INFO ][o.o.m.r.RestMLDeployModelActionIT] [testReDeployModel] Get Model after re-deploy {name=test_model_name, model_group_id=3LIMKZABJulLC2KPvVMr, algorithm=TEXT_EMBEDDING, model_version=1, model_format=TORCH_SCRIPT, model_state=DEPLOYED, model_content_size_in_bytes=4554671.0, model_content_hash_value=e13b74006290a9d0f58c1376f9629d4ebc05a0f9385f40db837452b167ae9021, model_config={model_type=bert, embedding_dimension=768.0, framework_type=SENTENCE_TRANSFORMERS}, created_time=1.718675619412E12, last_updated_time=1.718675628048E12, last_registered_time=1.718675620837E12, last_deployed_time=1.718675628048E12, auto_redeploy_retry_times=0.0, total_chunks=1.0, planning_worker_node_count=1.0, current_worker_node_count=1.0, planning_worker_nodes=[jeC7OP6XSwm2iLz1xGGALQ], deploy_to_all_nodes=true, is_hidden=false}
    [2024-06-17T18:53:48,842][INFO ][o.o.m.r.RestMLDeployModelActionIT] [testReDeployModel] after test

> Task :opensearch-ml-spi:compileTestJava NO-SOURCE
> Task :opensearch-ml-spi:processTestResources NO-SOURCE
> Task :opensearch-ml-spi:testClasses UP-TO-DATE
> Task :opensearch-ml-spi:test NO-SOURCE
> Task :opensearch-ml-spi:integTest NO-SOURCE

Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

For more on this, please refer to https://docs.gradle.org/8.4/userguide/command_line_interface.html#sec:command_line_warnings in the Gradle documentation.

BUILD SUCCESSFUL in 31s
24 actionable tasks: 1 executed, 23 up-to-date
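
The `LIBSTDCXX_LIBRARY_PATH` override used in the command above points DJL at the system libstdc++. A small sketch of applying it only when that file is actually present is shown below. `pick_libstdcxx` is a hypothetical helper, and the candidate path is simply the one used in this comment; other hosts may keep the library elsewhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the first candidate path that exists as a file,
# or returns non-zero if none of them does.
pick_libstdcxx() {
    local candidate
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# /usr/lib64/libstdc++.so.6 is the path from the command above (an assumption
# for any other host).
if path="$(pick_libstdcxx /usr/lib64/libstdc++.so.6)"; then
    export LIBSTDCXX_LIBRARY_PATH="$path"
    echo "LIBSTDCXX_LIBRARY_PATH=$LIBSTDCXX_LIBRARY_PATH"
else
    echo "system libstdc++ not found; leaving LIBSTDCXX_LIBRARY_PATH unset"
fi
```

Exporting the variable before invoking `./gradlew` has the same effect as the `env ...` prefix in the original command.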

Thanks.

@ylwu-amzn
Collaborator

ylwu-amzn commented Jun 18, 2024

The downside of the workaround is that we would use mixed versions, which adds maintenance effort. Considering that this issue has existed for a long time without anyone reporting it, and that CentOS7 will be deprecated, I would suggest not adding such a workaround.

According to @peterzhuamazon, CentOS7 x64 passed; only ARM64 failed.

@rbhavna rbhavna moved this to In Progress in ml-commons projects Jun 18, 2024
@b4sjoo b4sjoo added bug Something isn't working and removed v2.16.0 Issues targeting release v2.16.0 labels Jul 27, 2024