
[BUG] Using a GraphScope session to deploy on k8s fails when num_workers=2 #2479

Closed
JackyYangPassion opened this issue Mar 1, 2023 · 9 comments


@JackyYangPassion
Contributor

JackyYangPassion commented Mar 1, 2023

Describe the bug
Using a GraphScope session to deploy on k8s fails when num_workers=2.

To Reproduce

session = graphscope.session(
    k8s_coordinator_cpu=1,
    k8s_coordinator_mem="1Gi",
    k8s_vineyard_cpu=0.2,
    k8s_vineyard_mem="1Gi",
    vineyard_shared_mem="2Gi",
    k8s_engine_cpu=0.2,
    k8s_engine_mem="1Gi",
    num_workers=2,
    k8s_namespace='gstest008',
    k8s_client_config={"config_file": '/usr/local/airflow/dags/k8s_config'})

Log:

An internal error has occurred in ORTE:
[[64088,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
[gs-engine-tuivgx-1:00058] [[64088,0],2] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
[gs-engine-tuivgx-0:00078] [[64088,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
[coordinator-tuivgx-568c7c78d-sgw6t:00052] 1 more process has sent help message help-errmgr-base.txt / simple-message
[coordinator-tuivgx-568c7c78d-sgw6t:00052] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2023-03-01 15:47:02,102 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 2 time
2023-03-01 15:47:02,102 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:05,106 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 3 time
2023-03-01 15:47:05,107 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:08,111 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 4 time
2023-03-01 15:47:08,111 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:11,115 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 5 time
2023-03-01 15:47:11,115 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:14,119 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 6 time
2023-03-01 15:47:14,120 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:17,124 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 7 time
2023-03-01 15:47:17,124 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:20,129 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 8 time
2023-03-01 15:47:20,130 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:23,134 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 9 time
2023-03-01 15:47:23,135 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:26,139 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 10 time
2023-03-01 15:47:26,140 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:29,144 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 11 time
2023-03-01 15:47:29,144 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:32,149 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 12 time
2023-03-01 15:47:32,149 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:35,151 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 13 time
2023-03-01 15:47:35,152 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:38,157 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 14 time
2023-03-01 15:47:38,157 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:41,162 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 15 time
2023-03-01 15:47:41,162 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:44,167 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 16 time
2023-03-01 15:47:44,168 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:47,172 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 17 time
2023-03-01 15:47:47,173 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:50,177 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 18 time
2023-03-01 15:47:50,177 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:53,179 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 19 time
2023-03-01 15:47:53,179 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:56,183 [INFO][coordinator:519]: Clean up resources, cleanup_instance: True, is_dangling: False
2023-03-01 15:47:56,183 [INFO][kubernetes_launcher:693]: Cleaning up kubernetes resources
2023-03-01 15:47:56,695 [INFO][kubernetes_launcher:731]: Kubernetes launcher stopped


Environment:

  • GraphScope version: v0.19.0
  • OS: linux
  • Kubernetes Version: 1.25.4
@siyuan0322
Collaborator

siyuan0322 commented Mar 2, 2023

What's the image version and the client version?

Also, it seems that OpenMPI failed to start. Are there more messages before "An internal error has occurred in ORTE:"?

@JackyYangPassion
Contributor Author

JackyYangPassion commented Mar 2, 2023

What's the image version and the client version?

  • client version: 0.19.0

  • image version:
    registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.19.0
    registry.cn-hongkong.aliyuncs.com/graphscope/analytical:0.19.0
    registry.cn-hongkong.aliyuncs.com/graphscope/interactive-executor:0.19.0
    registry.cn-hongkong.aliyuncs.com/graphscope/learning:0.19.0
    vineyardcloudnative/vineyardd:v0.11.7

Also, it seems that OpenMPI failed to start. Are there more messages before "An internal error has occurred in ORTE:"?

Coordinator log:

2023-03-02 02:57:43,782 [INFO][coordinator:970]: Start server with args Namespace(cluster_type='k8s', dangling_timeout_seconds=600, etcd_addrs=None, etcd_listening_client_port=2379, etcd_listening_peer_port=2380, hosts='localhost', instance_id='gtmrxh', k8s_coordinator_name='coordinator-gtmrxh', k8s_coordinator_service_name='coordinator-gtmrxh', k8s_delete_namespace=False, k8s_engine_cpu=3.0, k8s_engine_mem='2Gi', k8s_engine_pod_node_selector='', k8s_image_pull_policy='IfNotPresent', k8s_image_pull_secrets='', k8s_image_registry='registry.cn-hongkong.aliyuncs.com', k8s_image_repository='graphscope', k8s_image_tag='0.19.0', k8s_mars_scheduler_cpu=0.2, k8s_mars_scheduler_mem='2Mi', k8s_mars_worker_cpu=0.2, k8s_mars_worker_mem='4Mi', k8s_namespace='gs-new-orc', k8s_service_type='NodePort', k8s_vineyard_cpu=0.5, k8s_vineyard_daemonset=None, k8s_vineyard_image='vineyardcloudnative/vineyardd:v0.11.7', k8s_vineyard_mem='512Mi', k8s_volumes='', k8s_with_analytical=True, k8s_with_analytical_java=False, k8s_with_dataset=False, k8s_with_interactive=True, k8s_with_learning=True, k8s_with_mars=False, log_level='DEBUG', monitor=False, monitor_port=9968, num_workers=2, port=59788, preemptive=True, timeout_seconds=600, vineyard_shared_mem='2Gi', vineyard_socket=None, waiting_for_delete=False)
2023-03-02 02:57:43,783 [INFO][coordinator:972]: Coordinator server listen at 0.0.0.0:59788
2023-03-02 02:57:46,438 [INFO][kubernetes_launcher:405]: Create engine headless services...
2023-03-02 02:57:46,451 [INFO][kubernetes_launcher:410]: Creating engine pods...
2023-03-02 02:57:46,475 [INFO][kubernetes_launcher:419]: Creating frontend pods...
2023-03-02 02:57:46,492 [INFO][kubernetes_launcher:435]: Creating vineyard service...
2023-03-02 02:57:46,589 [INFO][kubernetes_launcher:472]: Waiting for services ready...
2023-03-02 02:57:46,658 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Successfully assigned gs-new-orc/gs-engine-gtmrxh-0 to k8s-worker1
2023-03-02 02:57:49,824 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/analytical:0.19.0" already present on machine
2023-03-02 02:57:49,825 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Created container engine
2023-03-02 02:57:49,827 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Started container engine
2023-03-02 02:57:49,828 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/interactive-executor:0.19.0" already present on machine
2023-03-02 02:57:49,830 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Created container executor
2023-03-02 02:57:49,831 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Started container executor
2023-03-02 02:57:49,833 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/learning:0.19.0" already present on machine
2023-03-02 02:57:49,834 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Created container learning
2023-03-02 02:57:49,835 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Started container learning
2023-03-02 02:57:49,837 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Container image "vineyardcloudnative/vineyardd:v0.11.7" already present on machine
2023-03-02 02:57:49,838 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Created container vineyard
2023-03-02 02:57:49,840 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Started container vineyard
2023-03-02 02:57:56,898 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Successfully assigned gs-new-orc/gs-engine-gtmrxh-1 to k8s-worker2
2023-03-02 02:57:56,900 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/analytical:0.19.0" already present on machine
2023-03-02 02:57:56,901 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Created container engine
2023-03-02 02:57:56,903 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Started container engine
2023-03-02 02:57:56,905 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/interactive-executor:0.19.0" already present on machine
2023-03-02 02:57:56,907 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Created container executor
2023-03-02 02:57:56,909 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Started container executor
2023-03-02 02:57:56,910 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/learning:0.19.0" already present on machine
2023-03-02 02:57:56,912 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Created container learning
2023-03-02 02:57:56,913 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Started container learning
2023-03-02 02:57:56,915 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Container image "vineyardcloudnative/vineyardd:v0.11.7" already present on machine
2023-03-02 02:57:56,916 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Created container vineyard
2023-03-02 02:57:56,918 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Started container vineyard
2023-03-02 02:57:59,961 [INFO][kubernetes_launcher:546]: GraphScope engines pod is ready.
2023-03-02 02:57:59,981 [INFO][kubernetes_launcher:547]: Engines pod name list: ['gs-engine-gtmrxh-0', 'gs-engine-gtmrxh-1']
2023-03-02 02:57:59,981 [INFO][kubernetes_launcher:548]: Engines pod ip list: ['10.244.2.179', '10.244.1.185']
2023-03-02 02:57:59,982 [INFO][kubernetes_launcher:549]: Engines pod host ip list: ['172.31.59.21', '172.31.48.79']
2023-03-02 02:57:59,982 [INFO][kubernetes_launcher:550]: Vineyard service endpoint: 172.31.59.21:32272
2023-03-02 02:57:59,987 [INFO][kubernetes_launcher:570]: Starting GAE rpc service on 10.244.2.179:56975 ...
2023-03-02 02:58:00,836 [DEBUG][utils:1896]: Resolve mpi cmd prefix: /opt/openmpi/bin/mpirun --allow-run-as-root --bind-to none -n 2 -host gs-engine-gtmrxh-0:1.0,gs-engine-gtmrxh-1:1.0
2023-03-02 02:58:00,838 [DEBUG][utils:1897]: Resolve mpi env: {"OMPI_MCA_btl_vader_single_copy_mechanism": "none", "OMPI_MCA_orte_allowed_exit_without_sync": "1", "OMPI_MCA_odls_base_sigkill_timeout": "0", "OMPI_MCA_plm_rsh_agent": "/usr/local/bin/kube_ssh"}
2023-03-02 02:58:00,839 [INFO][kubernetes_launcher:598]: Analytical engine launching command: /opt/openmpi/bin/mpirun --allow-run-as-root --bind-to none -n 2 -host gs-engine-gtmrxh-0:1.0,gs-engine-gtmrxh-1:1.0 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56975 -v 10 --vineyard_socket /tmp/vineyard_workspace/vineyard.sock
2023-03-02 02:58:00,891 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 0 time
2023-03-02 02:58:00,892 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-02 02:58:03,895 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 1 time
2023-03-02 02:58:03,895 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
10.244.2.179 gs-engine-gtmrxh-0
10.244.1.185 gs-engine-gtmrxh-1
10.244.2.179 gs-engine-gtmrxh-0
10.244.1.185 gs-engine-gtmrxh-1

@siyuan0322
Collaborator

I couldn't reproduce this using the same version and the same commands, except for the k8s_client_config.
How is your k8s cluster set up? Is it by kubeadm, k3s, or a cloud provider? Can the issue be reproduced consistently in your environment?

BTW, I'm interested in your use case, as it seems you have incorporated GraphScope into an Airflow DAG. Could you share more details about your target scenarios, or the graph problems you want to solve with us? cc @sighingnow

@JackyYangPassion
Contributor Author

JackyYangPassion commented Mar 6, 2023

How is your k8s cluster set up? Is it by kubeadm, k3s, or a cloud provider? Can the issue be reproduced consistently in your environment?

kubeadm

@JackyYangPassion
Contributor Author

BTW, I'm interested in your use case, as it seems you have incorporated GraphScope into an Airflow DAG. Could you share more details about your target scenarios, or the graph problems you want to solve with us?

Mainly to run GIE + GAE for offline graph analysis; we want to use GraphScope as our base computing engine!

@siyuan0322
Collaborator

I suspect it's a cluster network issue...
I searched for similar issues and found that it may be caused by differing host architectures: open-mpi/ompi#4437
Could you confirm that?
Does GS with 1 worker work normally?
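
For example, a quick check would be the same session call with a single worker (a sketch based on the reporter's configuration above; the kubeconfig path and namespace are the reporter's own):

import graphscope

# Same parameters as the failing report, but with num_workers=1.
# If this starts successfully while num_workers=2 fails, the problem is
# likely in the cross-pod MPI launch rather than in the images themselves.
session = graphscope.session(
    k8s_coordinator_cpu=1,
    k8s_coordinator_mem="1Gi",
    k8s_vineyard_cpu=0.2,
    k8s_vineyard_mem="1Gi",
    vineyard_shared_mem="2Gi",
    k8s_engine_cpu=0.2,
    k8s_engine_mem="1Gi",
    num_workers=1,  # single worker
    k8s_namespace='gstest008',
    k8s_client_config={"config_file": '/usr/local/airflow/dags/k8s_config'})
print(session)
session.close()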

@JackyYangPassion
Contributor Author

Does GS with 1 worker work normally?

Yes, it runs OK.

@siyuan0322
Collaborator

Specifying a newer tag, k8s_image_tag='0.20.0a20230306', will probably fix this.
We will release 0.20.0 in a few days.
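
For reference, a sketch of the original session call with only the image tag pinned to the suggested pre-release (all other parameters copied from the report above):

import graphscope

# Same configuration as in the report; only the image tag is changed.
session = graphscope.session(
    k8s_image_tag="0.20.0a20230306",  # newer tag suggested as a likely fix
    k8s_coordinator_cpu=1,
    k8s_coordinator_mem="1Gi",
    k8s_vineyard_cpu=0.2,
    k8s_vineyard_mem="1Gi",
    vineyard_shared_mem="2Gi",
    k8s_engine_cpu=0.2,
    k8s_engine_mem="1Gi",
    num_workers=2,
    k8s_namespace='gstest008',
    k8s_client_config={"config_file": '/usr/local/airflow/dags/k8s_config'})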

@sighingnow
Collaborator

Resolved by #2341; the root cause is a small difference in the OpenMPI configuration between the coordinator and analytical Docker images.

The patch will be delivered in the next release.
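
For anyone still on 0.19.0, one way to spot such a mismatch is to compare the OpenMPI builds shipped in the two images. This is only a sketch: it assumes Docker is available locally, that the default entrypoints can be overridden, and that mpirun lives at /opt/openmpi/bin/mpirun in both images (it does in the analytical image, as seen in the launch command above; the path in the coordinator image may differ).

import subprocess

# Print the OpenMPI version reported by mpirun inside each image.
# A version or build mismatch between coordinator and analytical images
# would be consistent with the "Data unpack would read past end of buffer" errors.
IMAGES = [
    "registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.19.0",
    "registry.cn-hongkong.aliyuncs.com/graphscope/analytical:0.19.0",
]

for image in IMAGES:
    result = subprocess.run(
        ["docker", "run", "--rm", "--entrypoint", "/opt/openmpi/bin/mpirun",
         image, "--version"],
        capture_output=True, text=True,
    )
    print(image)
    print(result.stdout or result.stderr)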
