
[BUG] Using a GraphScope session to deploy on k8s fails when num_workers=2 #2479

Closed
JackyYangPassion opened this issue Mar 1, 2023 · 9 comments


@JackyYangPassion
Contributor

JackyYangPassion commented Mar 1, 2023

Describe the bug
Using a GraphScope session to deploy on k8s fails when num_workers=2.

To Reproduce

session = graphscope.session(
    k8s_coordinator_cpu=1,
    k8s_coordinator_mem="1Gi",
    k8s_vineyard_cpu=0.2,
    k8s_vineyard_mem="1Gi",
    vineyard_shared_mem="2Gi",
    k8s_engine_cpu=0.2,
    k8s_engine_mem="1Gi",
    num_workers=2,
    k8s_namespace='gstest008',
    k8s_client_config={"config_file": '/usr/local/airflow/dags/k8s_config'})

Log:

An internal error has occurred in ORTE:
[[64088,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
[gs-engine-tuivgx-1:00058] [[64088,0],2] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
[gs-engine-tuivgx-0:00078] [[64088,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
[coordinator-tuivgx-568c7c78d-sgw6t:00052] 1 more process has sent help message help-errmgr-base.txt / simple-message
[coordinator-tuivgx-568c7c78d-sgw6t:00052] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2023-03-01 15:47:02,102 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 2 time
2023-03-01 15:47:02,102 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:05,106 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 3 time
2023-03-01 15:47:05,107 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:08,111 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 4 time
2023-03-01 15:47:08,111 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:11,115 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 5 time
2023-03-01 15:47:11,115 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:14,119 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 6 time
2023-03-01 15:47:14,120 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:17,124 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 7 time
2023-03-01 15:47:17,124 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:20,129 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 8 time
2023-03-01 15:47:20,130 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:23,134 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 9 time
2023-03-01 15:47:23,135 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:26,139 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 10 time
2023-03-01 15:47:26,140 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:29,144 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 11 time
2023-03-01 15:47:29,144 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:32,149 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 12 time
2023-03-01 15:47:32,149 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:35,151 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 13 time
2023-03-01 15:47:35,152 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:38,157 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 14 time
2023-03-01 15:47:38,157 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:41,162 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 15 time
2023-03-01 15:47:41,162 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:44,167 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 16 time
2023-03-01 15:47:44,168 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:47,172 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 17 time
2023-03-01 15:47:47,173 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:50,177 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 18 time
2023-03-01 15:47:50,177 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:53,179 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 19 time
2023-03-01 15:47:53,179 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-01 15:47:56,183 [INFO][coordinator:519]: Clean up resources, cleanup_instance: True, is_dangling: False
2023-03-01 15:47:56,183 [INFO][kubernetes_launcher:693]: Cleaning up kubernetes resources
2023-03-01 15:47:56,695 [INFO][kubernetes_launcher:731]: Kubernetes launcher stopped


Environment:

  • GraphScope version: v0.19.0
  • OS: linux
  • Kubernetes Version: 1.25.4
@siyuan0322
Collaborator

siyuan0322 commented Mar 2, 2023

What's the image version and the client version?

Also, it seems that OpenMPI failed to start. Are there more messages before "An internal error has occurred in ORTE:"?

@JackyYangPassion
Contributor Author

JackyYangPassion commented Mar 2, 2023

What's the image version and the client version?

  • client version: 0.19.0

  • image version:
    registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.19.0
    registry.cn-hongkong.aliyuncs.com/graphscope/analytical:0.19.0
    registry.cn-hongkong.aliyuncs.com/graphscope/interactive-executor:0.19.0
    registry.cn-hongkong.aliyuncs.com/graphscope/learning:0.19.0
    vineyardcloudnative/vineyardd:v0.11.7

Also, it seems that OpenMPI failed to start. Are there more messages before "An internal error has occurred in ORTE:"?

Coordinator log:

2023-03-02 02:57:43,782 [INFO][coordinator:970]: Start server with args Namespace(cluster_type='k8s', dangling_timeout_seconds=600, etcd_addrs=None, etcd_listening_client_port=2379, etcd_listening_peer_port=2380, hosts='localhost', instance_id='gtmrxh', k8s_coordinator_name='coordinator-gtmrxh', k8s_coordinator_service_name='coordinator-gtmrxh', k8s_delete_namespace=False, k8s_engine_cpu=3.0, k8s_engine_mem='2Gi', k8s_engine_pod_node_selector='', k8s_image_pull_policy='IfNotPresent', k8s_image_pull_secrets='', k8s_image_registry='registry.cn-hongkong.aliyuncs.com', k8s_image_repository='graphscope', k8s_image_tag='0.19.0', k8s_mars_scheduler_cpu=0.2, k8s_mars_scheduler_mem='2Mi', k8s_mars_worker_cpu=0.2, k8s_mars_worker_mem='4Mi', k8s_namespace='gs-new-orc', k8s_service_type='NodePort', k8s_vineyard_cpu=0.5, k8s_vineyard_daemonset=None, k8s_vineyard_image='vineyardcloudnative/vineyardd:v0.11.7', k8s_vineyard_mem='512Mi', k8s_volumes='', k8s_with_analytical=True, k8s_with_analytical_java=False, k8s_with_dataset=False, k8s_with_interactive=True, k8s_with_learning=True, k8s_with_mars=False, log_level='DEBUG', monitor=False, monitor_port=9968, num_workers=2, port=59788, preemptive=True, timeout_seconds=600, vineyard_shared_mem='2Gi', vineyard_socket=None, waiting_for_delete=False)
2023-03-02 02:57:43,783 [INFO][coordinator:972]: Coordinator server listen at 0.0.0.0:59788
2023-03-02 02:57:46,438 [INFO][kubernetes_launcher:405]: Create engine headless services...
2023-03-02 02:57:46,451 [INFO][kubernetes_launcher:410]: Creating engine pods...
2023-03-02 02:57:46,475 [INFO][kubernetes_launcher:419]: Creating frontend pods...
2023-03-02 02:57:46,492 [INFO][kubernetes_launcher:435]: Creating vineyard service...
2023-03-02 02:57:46,589 [INFO][kubernetes_launcher:472]: Waiting for services ready...
2023-03-02 02:57:46,658 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Successfully assigned gs-new-orc/gs-engine-gtmrxh-0 to k8s-worker1
2023-03-02 02:57:49,824 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/analytical:0.19.0" already present on machine
2023-03-02 02:57:49,825 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Created container engine
2023-03-02 02:57:49,827 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Started container engine
2023-03-02 02:57:49,828 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/interactive-executor:0.19.0" already present on machine
2023-03-02 02:57:49,830 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Created container executor
2023-03-02 02:57:49,831 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Started container executor
2023-03-02 02:57:49,833 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/learning:0.19.0" already present on machine
2023-03-02 02:57:49,834 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Created container learning
2023-03-02 02:57:49,835 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Started container learning
2023-03-02 02:57:49,837 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Container image "vineyardcloudnative/vineyardd:v0.11.7" already present on machine
2023-03-02 02:57:49,838 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Created container vineyard
2023-03-02 02:57:49,840 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-0]: Started container vineyard
2023-03-02 02:57:56,898 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Successfully assigned gs-new-orc/gs-engine-gtmrxh-1 to k8s-worker2
2023-03-02 02:57:56,900 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/analytical:0.19.0" already present on machine
2023-03-02 02:57:56,901 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Created container engine
2023-03-02 02:57:56,903 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Started container engine
2023-03-02 02:57:56,905 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/interactive-executor:0.19.0" already present on machine
2023-03-02 02:57:56,907 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Created container executor
2023-03-02 02:57:56,909 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Started container executor
2023-03-02 02:57:56,910 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Container image "registry.cn-hongkong.aliyuncs.com/graphscope/learning:0.19.0" already present on machine
2023-03-02 02:57:56,912 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Created container learning
2023-03-02 02:57:56,913 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Started container learning
2023-03-02 02:57:56,915 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Container image "vineyardcloudnative/vineyardd:v0.11.7" already present on machine
2023-03-02 02:57:56,916 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Created container vineyard
2023-03-02 02:57:56,918 [INFO][kubernetes_launcher:514]: [gs-engine-gtmrxh-1]: Started container vineyard
2023-03-02 02:57:59,961 [INFO][kubernetes_launcher:546]: GraphScope engines pod is ready.
2023-03-02 02:57:59,981 [INFO][kubernetes_launcher:547]: Engines pod name list: ['gs-engine-gtmrxh-0', 'gs-engine-gtmrxh-1']
2023-03-02 02:57:59,981 [INFO][kubernetes_launcher:548]: Engines pod ip list: ['10.244.2.179', '10.244.1.185']
2023-03-02 02:57:59,982 [INFO][kubernetes_launcher:549]: Engines pod host ip list: ['172.31.59.21', '172.31.48.79']
2023-03-02 02:57:59,982 [INFO][kubernetes_launcher:550]: Vineyard service endpoint: 172.31.59.21:32272
2023-03-02 02:57:59,987 [INFO][kubernetes_launcher:570]: Starting GAE rpc service on 10.244.2.179:56975 ...
2023-03-02 02:58:00,836 [DEBUG][utils:1896]: Resolve mpi cmd prefix: /opt/openmpi/bin/mpirun --allow-run-as-root --bind-to none -n 2 -host gs-engine-gtmrxh-0:1.0,gs-engine-gtmrxh-1:1.0
2023-03-02 02:58:00,838 [DEBUG][utils:1897]: Resolve mpi env: {"OMPI_MCA_btl_vader_single_copy_mechanism": "none", "OMPI_MCA_orte_allowed_exit_without_sync": "1", "OMPI_MCA_odls_base_sigkill_timeout": "0", "OMPI_MCA_plm_rsh_agent": "/usr/local/bin/kube_ssh"}
2023-03-02 02:58:00,839 [INFO][kubernetes_launcher:598]: Analytical engine launching command: /opt/openmpi/bin/mpirun --allow-run-as-root --bind-to none -n 2 -host gs-engine-gtmrxh-0:1.0,gs-engine-gtmrxh-1:1.0 /opt/graphscope/bin/grape_engine --host 0.0.0.0 --port 56975 -v 10 --vineyard_socket /tmp/vineyard_workspace/vineyard.sock
2023-03-02 02:58:00,891 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 0 time
2023-03-02 02:58:00,892 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
2023-03-02 02:58:03,895 [WARNING][op_executor:349]: Connecting to analytical engine... retrying 1 time
2023-03-02 02:58:03,895 [WARNING][op_executor:352]: Error code: StatusCode.UNAVAILABLE, details failed to connect to all addresses
10.244.2.179 gs-engine-gtmrxh-0
10.244.1.185 gs-engine-gtmrxh-1
10.244.2.179 gs-engine-gtmrxh-0
10.244.1.185 gs-engine-gtmrxh-1

@siyuan0322
Collaborator

I couldn't reproduce this using the same version and the same commands, except for the k8s_client_config.
How is your k8s cluster set up? Is it by kubeadm, k3s, or a cloud provider? Can the issue be reproduced consistently in your environment?

BTW, I'm interested in your use case, as it seems you have incorporated GraphScope into an Airflow DAG. Could you share more details about your target scenarios, or the graph problems you want to solve with us? cc @sighingnow

@JackyYangPassion
Contributor Author

JackyYangPassion commented Mar 6, 2023

How is your k8s cluster set up? Is it by kubeadm, k3s, or a cloud provider? Can the issue be reproduced consistently in your environment?

kubeadm

@JackyYangPassion
Contributor Author

BTW, I'm interested in your use case, as it seems you have incorporated GraphScope into an Airflow DAG. Could you share more details about your target scenarios, or the graph problems you want to solve with us?

Mainly to run GIE + GAE for offline graph analysis; we want to use GraphScope as our base computing engine!

@siyuan0322
Collaborator

I suspect it's a cluster network issue...
I searched for similar issues and found that it may be caused by differing host architectures: open-mpi/ompi#4437
Could you confirm that?
Does GS with 1 worker work normally?
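
For example, a quick check would be the same session call with a single worker (a sketch based on the reporter's configuration above; the kubeconfig path and namespace are the reporter's own):

import graphscope

# Same parameters as the failing report, but with num_workers=1.
# If this starts successfully while num_workers=2 fails, the problem is
# likely in the cross-pod MPI launch rather than in the images themselves.
session = graphscope.session(
    k8s_coordinator_cpu=1,
    k8s_coordinator_mem="1Gi",
    k8s_vineyard_cpu=0.2,
    k8s_vineyard_mem="1Gi",
    vineyard_shared_mem="2Gi",
    k8s_engine_cpu=0.2,
    k8s_engine_mem="1Gi",
    num_workers=1,  # single worker
    k8s_namespace='gstest008',
    k8s_client_config={"config_file": '/usr/local/airflow/dags/k8s_config'})
print(session)
session.close()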

@JackyYangPassion
Contributor Author

Does GS with 1 worker work normally?

Yes, it runs OK.

@siyuan0322
Collaborator

Specifying a newer tag, k8s_image_tag='0.20.0a20230306', will probably fix this.
We will release 0.20.0 in a few days.
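
For reference, a sketch of the original session call with only the image tag pinned to the suggested pre-release (all other parameters copied from the report above):

import graphscope

# Same configuration as in the report; only the image tag is changed.
session = graphscope.session(
    k8s_image_tag="0.20.0a20230306",  # newer tag suggested as a likely fix
    k8s_coordinator_cpu=1,
    k8s_coordinator_mem="1Gi",
    k8s_vineyard_cpu=0.2,
    k8s_vineyard_mem="1Gi",
    vineyard_shared_mem="2Gi",
    k8s_engine_cpu=0.2,
    k8s_engine_mem="1Gi",
    num_workers=2,
    k8s_namespace='gstest008',
    k8s_client_config={"config_file": '/usr/local/airflow/dags/k8s_config'})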

@sighingnow
Collaborator

Resolved by #2341; the root cause is a small difference in the OpenMPI configuration between the coordinator and analytical Docker images.

The patch will be delivered in the next release.
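
For anyone still on 0.19.0, one way to spot such a mismatch is to compare the OpenMPI builds shipped in the two images. This is only a sketch: it assumes Docker is available locally, that the default entrypoints can be overridden, and that mpirun lives at /opt/openmpi/bin/mpirun in both images (it does in the analytical image, as seen in the launch command above; the path in the coordinator image may differ).

import subprocess

# Print the OpenMPI version reported by mpirun inside each image.
# A version or build mismatch between coordinator and analytical images
# would be consistent with the "Data unpack would read past end of buffer" errors.
IMAGES = [
    "registry.cn-hongkong.aliyuncs.com/graphscope/coordinator:0.19.0",
    "registry.cn-hongkong.aliyuncs.com/graphscope/analytical:0.19.0",
]

for image in IMAGES:
    result = subprocess.run(
        ["docker", "run", "--rm", "--entrypoint", "/opt/openmpi/bin/mpirun",
         image, "--version"],
        capture_output=True, text=True,
    )
    print(image)
    print(result.stdout or result.stderr)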
