Kernel hang with RHEL/Centos 7.5 when ABORTs are sent. #102
Comments
A little more info: when the lockup occurs, systemctl restart tcmu-runner never returns. I believe that is also what prevents the reboot.
I also noted that I am running the standard CentOS 7.5 install of targetcli. Its version is slightly lower than the requirement in the Ceph master docs. Do you think this is the issue, and do you know where a precompiled newer version can be found? targetcli.noarch 2.1.fb46-4.el7_5
You are using the RHEL/CentOS 7.5 kernel, right? It looks like you are hitting a bug in that kernel. It is missing this patch: We are working on getting a fix released. It is merged in our internal repos and is being QA'd right now. It should be released in about a month. I will update this issue with the fixed kernel version.
OK, that sounds good. If you want me to test it, I can. Both iSCSI nodes seem to crash every few days.
And yes, CentOS 7.5 with the latest updates.
Did this get merged?
It is in kernel-3.10.0-862.5, which should be released soon. I do not know how long it takes to go from a RHEL z-stream to a CentOS kernel, though.
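To check whether a gateway already runs a kernel at or above the version mentioned above, something like the following could be used. This is only a sketch: the helper name kernel_at_least is made up for illustration, it assumes RHEL-style release strings (for example 3.10.0-862.el7.x86_64) and GNU sort with -V, and it takes the 3.10.0-862.5 threshold from the comment above rather than from any errata.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if kernel release $1 is at least version $2.
kernel_at_least() {
    release=$1
    wanted=$2
    # Drop the distro suffix, e.g. "3.10.0-862.el7.x86_64" -> "3.10.0-862".
    base=${release%%.el*}
    # GNU sort -V orders version strings; the lower one is printed first.
    lowest=$(printf '%s\n%s\n' "$base" "$wanted" | sort -V | head -n 1)
    [ "$lowest" = "$wanted" ]
}

# Threshold assumed from the comment above.
if kernel_at_least "$(uname -r)" "3.10.0-862.5"; then
    echo "running kernel is at or above 3.10.0-862.5"
else
    echo "running kernel is older than 3.10.0-862.5"
fi
```

The suffix is stripped before comparing because version sort would otherwise compare ".el7" against ".5" textually, which does not say anything about the z-stream level.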
It seems the latest CentOS patches helped. No more crashes.
Hello,
We have the same error with a 4.18.9-1 kernel, a single gateway (without multipathing), and Ceph 13.2.0.
Sep 25 07:19:00 src-ceph-gwiscsi1 tcmu-runner: 2018-09-25 07:19:00.359 1204 [DEBUG] alua_implicit_transition:568 rbd/rbd.VMW_CEPH_DS08: lock state 1
You are hitting a similar issue, in that a command is taking a long time, which leads the initiator to time it out and send an abort. You are not hitting the same issue that was being discussed in this GH issue, though. The other user was hitting an issue that only occurred in the RHEL kernel because it was missing a patch. Because of this, even when the command finally unjammed itself and completed, the kernel hit a bug and would never unjam. In your case, from the logs above, it looks like the command just never completed. You would want to look in the Ceph logs to check whether something happened to the cluster, or whether tcmu-runner crashed so that it could not complete the command.
Hello, the incident we are encountering blocks the rbd-target-gw service. We tried tcmu-runner 1.4.0 and hit the same incident. Can you help us?
The rbd-target-gw service and the kernel are blocked on the stuck command. Is tcmu-runner running (just do a 'ps -u root | grep runner')?
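The ps check above can be wrapped so it gives a yes/no answer instead of a listing to read by eye. A minimal sketch, assuming the process name appears as tcmu-runner in the ps output; the function name runner_listed is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a tcmu-runner process appears in the
# `ps` listing read from stdin.
runner_listed() {
    # -q: no output, just an exit status; -w: match the name as a whole word.
    grep -qw 'tcmu-runner'
}

if ps -u root | runner_listed; then
    echo "tcmu-runner is running"
else
    echo "tcmu-runner is NOT running"
fi
```

Matching on the full name avoids false positives from unrelated processes that merely contain "runner".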
Hello,
Do you mean that when you see the kernel message about hung_task_timeout_secs, tcmu-runner is no longer running? Has it crashed? If so, could you get me the core dump, if there is one, or the /var/log/messages from around the time tcmu-runner crashed, so we can see whether there is a stack dump in it? Could you also get me /var/log/tcmu-runner.log from around the same time?
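To pull just the relevant lines out of /var/log/messages before attaching them, a filter along these lines may help. It is a sketch only: the function name hang_lines is made up, and the patterns are simply the message fragments that show up in the logs posted in this issue, so other kernels or tcmu-runner versions may word things differently.

```shell
#!/bin/sh
# Hypothetical helper: keep only syslog lines related to aborts, hung tasks,
# and lost handler connections. Reads the log from stdin.
hang_lines() {
    grep -E 'ABORT_TASK|blocked for more than [0-9]+ seconds|tcmu_notify_conn_lost|Call Trace'
}

# Reading /var/log/messages usually requires root.
if [ -r /var/log/messages ]; then
    hang_lines < /var/log/messages || echo "no matching lines found"
fi
```

The timestamps on the matching lines then give the window to cut from /var/log/tcmu-runner.log.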
Hello, messages: The tcmu log stops at the time of the crash.
I think we are talking about different things. The log snippet you posted above is not a crash. It is a warning indicating that a command is stuck. I am asking whether tcmu-runner is running when you see this. If you run systemctl status tcmu-runner, does it show that it is running? If it does, then the command could be stuck somewhere in the Ceph cluster. You should then check the /var/log/ceph logs on the OSDs to see if there are any errors in there. Also, what tools have you used to set this up? Did you use the ceph-iscsi tools, or did you run targetcli manually?
Oh yeah, if you are setting it up yourself with targetcli, set osd_op_timeout to something like 15 seconds. This will force most commands to fail if they have not completed within that timeout. We can then verify that they are getting sent to the OSD and getting stuck there.
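For a manual targetcli setup, the timeout would typically go into the backstore's config string. The sketch below only builds that string; it assumes the rbd handler accepts the form pool/image;osd_op_timeout=N, and the pool name rbd, image name disk_1, size, and the helper rbd_cfgstring are all placeholders, so check the tcmu-runner documentation for your version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: build an rbd config string with an OSD op timeout.
# Assumed format: <pool>/<image>;osd_op_timeout=<seconds>
rbd_cfgstring() {
    pool=$1
    image=$2
    timeout=$3
    printf '%s/%s;osd_op_timeout=%s\n' "$pool" "$image" "$timeout"
}

# Placeholder pool and image names; 15 seconds as suggested above.
cfg=$(rbd_cfgstring rbd disk_1 15)
echo "cfgstring=$cfg"

# Assumed invocation, not executed here. The argument is quoted because
# ';' would otherwise end the shell command:
#   targetcli /backstores/user:rbd create name=disk_1 size=10G "cfgstring=$cfg"
```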
Hi @mikechristie,
From what I see, I agree with your first idea that tcmu-runner crashed and never completed the command. Could you give instructions for getting a core dump of tcmu-runner when the problem happens? What can we provide to help?
Hello,
Centos 7.5
VMware 6.0 U3
Ceph 12.2.5 with an all-SSD BlueStore pool (only 20 OSDs in the SSD pool)
Active/standby iSCSI
I have started using ceph-iscsi-cli in a small, non-critical production environment, and last night both iSCSI nodes crashed. A reboot fixed them, although they would not reboot without a hard reset. The OS seemed fine; my guess is a hung task.
I am not 100% sure where to look for the logs that relate to the issue. I will put more monitoring on the system.
May 25 15:54:51 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 45
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: 2018-05-25 15:55:11.794 856 [ERROR] tcmu_notify_conn_lost:187 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Handler connection lost (lock state 0)
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: tcmu_notify_conn_lost:187 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Handler connection lost (lock state 0)
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 45
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 45
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: 2018-05-25 15:55:11.798 856 [INFO] tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_2.
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_2.
May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:15 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:15 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:18 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:18 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:21 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:21 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:24 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:24 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:27 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:27 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:29 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 16:27:47 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 16:27:47 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 16:27:49 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 16:27:49 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: INFO: task fn-radosclient:1555 blocked for more than 120 seconds.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: fn-radosclient D ffff9df8b2218fd0 0 1555 1 0x00000080
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: Call Trace:
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule+0x29/0x70
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule_timeout+0x239/0x2c0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? __send_signal+0x18e/0x490
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] wait_for_completion+0xfd/0x140
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_state+0x20/0x20
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_cause_connection_reinstatement+0x9e/0x100 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_free_session+0x109/0x180 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_release_sessions_for_tpg+0x123/0x1e0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_tpg_disable_portal_group+0xcf/0x1e0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] lio_target_tpg_enable_store+0x6e/0xf0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] configfs_write_file+0x107/0x140
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] vfs_write+0xc0/0x1f0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] SyS_write+0x7f/0xf0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] system_call_fastpath+0x1c/0x21
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: INFO: task iscsi_ttx:1528 blocked for more than 120 seconds.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: iscsi_ttx D ffff9df7b194dee0 0 1528 2 0x00000084
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: Call Trace:
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule+0x29/0x70
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule_timeout+0x239/0x2c0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? list_del+0xd/0x30
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] wait_for_completion+0xfd/0x140
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_state+0x20/0x20
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] transport_generic_free_cmd+0xa2/0x150 [target_core_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_free_cmd+0x82/0x140 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_close_connection+0x4c6/0x8a0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_take_action_for_connection_exit+0x8b/0x120 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsi_target_tx_thread+0x1f2/0x240 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_atomic_t+0x30/0x30
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? iscsit_thread_get_cpumask+0xf0/0xf0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] kthread+0xd1/0xe0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? insert_kthread_work+0x40/0x40
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ret_from_fork_nospec_begin+0x21/0x21
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? insert_kthread_work+0x40/0x40
I think these may be unrelated, but they are interesting:
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.539 852 [WARN] tcmu_block_device:395 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_block_device:395 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.555 852 [WARN] tcmu_rbd_lock:735 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Acquired exclusive lock.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_rbd_lock:735 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Acquired exclusive lock.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_unblock_device:418 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.563 852 [WARN] tcmu_unblock_device:418 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.
May 25 17:06:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:06:22.619 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:06:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:16:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:16:22.546 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:16:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:30:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:30:57.074 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:30:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:36:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:36:22.638 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:36:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:46:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:46:22.545 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:46:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 18:00:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 18:00:57.089 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 18:00:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported