Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel hang with RHEL/Centos 7.5 when ABORTs are sent. #102

Open
Jazz9 opened this issue May 26, 2018 · 18 comments
Open

Kernel hang with RHEL/Centos 7.5 when ABORTs are sent. #102

Jazz9 opened this issue May 26, 2018 · 18 comments

Comments

@Jazz9
Copy link

Jazz9 commented May 26, 2018

Hello,

Centos 7.5
VMware 6.0 U3
Ceph 12.2.5 with all SSD bluestore pool. ( only 20 OSD in SSD pool )
Active/standby iscsi

I have started using ceph-iscsi-cli in a small - non important production environment and last night both iscsi nodes crashed. A reboot fixed them ( they wouldn't reboot without a hard reset. The OS seemed fine but a hung task is my guess )

I am not 100% sure where to look for the logs that relate to the issue. I will put more monitoring on the system.

May 25 15:54:51 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 45
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: 2018-05-25 15:55:11.794 856 [ERROR] tcmu_notify_conn_lost:187 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Handler connection lost (lock state 0)
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: tcmu_notify_conn_lost:187 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Handler connection lost (lock state 0)
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 45
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 45
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: 2018-05-25 15:55:11.798 856 [INFO] tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_2.
May 25 15:55:11 SVR-AUBUN-SWD-ISCSI2 tcmu-runner: tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_2.
May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:14 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:15 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:15 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:17 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:18 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:18 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:20 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:21 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:21 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:23 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:24 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:24 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:26 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:27 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 15:55:27 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 15:55:29 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw

May 25 16:27:47 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 16:27:47 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 16:27:49 SVR-AUBUN-SWD-ISCSI2 kernel: Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
May 25 16:27:49 SVR-AUBUN-SWD-ISCSI2 kernel: iSCSI Login negotiation failed.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: INFO: task fn-radosclient:1555 blocked for more than 120 seconds.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: fn-radosclient D ffff9df8b2218fd0 0 1555 1 0x00000080
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: Call Trace:
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule+0x29/0x70
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule_timeout+0x239/0x2c0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? __send_signal+0x18e/0x490
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] wait_for_completion+0xfd/0x140
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_state+0x20/0x20
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_cause_connection_reinstatement+0x9e/0x100 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_free_session+0x109/0x180 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_release_sessions_for_tpg+0x123/0x1e0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_tpg_disable_portal_group+0xcf/0x1e0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] lio_target_tpg_enable_store+0x6e/0xf0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] configfs_write_file+0x107/0x140
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] vfs_write+0xc0/0x1f0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] SyS_write+0x7f/0xf0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] system_call_fastpath+0x1c/0x21
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? system_call_after_swapgs+0xc8/0x160
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: INFO: task iscsi_ttx:1528 blocked for more than 120 seconds.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: iscsi_ttx D ffff9df7b194dee0 0 1528 2 0x00000084
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: Call Trace:
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule+0x29/0x70
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] schedule_timeout+0x239/0x2c0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? list_del+0xd/0x30
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] wait_for_completion+0xfd/0x140
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_state+0x20/0x20
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] transport_generic_free_cmd+0xa2/0x150 [target_core_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_free_cmd+0x82/0x140 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_close_connection+0x4c6/0x8a0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsit_take_action_for_connection_exit+0x8b/0x120 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] iscsi_target_tx_thread+0x1f2/0x240 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? wake_up_atomic_t+0x30/0x30
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? iscsit_thread_get_cpumask+0xf0/0xf0 [iscsi_target_mod]
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] kthread+0xd1/0xe0
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? insert_kthread_work+0x40/0x40
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ret_from_fork_nospec_begin+0x21/0x21
May 25 16:27:50 SVR-AUBUN-SWD-ISCSI2 kernel: [] ? insert_kthread_work+0x40/0x40

I think these may be unrelated but interesting.

May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.539 852 [WARN] tcmu_block_device:395 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_block_device:395 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.555 852 [WARN] tcmu_rbd_lock:735 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Acquired exclusive lock.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_rbd_lock:735 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Acquired exclusive lock.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmu_unblock_device:418 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.
May 25 14:47:15 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 14:47:15.563 852 [WARN] tcmu_unblock_device:418 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Kernel does not support the block_dev action.

May 25 17:06:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:06:22.619 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:06:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:16:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:16:22.546 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:16:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:30:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:30:57.074 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:30:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:36:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:36:22.638 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:36:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:46:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 17:46:22.545 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 17:46:22 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 18:00:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: 2018-05-25 18:00:57.089 852 [WARN] tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported
May 25 18:00:57 SVR-AUBUN-SWD-ISCSI1 tcmu-runner: tcmur_cmdproc_thread:635 rbd/AUBUN-VMW-CLUSTER01-SSD.AUBUN-VMW-Cluster01: Command 0x85 not supported

@Jazz9
Copy link
Author

Jazz9 commented May 26, 2018

A little bit more info...
Running - tcmu-runner-1.3.0-89.g7d0023e.el7.x86_64

When the lockup occurs you can't run systemctl restart tcmu-runner as it never returns. I believe that is what is preventing reboot also.

@Jazz9
Copy link
Author

Jazz9 commented May 26, 2018

Also noted that I am running the Centos 7.5 standard install of targetcli. Its version is slightly lower then the requirements on the ceph master docs. Do you think this is the issue and do you know where a pre compiled newer version may be found?

targetcli.noarch 2.1.fb46-4.el7_5

@mikechristie mikechristie changed the title Both iscsi nodes crashed Kernel hang with RHEL/Centos 7.5 when ABORTs are sent. May 30, 2018
@mikechristie
Copy link
Contributor

You are using the RHEL/Centos 7.5 kernel right?

It looks like you are hitting a bug in that kernel. It is missing this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/target?id=527268df31e57cf2b6d417198717c6d6afdb1e3e

We are working on getting a fix released. It is merged in our internal repos and being QA'd right now. It should be released in about a month.

I will update this issue with the updated kernel version.

@Jazz9
Copy link
Author

Jazz9 commented May 31, 2018

Ok, that sounds good. If you want me to test it - I can. It seems that both iscsi nodes crash every few days.

@Jazz9
Copy link
Author

Jazz9 commented May 31, 2018

and yes - Centos 7.5 with latest updates.

@Jazz9
Copy link
Author

Jazz9 commented Jun 22, 2018

Did this get merged?

@mikechristie
Copy link
Contributor

It is in kernel-3.10.0-862.5 which should be released soon.

I do not know how long it takes to go from a RHEL z stream to Centos kernel though.

@Jazz9
Copy link
Author

Jazz9 commented Jul 10, 2018

Seems the latest Centos patches helped. No more crashes.

@VRAseb
Copy link

VRAseb commented Sep 25, 2018

Hello,

We have the same error with a 4.18.9-1 kernel with a single gateway (without multipathing) and ceph 13.2.0 .
Has this error been identified?
Is there a patch?

Sep 25 07:19:00 src-ceph-gwiscsi1 tcmu-runner: 2018-09-25 07:19:00.359 1204 [DEBUG] alua_implicit_transition:568 rbd/rbd.VMW_CEPH_DS08: lock state 1
Sep 25 07:19:09 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 71303168
Sep 25 07:20:01 src-ceph-gwiscsi1 systemd: Started Session 190 of user root.
Sep 25 07:20:01 src-ceph-gwiscsi1 systemd: Starting Session 190 of user root.
Sep 25 07:20:27 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:20:27 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:20:43 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:20:43 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:20:58 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:20:58 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:21:14 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:21:14 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:21:29 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:21:29 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:21:44 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:21:44 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: INFO: task kworker/u256:2:95199 blocked for more than 120 seconds.
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: kworker/u256:2 D 0 95199 2 0x80000080
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Workqueue: tmr-user target_tmr_work [target_core_mod]
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: Call Trace:
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: __schedule+0x2ab/0x880
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? kvasprintf+0x7f/0xd0
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x161/0x300
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? __next_timer_interrupt+0xe0/0xe0
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: wait_for_completion_timeout+0x130/0x1a0
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: __transport_wait_for_tasks+0xdd/0x140 [target_core_mod]
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: transport_wait_for_tasks+0x53/0x90 [target_core_mod]
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: core_tmr_abort_task+0xfb/0x180 [target_core_mod]
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: target_tmr_work+0x128/0x140 [target_core_mod]
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: process_one_work+0x179/0x390
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: worker_thread+0x4f/0x3e0
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? max_active_store+0x80/0x80
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20
Sep 25 07:21:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40
Sep 25 07:22:00 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:22:00 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:22:15 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:22:15 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:22:30 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:22:30 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:22:46 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:22:46 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:23:01 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:23:01 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:23:16 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:23:16 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:23:32 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:23:32 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:23:47 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Sep 25 07:23:47 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Sep 25 07:23:53 src-ceph-gwiscsi1 kernel: INFO: task kworker/0:0:111368 blocked for more than 120 seconds.

@mikechristie
Copy link
Contributor

@VRAseb

You are hitting a similar issue in that a command is taking a long time which leads to the initiator timing it out and sending an abort.

You not hitting the same issue that was being discussed in this GH issue though. The other user was hitting a issue that only occurred in the RHEL kernel because it was missing a patch. Because of this even when the command finally unjammed it self and completed the kernel hit a bug and would never unjam.

In your case from the logs above it looks like the command just never completed. You would want to look in the ceph logs to check if something happened to the cluster or if tcmu-runner crashed so it could not complete the command.

@VRAseb
Copy link

VRAseb commented Sep 28, 2018

Hello, the incident we encounter blocks the service rbd-target-gw.
We can not restart the rbd-target-gw service.
We can not restart OS, the kernel is blocked.
We have no alerts in CEPH journals (HEALTH_OK)

We tried with the tcmu-runner 1.4.0. We have the same incident.

Can you help us?

@mikechristie
Copy link
Contributor

The rbd-target-gw service and kernel are blocked on the stuck command.

Is tcmu-runner running (just do a 'ps -u root | grep runner')?

@VRAseb
Copy link

VRAseb commented Oct 3, 2018

Hello,
tcmu-runner does not run after crash.

@mikechristie
Copy link
Contributor

tcmu-runner does not run after crash.

Do you mean when you see the kernel message about the hung_task_timeout_sec, tcmu-runner is not running still? Has it crashed? If so, could you get me the core dump if there is one or get the /var/log/messages around the time tcmu-runner crashes so we can see if there is a stack dump from that? Could you also get me the /var/log/tcmu-runner.log for around the same time?

@VRAseb
Copy link

VRAseb commented Oct 5, 2018

Hello,

messages :
Oct 4 05:45:01 src-ceph-gwiscsi1 systemd: Started Session 283 of user root.
Oct 4 05:45:01 src-ceph-gwiscsi1 systemd: Starting Session 283 of user root.
Oct 4 05:49:25 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 97406060
Oct 4 05:49:35 src-ceph-gwiscsi1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 97406062
Oct 4 05:50:01 src-ceph-gwiscsi1 systemd: Started Session 284 of user root.
Oct 4 05:50:01 src-ceph-gwiscsi1 systemd: Starting Session 284 of user root.
Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48.
Oct 4 05:51:02 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48.
Oct 4 05:51:17 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48.
Oct 4 05:51:33 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48.
Oct 4 05:51:48 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48.
Oct 4 05:52:03 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48.
Oct 4 05:52:19 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48.
Oct 4 05:52:34 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iSCSI Login timeout on Network Portal 10.61.0.45:3260
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: INFO: task iscsi_trx:2513 blocked for more than 120 seconds.
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_trx D 0 2513 2 0x80000080
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Call Trace:
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: __schedule+0x2ab/0x880
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kvasprintf+0x7f/0xd0
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x161/0x300
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? __next_timer_interrupt+0xe0/0xe0
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: wait_for_completion_timeout+0x130/0x1a0
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: __transport_wait_for_tasks+0xdd/0x140 [target_core_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: target_wait_free_cmd+0x56/0x90 [target_core_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: transport_generic_free_cmd+0x103/0x140 [target_core_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_free_cmd+0x43/0x90 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_close_connection+0x586/0x8f0 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_rx_thread+0xe2/0xf0 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? iscsi_target_tx_thread+0x1f0/0x1f0 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: INFO: task kworker/0:0:164935 blocked for more than 120 seconds.
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Not tainted 4.18.9-1.el7.elrepo.x86_64 #1
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kworker/0:0 D 0 164935 2 0x80000080
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: Call Trace:
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: __schedule+0x2ab/0x880
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule+0x36/0x80
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: schedule_timeout+0x1dc/0x300
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: wait_for_completion+0x121/0x180
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? wake_up_q+0x80/0x80
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsit_stop_session+0x1b1/0x1c0 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_check_for_session_reinstatement+0x1ed/0x280 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_check_for_existing_instances+0x30/0x40 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_do_login+0x31d/0x5f0 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iscsi_target_do_login_rx+0x26f/0x340 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? iscsi_target_check_login_request+0x1a0/0x1a0 [iscsi_target_mod]
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: process_one_work+0x179/0x390
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: worker_thread+0x4f/0x3e0
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: kthread+0x105/0x140
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? max_active_store+0x80/0x80
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ? kthread_bind+0x20/0x20
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: ret_from_fork+0x35/0x40
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: tx_data returned -32, expecting 48.
Oct 4 05:52:50 src-ceph-gwiscsi1 kernel: iSCSI Login negotiation failed.

The tcmu log stops at the time of the crash.

@mikechristie
Copy link
Contributor

I think we are talking about different things. The log snippet you posted above is not a crash. It is a warning indicating that a command is stuck.

I am asking if when you see this is tcmu-runner running? If you run

systemctl status tcmu-runner

does it show it is running?

If it is then the command is could be stuck in the ceph cluster somewhere. You then should check the /var/log/ceph logs on the OSDs to see if there are any errors in there.

Also what tools have you used to set this up? Did you use the ceph-iscsi tools or did you run targetlci manually?

@mikechristie
Copy link
Contributor

Oh yeah, if you are setting it up yourself with targetcli set the osd_op_timeout to something like 15 seconds. This will force most commands to fail if they have not completed within that timeout. We can then verify that they are getting sent to the OSD and getting stuck there.

@manens
Copy link

manens commented Oct 9, 2018

Hi @mikechristie,
I investigated this case and this is what I can tell :

  • we didn't see any problem on ceph cluster (as yet)
  • tcmu is not running when, we look in the VM
  • this was setup with ceph-iscsi tools
  • ESX is connected to the gateway and is sending ABORT_TASK after something like 5s and later start sending Login requests
  • System require a reset to restart

From what I see, I agree with your first idea that tcmu crashed and never complete the command.

Could you give instruction to get the core dump of tcmu when the problem happen ? What can we provide to help ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants