[Users] SRP Abort called: five out of 66 paths offline

Tom Robinson tom.robinson at motec.com.au
Thu Feb 5 15:31:43 PST 2015


Hi,

Sorry for the long post (lots of logging) so first my questions: should I be concerned about this
WARNING? And, how do I get my faulty paths back online?

We are using COMSTAR targets on OmniOS and OFED (3.12) stacked on CentOS 7. Last night we got this
WARNING in our linux host logs:

Feb  5 22:28:33 daytona kernel: ------------[ cut here ]------------
Feb  5 22:28:33 daytona kernel: WARNING: at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()
Feb  5 22:28:33 daytona kernel: list_del corruption. prev->next should be ffff88343e85fd38, but was
ffff881faf801e80
Feb  5 22:28:33 daytona kernel: Modules linked in: vhost_net macvtap macvlan tun ipt_MASQUERADE
xt_CHECKSUM sg rdma_ucm(OF) rdma_cm(OF) ib_addr(OF) iw_cm(OF) ib_srp(OF) scsi_transport_srp(OF)
scsi_tgt ib_uverbs(OF) ib_umad(OF) ocrdma(OF) be2net(OF) iw_nes(OF) iw_cxgb4(OF) cxgb4(OF
) mlx5_ib(OF) mlx5_core(OF) ib_mthca(OF) ts_kmp nf_conntrack_amanda ip6t_rpfilter ip6t_REJECT
ipt_REJECT ib_ipoib(OF) xt_conntrack ib_cm(OF) ebtable_nat ebtable_broute ebtable_filter ebtables
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_securi
ty ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables iTCO_wdt
iTCO_vendor_support mxm_wmi usb_storage coretemp kvm_intel kvm crct10dif_pclmul
Feb  5 22:28:33 daytona kernel: crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul
glue_helper ablk_helper cryptd pcspkr mei_me sb_edac lpc_ich mei edac_core i2c_i801 mfd_core ntb
ioatdma shpchp ipmi_si ipmi_msghandler wmi acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd s
unrpc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common bridge stp llc mlx4_ib(OF)
ib_sa(OF) ib_mad(OF) mlx4_en(OF) mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper ttm
crc32c_intel ixgbe igb drm ptp pps_core dca mlx4_core(OF) i2c_algo_bit i2c_core dm_mirror
 dm_region_hash dm_log iscsi_tcp be2iscsi bnx2i cnic uio cxgb3i iw_cxgb3(OF) ib_core(OF) cxgb3(OF)
compat(OF) mdio libcxgbi libiscsi_tcp libiscsi scsi_transport_iscsi iscsi_ibft iscsi_boot_sysfs
dm_multipath dm_mod
Feb  5 22:28:33 daytona kernel: CPU: 13 PID: 10252 Comm: kworker/u66:1 Tainted: GF       W 
O--------------   3.10.0-123.13.2.el7.x86_64 #1
Feb  5 22:28:33 daytona kernel: Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
Feb  5 22:28:33 daytona kernel: Workqueue: scsi_tmf_5 scmd_eh_abort_handler
Feb  5 22:28:33 daytona kernel: ffff88343e85fc90 0000000007b1ee02 ffff88343e85fc48 ffffffff815e232c
Feb  5 22:28:33 daytona kernel: ffff88343e85fc80 ffffffff8105dee1 ffff88343e85fd38 0000000000000000
Feb  5 22:28:33 daytona kernel: ffff881faf801e78 0000000000000000 ffff881fc9ad0000 ffff88343e85fce8
Feb  5 22:28:33 daytona kernel: Call Trace:
Feb  5 22:28:33 daytona kernel: [<ffffffff815e232c>] dump_stack+0x19/0x1b
Feb  5 22:28:33 daytona kernel: [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
Feb  5 22:28:33 daytona kernel: [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
Feb  5 22:28:33 daytona kernel: [<ffffffff812cffa1>] __list_del_entry+0xa1/0xd0
Feb  5 22:28:33 daytona kernel: [<ffffffff812cffdd>] list_del+0xd/0x30
Feb  5 22:28:33 daytona kernel: [<ffffffff815e8487>] wait_for_completion_timeout+0x147/0x180
Feb  5 22:28:33 daytona kernel: [<ffffffff810977e0>] ? wake_up_state+0x20/0x20
Feb  5 22:28:33 daytona kernel: [<ffffffffa096bc96>] srp_send_tsk_mgmt+0x1a6/0x300 [ib_srp]
Feb  5 22:28:33 daytona kernel: [<ffffffffa096bf6a>] srp_abort+0xaa/0x150 [ib_srp]
Feb  5 22:28:33 daytona kernel: [<ffffffff813d901f>] scmd_eh_abort_handler+0xbf/0x480
Feb  5 22:28:33 daytona kernel: [<ffffffff8107e03b>] process_one_work+0x17b/0x460
Feb  5 22:28:33 daytona kernel: [<ffffffff8107ee0b>] worker_thread+0x11b/0x400
Feb  5 22:28:33 daytona kernel: [<ffffffff8107ecf0>] ? rescuer_thread+0x400/0x400
Feb  5 22:28:33 daytona kernel: [<ffffffff81085aff>] kthread+0xcf/0xe0
Feb  5 22:28:33 daytona kernel: [<ffffffff81085a30>] ? kthread_create_on_node+0x140/0x140
Feb  5 22:28:33 daytona kernel: [<ffffffff815f296c>] ret_from_fork+0x7c/0xb0
Feb  5 22:28:33 daytona kernel: [<ffffffff81085a30>] ? kthread_create_on_node+0x140/0x140
Feb  5 22:28:33 daytona kernel: ---[ end trace 4aaed1e1e524450e ]---

At that time five paths (all on the same HA) went offline and have remained offline. Note that there
are 66 paths in total; only five went offline. We have mutlipathing set up so everything is still
running (KVM guests are all still up).

Feb  6 10:16:46 daytona multipathd: 3600144f000000000000052773fca0006: sdcj - path offline
Feb  6 10:16:46 daytona multipathd: 3600144f00000000000005277442e0007: sdck - path offline
Feb  6 10:16:46 daytona multipathd: 3600144f0000000000000529713f1000a: sddk - path offline
Feb  6 10:16:46 daytona multipathd: 3600144f000000000000053a8c4ed0001: sder - path offline
Feb  6 10:16:46 daytona multipathd: 3600144f000000000000054c7fa52000d: sdfr - path offline

[root at daytona offlined]# multipath -ll | grep offline
| `- 5:0:0:64 sdfr 130:208 failed faulty offline
| `- 5:0:0:14 sdcj 69:112  failed faulty offline
| `- 5:0:0:15 sdck 69:128  failed faulty offline
| `- 5:0:0:29 sddk 71:32   failed faulty offline
| `- 5:0:0:47 sder 129:48  failed faulty offline
[root at daytona offlined]#

3600144f000000000000052773fca0006 dm-8 SUN     ,COMSTAR        
size=10G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=10 status=active
| |- 4:0:0:14 sdch 69:80   active ready  running
| `- 5:0:0:14 sdcj 69:112  failed faulty offline
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:0:14 sdm  8:192   active ready  running
3600144f00000000000005277442e0007 dm-10 SUN     ,COMSTAR        
size=100G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=10 status=active
| |- 4:0:0:15 sdcn 69:176  active ready  running
| `- 5:0:0:15 sdck 69:128  failed faulty offline
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:0:15 sdn  8:208   active ready  running
3600144f0000000000000529713f1000a dm-15 SUN     ,COMSTAR        
size=100G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=10 status=active
| |- 4:0:0:29 sddl 71:48   active ready  running
| `- 5:0:0:29 sddk 71:32   failed faulty offline
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:0:29 sdaa 65:160  active ready  running
3600144f000000000000053a8c4ed0001 dm-62 SUN     ,COMSTAR        
size=2.0T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=10 status=active
| |- 4:0:0:47 sden 128:240 active ready  running
| `- 5:0:0:47 sder 129:48  failed faulty offline
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:0:47 sdap 66:144  active ready  running
3600144f000000000000054c7fa52000d dm-68 SUN     ,COMSTAR        
size=10G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=10 status=active
| |- 4:0:0:64 sdbi 67:192  active ready  running
| `- 5:0:0:64 sdfr 130:208 failed faulty offline
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:0:64 sdbg 67:160  active ready  running


At the same time (Feb  5 22:25), on the COMSTAR side of the link we see the following logging:

==> /var/adm/messages <==
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 107833 kern.notice]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:25:17 monza.motec.com.au      Timeout of 60 seconds expired with 5 commands on target 67
lun 0.
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 107833 kern.warning] WARNING:
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:25:17 monza.motec.com.au      Disconnected command timeout for target 67 w5000c500564d9405.
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:25:17 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:25:17 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:25:17 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:25:17 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:25:17 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:25:17 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:25:17 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:25:17 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:25:17 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:25:17 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Feb  5 22:25:17 monza.motec.com.au      /scsi_vhci/disk at g5000c500564d9407 (sd24): Command Timeout on
path mpt_sas8/disk at w5000c500564d9405,0
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 107833 kern.notice]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:25:17 monza.motec.com.au      Timeout of 60 seconds expired with 6 commands on target 33
lun 0.
Feb  5 22:25:17 monza.motec.com.au scsi: [ID 107833 kern.warning] WARNING:
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:25:17 monza.motec.com.au      Disconnected command timeout for target 33 w5000c500564d9406.
Feb  5 22:25:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:25:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:25:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:25:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:25:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:25:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:25:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:25:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:25:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:25:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:25:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:25:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:25:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:25:18 monza.motec.com.au scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Feb  5 22:25:18 monza.motec.com.au      /scsi_vhci/disk at g5000c500564d9407 (sd24): Command Timeout on
path mpt_sas11/disk at w5000c500564d9406,0
Feb  5 22:26:18 monza.motec.com.au scsi: [ID 107833 kern.notice]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:26:18 monza.motec.com.au      Timeout of 60 seconds expired with 1 commands on target 67
lun 0.
Feb  5 22:26:18 monza.motec.com.au scsi: [ID 107833 kern.warning] WARNING:
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:26:18 monza.motec.com.au      Disconnected command timeout for target 67 w5000c500564d9405.
Feb  5 22:26:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:26:18 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:26:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:27:19 monza.motec.com.au scsi: [ID 107833 kern.notice]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:27:19 monza.motec.com.au      Timeout of 60 seconds expired with 2 commands on target 67
lun 0.
Feb  5 22:27:19 monza.motec.com.au scsi: [ID 107833 kern.warning] WARNING:
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:27:19 monza.motec.com.au      Disconnected command timeout for target 67 w5000c500564d9405.
Feb  5 22:27:19 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:27:19 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:27:19 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:27:19 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:27:19 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:27:19 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:28:17 monza.motec.com.au scsi: [ID 107833 kern.notice]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:28:17 monza.motec.com.au      Timeout of 60 seconds expired with 3 commands on target 33
lun 0.
Feb  5 22:28:17 monza.motec.com.au scsi: [ID 107833 kern.warning] WARNING:
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:28:17 monza.motec.com.au      Disconnected command timeout for target 33 w5000c500564d9406.
Feb  5 22:28:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:28:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:28:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:28:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:28:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:28:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:28:18 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c08 at 3/pci1000,3040 at 0 (mpt_sas2):
Feb  5 22:28:18 monza.motec.com.au      Log info 0x31140000 received for target 33 w5000c500564d9406.
Feb  5 22:28:18 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc
Feb  5 22:28:20 monza.motec.com.au scsi: [ID 107833 kern.notice]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:28:20 monza.motec.com.au      Timeout of 60 seconds expired with 1 commands on target 67
lun 0.
Feb  5 22:28:20 monza.motec.com.au scsi: [ID 107833 kern.warning] WARNING:
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:28:20 monza.motec.com.au      Disconnected command timeout for target 67 w5000c500564d9405.
Feb  5 22:28:20 monza.motec.com.au scsi: [ID 365881 kern.info]
/pci at 79,0/pci8086,3c02 at 1/pci1000,3040 at 0 (mpt_sas1):
Feb  5 22:28:20 monza.motec.com.au      Log info 0x31140000 received for target 67 w5000c500564d9405.
Feb  5 22:28:20 monza.motec.com.au      scsi_status=0x0, ioc_status=0x8048, scsi_state=0xc

I'm not sure which side is at fault here (COMSTAR/OFED). I suspect COMSTAR but the linux WARNING
doesn't look pretty either.

Anyone care to comment?

Kind regards,
Tom

-- 

Tom Robinson
IT Manager/System Administrator

MoTeC Pty Ltd

121 Merrindale Drive
Croydon South
3136 Victoria
Australia

T: +61 3 9761 5050
F: +61 3 9761 5051   
E: tom.robinson at motec.com.au


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: OpenPGP digital signature
URL: <http://lists.openfabrics.org/pipermail/users/attachments/20150206/8fccf298/attachment.sig>


More information about the Users mailing list