[ofa-general] SPAM SFS 3012 SRP problem

Wed Dec 19 09:31:39 PST 2007

If you have a Cisco support contract, you should open a case with the
Cisco TAC.

What kind of FC storage are you using?

The chassis syslog messages show that the host is unresponsive (the
OUT_OF_SERVICE and IN_SERVICE messages).  Does the timing of these
messages match the ib_srp messages on the host?
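
One rough way to check the timing is sketched below in Python. It assumes the host and switch logs are available as plain text with one entry per line (not wrapped as in this mail) and that both clocks are reasonably in sync. The function names, the year default, and the 120-second window are illustrative choices, not part of any OFED or Cisco tool.

```python
from datetime import datetime

def parse_ts(line, year=2007):
    """Parse a leading 'Dec 13 14:12:06' syslog timestamp.

    Returns None when the line has no complete timestamp
    (for example a truncated entry such as '13 14:12:3').
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def events(lines, needle):
    """Timestamps of all lines that contain the substring `needle`."""
    found = []
    for line in lines:
        if needle in line:
            ts = parse_ts(line.strip())
            if ts is not None:
                found.append(ts)
    return found

def correlate(host_lines, switch_lines, window_s=120):
    """For each SM trap on the switch, count host SRP errors within window_s seconds."""
    host = events(host_lines, "ib_srp") + events(host_lines, "SRP abort")
    result = []
    for trap in ("OUT_OF_SERVICE trap", "IN_SERVICE trap"):
        for t in events(switch_lines, trap):
            near = [h for h in host if abs((h - t).total_seconds()) <= window_s]
            result.append((trap, t, len(near)))
    return result
```

If most traps have host errors inside the window, the two sets of messages are probably describing the same event; if not, they may be unrelated.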

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

________________________________

	From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeroen Van
Aken
	Sent: Wednesday, December 19, 2007 6:54 AM
	To: general at lists.openfabrics.org
	Subject: [ofa-general] ***SPAM*** SFS 3012 SRP problem

	Hello

	We are running some SRP tests with the Cisco SFS 3012 gateway. We
connected 4 hosts to the SFS 3012, each with 2 InfiniBand cables from one
dual-port InfiniBand card. The gateway is also connected to our Fibre
Channel storage.  Each host runs OFED-1.3-beta2. The InfiniBand cards are
Mellanox Technologies MT25208 InfiniHost III Ex (rev a0) and Mellanox
Technologies MT23108 InfiniHost (rev a1).

	When generating heavy load over the switch (by reading from all
the LUNs on our FC storage simultaneously), we sometimes get the
following errors:

	On the hosts: 

	Dec 13 13:07:54 gpfs4n1 syslog-ng[8212]: STATS: dropped 0

	Dec 13 13:20:26 gpfs4n1 run_srp_daemon[8422]: failed srp_daemon:
[HCA=mthca0] [port=1] [exit status=110]. Will try to restart srp_daemon
periodically. No more warnings will be issued in the next 7200 seconds if
the same problem repeats

	Dec 13 13:20:27 gpfs4n1 run_srp_daemon[8428]: starting
srp_daemon: [HCA=mthca0] [port=1]

	Dec 13 14:01:20 gpfs4n1 sshd[8539]: Accepted
keyboard-interactive/pam for root from 172.16.0.18 port 3545 ssh2

	Dec 13 14:07:55 gpfs4n1 syslog-ng[8212]: STATS: dropped 0

	Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing permissions on
special file /dev/xconsole

	Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing permissions on
special file /dev/tty10

	Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

	Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

	Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

	Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

	Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

	Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

	Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

	Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

	Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5
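
The numeric codes in the ib_srp lines above can be decoded with a small lookup, sketched here in Python. It assumes that ib_srp prints the raw work-completion status, i.e. a value of the kernel's enum ib_wc_status; only a few entries of that enum are listed, and the function name is made up for this example.

```python
import re

# Partial copy of the kernel's enum ib_wc_status (include/rdma/ib_verbs.h).
# Only the values relevant to this log are listed.
IB_WC_STATUS = {
    0: "IB_WC_SUCCESS",
    5: "IB_WC_WR_FLUSH_ERR",
    12: "IB_WC_RETRY_EXC_ERR",
    13: "IB_WC_RNR_RETRY_EXC_ERR",
}

PATTERN = re.compile(r"ib_srp: failed (send|receive) status (\d+)")

def decode(line):
    """Return (direction, status name) for an ib_srp failure line, else None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    code = int(m.group(2))
    return m.group(1), IB_WC_STATUS.get(code, f"unknown status {code}")
```

Under that assumption, status 12 on a send means the transport retry count was exceeded (the remote side did not acknowledge in time), and status 5 on the receives means the queue pair went into the error state and the outstanding work requests were flushed.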

	On the switch (ts_log):

	**************************************SWITCH
LOG***************************************************************

	Dec 13 14:04:30 topspin-cc ib_sm.x[1357]: [INFO]: Configuration
caused by multicast membership change

	Dec 13 14:05:49 topspin-cc ib_sm.x[1383]: [INFO]: Session not
initiated: Cold Sync Limit exceeded for Standby SM guid
00:05:ad:00:00:08:94:5d

	Dec 13 14:07:49 topspin-cc ib_sm.x[1383]: [INFO]: Initialize a
backup session with Standby SM guid 00:05:ad:00:00:08:94:5d

	Dec 13 14:07:59 topspin-cc ib_sm.x[1383]: [INFO]: Session
initialization failed with Standby SM guid 00:05:ad:00:00:08:94:5d

	Dec 13 14:09:59 topspin-cc ib_sm.x[1383]: [INFO]: Initialize a
backup session with Standby SM guid 00:05:ad:00:00:08:94:5d

	Dec 13 14:10:09 topspin-cc ib_sm.x[1383]: [INFO]: Session
initialization failed with Standby SM guid 00:05:ad:00:00:08:94:5d

	Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM
OUT_OF_SERVICE trap for
GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21

	Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM
OUT_OF_SERVICE trap for
GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:22

	Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Configuration
caused by discovering removed ports

	Dec 13 14:12:07 topspin-cc ib_sm.x[1357]: [INFO]: Configuration
caused by multicast membership change

	Dec 13 14:12:09 topspin-cc ib_sm.x[1383]: [INFO]: Session not
initiated: Cold Sync Limit exceeded for Standby SM guid
00:05:ad:00:00:08:94:5d

	Dec 13 14:12:18 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc:
select(fd=28) failed for read, err=11, t1=1, t2=0

	Dec 13 14:12:22 topspin-cc last message repeated 4 times

	Dec 13 14:12:36 topspin-cc ib_sm.x[1357]: [INFO]: Configuration
caused by discovering new ports

	Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM
IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21

	Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM
IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:22

	13 14:12:3

	Dec 13 14:12:38 topspin-cc ib_sm.x[1357]: [INFO]: Configuration
caused by multicast membership change

	13 14:12:3

	13 14:12:3

	13 14:12:3

	13 14:12:3

	13 14:12:3

	13 14:12:3

	13 14:12:3

	13 14:12:3

	13 14:12:3

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	13 14:12:4

	Dec 13 14:13:28 topspin-cc chassis_mgr.x[1084]: [WARN]:
tsIpcMessageSend failed, fd=28, vp=2, err=104, Connection reset by peer

	Dec 13 14:13:39 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc:
select(fd=28) failed for write, err=11, t1=10, t2=0

	Dec 13 14:13:39 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable

	Dec 13 14:13:40 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0

	Dec 13 14:13:46 topspin-cc web_agent.x[1370]: [INFO]: ipc:
select(fd=3) failed for read, err=11, t1=10, t2=0

	Dec 13 14:13:50 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0

	Dec 13 14:13:50 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc:
select(fd=28) failed for write, err=11, t1=10, t2=0

	Dec 13 14:13:50 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable

	Dec 13 14:14:18 topspin-cc ib_sm.x[1383]: [INFO]: Session not
initiated: Cold Sync Limit exceeded for Standby SM guid
00:05:ad:00:00:08:94:5d

	Dec 13 14:14:38 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc:
select(fd=28) failed for write, err=11, t1=10, t2=0

	Dec 13 14:14:38 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable

	Dec 13 14:14:40 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0

	Dec 13 14:14:49 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc:
select(fd=28) failed for write, err=11, t1=10, t2=0

	Dec 13 14:14:50 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable

	Dec 13 14:14:50 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0

	Dec 13 14:15:00 topspin-cc web_agent.x[1370]: [INFO]: ipc:
select(fd=3) failed for read, err=11, t1=10, t2=0

	Dec 13 14:15:00 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc:
select(fd=28) failed for write, err=11, t1=10, t2=0

	Dec 13 14:15:00 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable

	Dec 13 14:15:00 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0

	It looks like some of the log entries are incomplete.

	I think it is a switch-related issue: first, because of the
strange format of the logs, and second, because when this error occurs
on the switch, no SRP communication is possible from any of the IB hosts.
I already tried increasing the Node timeout and setting RENICE_IB_MAD to
yes, as described in this thread:
http://lists.openfabrics.org/pipermail/general/2007-May/036465.html
Neither change helped.

	This issue occurs randomly, so it isn't easily reproduced.

	Does anybody have an idea what went wrong?

	Thanks in advance!

	Jeroen Van Aken
