SPAM RE: [ofa-general] SPAM SFS 3012 SRP problem

Thu Dec 20 01:07:12 PST 2007

We are using 2 IBM FAStT900's.

Normally the timestamps of the messages on both the SFS and the IB host
match.

Thanks

jeroen

From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
Sent: Wednesday, 19 December 2007 18:32
To: Jeroen Van Aken; general at lists.openfabrics.org
Subject: RE: [ofa-general] ***SPAM*** SFS 3012 SRP problem

If you have a Cisco support contract, you should open a case with the Cisco
TAC.

What kind of FC storage are you using?

The chassis syslog messages show the host is unresponsive (the OUT_OF_SERVICE
and IN_SERVICE messages).  Does the timing of these messages match the ib_srp
messages on the host?
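
One way to check the timing is to line up the trap messages from the chassis
log with the SRP messages from the host log. The following is only a minimal
sketch: it assumes the two clocks are synchronised, that the year is 2007
(syslog stamps carry no year), and that wrapped log lines have been rejoined.
The function names are made up for illustration; the sample lines are copied
from the logs quoted in this thread.

```python
from datetime import datetime

# Sample lines copied from the logs in this thread (wrapped lines rejoined).
SWITCH_LOG = [
    "Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM "
    "OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21",
    "Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM "
    "IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21",
]
HOST_LOG = [
    "Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called",
    "Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12",
]


def parse_time(line, year=2007):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def select(lines, keywords):
    """Return (timestamp, line) pairs for lines containing any keyword."""
    return [(parse_time(line), line) for line in lines
            if any(k in line for k in keywords)]


def lag_after_switch_event(switch_lines, host_lines):
    """For each host SRP message, seconds since the latest earlier trap."""
    switch_events = select(switch_lines, ("OUT_OF_SERVICE", "IN_SERVICE"))
    host_events = select(host_lines, ("ib_srp", "SRP abort"))
    result = []
    for host_time, host_line in host_events:
        earlier = [t for t, _ in switch_events if t <= host_time]
        if earlier:
            lag = int((host_time - max(earlier)).total_seconds())
            result.append((host_line, lag))
    return result


for host_line, seconds in lag_after_switch_event(SWITCH_LOG, HOST_LOG):
    print(f"{seconds:4d}s after last switch trap: {host_line}")
```

Feeding it the complete logs from both sides would show whether the host
errors consistently follow the traps by a similar interval.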

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

  _____  

From: general-bounces at lists.openfabrics.org
[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jeroen Van Aken
Sent: Wednesday, December 19, 2007 6:54 AM
To: general at lists.openfabrics.org
Subject: [ofa-general] ***SPAM*** SFS 3012 SRP problem

Hello

We are running some SRP tests with the Cisco SFS 3012 gateway. We connected 4
hosts to the gateway, each with 2 InfiniBand cables from one dual-port
InfiniBand card. The gateway is also connected to our Fibre Channel storage.
Each host runs OFED-1.3-beta2. The InfiniBand cards are Mellanox Technologies
MT25208 InfiniHost III Ex (rev a0) and Mellanox Technologies MT23108
InfiniHost (rev a1).

When generating heavy load through the switch (by reading from all the LUNs
on our FC storage simultaneously), we sometimes get the following errors:

On the hosts: 

Dec 13 13:07:54 gpfs4n1 syslog-ng[8212]: STATS: dropped 0

Dec 13 13:20:26 gpfs4n1 run_srp_daemon[8422]: failed srp_daemon:
[HCA=mthca0] [port=1] [exit status=110]. Will try to restart srp_daemon
periodically. No more warnings will be issued in the next 7200 seconds if the
same problem repeats

Dec 13 13:20:27 gpfs4n1 run_srp_daemon[8428]: starting srp_daemon:
[HCA=mthca0] [port=1]

Dec 13 14:01:20 gpfs4n1 sshd[8539]: Accepted keyboard-interactive/pam for
root from 172.16.0.18 port 3545 ssh2

Dec 13 14:07:55 gpfs4n1 syslog-ng[8212]: STATS: dropped 0

Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing permissions on special
file /dev/xconsole

Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing permissions on special
file /dev/tty10

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5
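
For reading the host messages above, it may help to translate the numeric
status values. The sketch below assumes the number printed by ib_srp is the
work completion status (enum ib_wc_status in the Linux kernel's
include/rdma/ib_verbs.h); only a subset of that enum is listed, and the
summarize() helper is hypothetical. Under that assumption, status 12 is
IB_WC_RETRY_EXC_ERR (transport retry counter exceeded, i.e. the remote side
stopped acknowledging) and status 5 is IB_WC_WR_FLUSH_ERR (outstanding work
requests flushed after the queue pair went into the error state).

```python
import re
from collections import Counter

# Subset of enum ib_wc_status (Linux include/rdma/ib_verbs.h).
IB_WC_STATUS = {
    0: "IB_WC_SUCCESS",
    5: "IB_WC_WR_FLUSH_ERR",
    12: "IB_WC_RETRY_EXC_ERR",
    13: "IB_WC_RNR_RETRY_EXC_ERR",
}

PATTERN = re.compile(r"ib_srp: failed (send|receive) status (\d+)")

# Sample lines copied from the host log in this thread.
SAMPLE = [
    "Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12",
    "Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5",
    "Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5",
]


def summarize(lines):
    """Count ib_srp failures per (direction, symbolic status name)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            direction, status = match.group(1), int(match.group(2))
            name = IB_WC_STATUS.get(status, f"unknown({status})")
            counts[(direction, name)] += 1
    return counts


for (direction, name), n in summarize(SAMPLE).items():
    print(f"{n:3d} x {direction:7s} {name}")
```

A retry-exceeded send followed by a burst of flushed receives would be
consistent with the target becoming unreachable, although the status codes
alone do not show on which side the fault lies.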

On the switch (ts_log):

************************************** SWITCH LOG **************************************

Dec 13 14:04:30 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
multicast membership change

Dec 13 14:05:49 topspin-cc ib_sm.x[1383]: [INFO]: Session not initiated:
Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:07:49 topspin-cc ib_sm.x[1383]: [INFO]: Initialize a backup
session with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:07:59 topspin-cc ib_sm.x[1383]: [INFO]: Session initialization
failed with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:09:59 topspin-cc ib_sm.x[1383]: [INFO]: Initialize a backup
session with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:10:09 topspin-cc ib_sm.x[1383]: [INFO]: Session initialization
failed with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM OUT_OF_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21

Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM OUT_OF_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:22

Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
discovering removed ports

Dec 13 14:12:07 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
multicast membership change

Dec 13 14:12:09 topspin-cc ib_sm.x[1383]: [INFO]: Session not initiated:
Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:12:18 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for read, err=11, t1=1, t2=0

Dec 13 14:12:22 topspin-cc last message repeated 4 times

Dec 13 14:12:36 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
discovering new ports

Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM IN_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21

Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM IN_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:22

13 14:12:3

Dec 13 14:12:38 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
multicast membership change

13 14:12:3

13 14:12:3

13 14:12:3

13 14:12:3

13 14:12:3

13 14:12:3

13 14:12:3

13 14:12:3

13 14:12:3

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

13 14:12:4

Dec 13 14:13:28 topspin-cc chassis_mgr.x[1084]: [WARN]: tsIpcMessageSend
failed, fd=28, vp=2, err=104, Connection reset by peer

Dec 13 14:13:39 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:13:39 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:13:40 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

Dec 13 14:13:46 topspin-cc web_agent.x[1370]: [INFO]: ipc: select(fd=3)
failed for read, err=11, t1=10, t2=0

Dec 13 14:13:50 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

Dec 13 14:13:50 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:13:50 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:14:18 topspin-cc ib_sm.x[1383]: [INFO]: Session not initiated:
Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:14:38 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:14:38 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:14:40 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

Dec 13 14:14:49 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:14:50 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:14:50 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

Dec 13 14:15:00 topspin-cc web_agent.x[1370]: [INFO]: ipc: select(fd=3)
failed for read, err=11, t1=10, t2=0

Dec 13 14:15:00 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:15:00 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:15:00 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

It looks like some of the log entries are incomplete.

I think this is a switch-related issue: first, because of the strange format
of the logs, and second, because when this error occurs on the switch, no SRP
communication is possible on any of the IB hosts. I already tried increasing
the node timeout and setting RENICE_IB_MAD to yes, as described in this
thread:
http://lists.openfabrics.org/pipermail/general/2007-May/036465.html, but
this didn't help.

This issue occurs randomly, so it isn't easily reproduced.

Does anybody have an idea what went wrong?

Thanks in advance!

Jeroen Van Aken
