[ofa-general] SFS 3012 SRP problem

Jeroen Van Aken jeroen.vanaken at intec.ugent.be
Wed Dec 19 06:54:13 PST 2007


Hello

 

We are doing some SRP tests with the Cisco SFS 3012 gateway. We connected four
hosts to the SFS 3012, each with two InfiniBand cables on a single dual-port
InfiniBand card. The gateway is also connected to our Fibre Channel storage.
Each host runs OFED-1.3-beta2. The HCAs used are Mellanox Technologies MT25208
InfiniHost III Ex (rev a0) and Mellanox Technologies MT23108 InfiniHost
(rev a1) cards.

When generating heavy load over the switch (by reading from our FC storage
over all the LUNs simultaneously), we sometimes get the errors shown below.
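The load itself is nothing special: it is just many sequential readers running
in parallel over the SRP-attached LUNs. A minimal sketch of the kind of reader
we use is below; the device paths are placeholders, not our actual LUN names.

/*
 * Minimal load-generator sketch: one sequential reader per SRP-attached
 * block device, all running in parallel.  The device paths are
 * placeholders, not our actual LUN names.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void read_forever(const char *dev)
{
    static char buf[256 * 1024];            /* 256 KiB read buffer */
    int fd = open(dev, O_RDONLY);

    if (fd < 0) {
        perror(dev);
        _exit(1);
    }
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n <= 0)                         /* EOF or error: start over */
            lseek(fd, 0, SEEK_SET);
    }
}

int main(void)
{
    /* Placeholder names for the LUNs exported through the SRP gateway. */
    const char *luns[] = { "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde" };
    unsigned i;

    for (i = 0; i < sizeof(luns) / sizeof(luns[0]); i++)
        if (fork() == 0)
            read_forever(luns[i]);

    while (wait(NULL) > 0)                  /* children never exit */
        ;
    return 0;
}

With several of these readers running on all four hosts at the same time, the
following errors appear: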

On the hosts: 

 

Dec 13 13:07:54 gpfs4n1 syslog-ng[8212]: STATS: dropped 0

Dec 13 13:20:26 gpfs4n1 run_srp_daemon[8422]: failed srp_daemon:
[HCA=mthca0] [port=1] [exit status=110]. Will try to restart srp_daemon
periodically. No more warnings will be issued in the next 7200 seconds if
the same problem repeats

Dec 13 13:20:27 gpfs4n1 run_srp_daemon[8428]: starting srp_daemon:
[HCA=mthca0] [port=1]

Dec 13 14:01:20 gpfs4n1 sshd[8539]: Accepted keyboard-interactive/pam for
root from 172.16.0.18 port 3545 ssh2

Dec 13 14:07:55 gpfs4n1 syslog-ng[8212]: STATS: dropped 0

Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing permissions on special
file /dev/xconsole

Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing permissions on special
file /dev/tty10

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5
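
For what it's worth, if I decode the numeric statuses in the ib_srp messages
above against the kernel's enum ib_wc_status (include/rdma/ib_verbs.h), and I
am assuming that mapping is the right one, then status 12 would be
IB_WC_RETRY_EXC_ERR (the transport retry counter was exceeded on a send) and
status 5 would be IB_WC_WR_FLUSH_ERR (outstanding receives flushed after the
QP dropped into the error state). The srp_daemon exit status 110 also looks
like plain errno 110, ETIMEDOUT. A small decoder along those lines:

/*
 * Decoder for the ib_srp status numbers in the log above, assuming they
 * are values of the kernel's enum ib_wc_status (include/rdma/ib_verbs.h).
 * Only the two codes we actually see are handled.
 */
#include <stdio.h>

static const char *wc_status_name(int status)
{
    switch (status) {
    case 5:
        return "IB_WC_WR_FLUSH_ERR (WR flushed, QP in error state)";
    case 12:
        return "IB_WC_RETRY_EXC_ERR (transport retry counter exceeded)";
    default:
        return "other, see enum ib_wc_status";
    }
}

int main(void)
{
    printf("failed send status 12   -> %s\n", wc_status_name(12));
    printf("failed receive status 5 -> %s\n", wc_status_name(5));
    return 0;
}

If that reading is correct, the hosts are simply reporting that the SRP target
on the gateway stopped responding, which would fit with the OUT_OF_SERVICE
traps in the switch log below.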

 

On the switch (ts_log):

*************************** SWITCH LOG ***************************

Dec 13 14:04:30 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
multicast membership change

Dec 13 14:05:49 topspin-cc ib_sm.x[1383]: [INFO]: Session not initiated:
Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:07:49 topspin-cc ib_sm.x[1383]: [INFO]: Initialize a backup
session with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:07:59 topspin-cc ib_sm.x[1383]: [INFO]: Session initialization
failed with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:09:59 topspin-cc ib_sm.x[1383]: [INFO]: Initialize a backup
session with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:10:09 topspin-cc ib_sm.x[1383]: [INFO]: Session initialization
failed with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM OUT_OF_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21

Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM OUT_OF_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:22

Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
discovering removed ports

Dec 13 14:12:07 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
multicast membership change

Dec 13 14:12:09 topspin-cc ib_sm.x[1383]: [INFO]: Session not initiated:
Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:12:18 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for read, err=11, t1=1, t2=0

Dec 13 14:12:22 topspin-cc last message repeated 4 times

Dec 13 14:12:36 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
discovering new ports

Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM IN_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21

Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM IN_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:22

[ truncated log entry, only "13 14:12:3" was written ]

Dec 13 14:12:38 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
multicast membership change

[ 27 more truncated log entries follow, each containing only a partial
timestamp such as "13 14:12:3" or "13 14:12:4" ]

Dec 13 14:13:28 topspin-cc chassis_mgr.x[1084]: [WARN]: tsIpcMessageSend
failed, fd=28, vp=2, err=104, Connection reset by peer

Dec 13 14:13:39 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:13:39 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:13:40 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

Dec 13 14:13:46 topspin-cc web_agent.x[1370]: [INFO]: ipc: select(fd=3)
failed for read, err=11, t1=10, t2=0

Dec 13 14:13:50 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

Dec 13 14:13:50 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:13:50 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:14:18 topspin-cc ib_sm.x[1383]: [INFO]: Session not initiated:
Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:14:38 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:14:38 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:14:40 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

Dec 13 14:14:49 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:14:50 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:14:50 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

Dec 13 14:15:00 topspin-cc web_agent.x[1370]: [INFO]: ipc: select(fd=3)
failed for read, err=11, t1=10, t2=0

Dec 13 14:15:00 topspin-cc chassis_mgr.x[1084]: [INFO]: ipc: select(fd=28)
failed for write, err=11, t1=10, t2=0

Dec 13 14:15:00 topspin-cc chassis_mgr.x[1084]: [INFO]: tsIpcMessageSend
failed, fd=28, vp=2, err=11, Resource temporarily unavailable

Dec 13 14:15:00 topspin-cc snmp_agent.x[1208]: [INFO]: ipc: select(fd=5)
failed for read, err=11, t1=10, t2=0

 

As you can see, some of the switch log entries are incomplete.

I think this is a switch-related issue: first because of the strange,
truncated format of the switch log, and second because when the error occurs
on the switch, no SRP communication is possible on any of the IB hosts. I
already tried increasing the node timeout and setting RENICE_IB_MAD to yes,
as described in this thread:
http://lists.openfabrics.org/pipermail/general/2007-May/036465.html. Neither
change helped.

The issue occurs randomly, so it is not easily reproduced.

Does anybody have an idea what went wrong?

 

Thanks in advance!

 

Jeroen Van Aken

 
