<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML xmlns="http://www.w3.org/TR/REC-html40" xmlns:v =
"urn:schemas-microsoft-com:vml" xmlns:o =
"urn:schemas-microsoft-com:office:office" xmlns:w =
"urn:schemas-microsoft-com:office:word" xmlns:m =
"http://schemas.microsoft.com/office/2004/12/omml"><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.3199" name=GENERATOR>
<STYLE>@font-face {
font-family: Cambria Math;
}
@font-face {
font-family: Calibri;
}
@page Section1 {size: 612.0pt 792.0pt; margin: 72.0pt 72.0pt 72.0pt 72.0pt; }
P.MsoNormal {
FONT-SIZE: 11pt; MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Calibri","sans-serif"
}
LI.MsoNormal {
FONT-SIZE: 11pt; MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Calibri","sans-serif"
}
DIV.MsoNormal {
FONT-SIZE: 11pt; MARGIN: 0cm 0cm 0pt; FONT-FAMILY: "Calibri","sans-serif"
}
A:link {
COLOR: blue; TEXT-DECORATION: underline; mso-style-priority: 99
}
SPAN.MsoHyperlink {
COLOR: blue; TEXT-DECORATION: underline; mso-style-priority: 99
}
A:visited {
COLOR: purple; TEXT-DECORATION: underline; mso-style-priority: 99
}
SPAN.MsoHyperlinkFollowed {
COLOR: purple; TEXT-DECORATION: underline; mso-style-priority: 99
}
SPAN.EmailStyle17 {
COLOR: windowtext; FONT-FAMILY: "Calibri","sans-serif"; mso-style-type: personal-compose
}
.MsoChpDefault {
mso-style-type: export-only
}
DIV.Section1 {
page: Section1
}
</STYLE>
<!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></HEAD>
<BODY lang=EN-US vLink=purple link=blue>
<DIV dir=ltr align=left><SPAN class=406242917-19122007><FONT face=Arial
color=#0000ff size=2>If you have a Cisco support contract, you should open a
case with the Cisco TAC.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=406242917-19122007><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=406242917-19122007><FONT face=Arial
color=#0000ff size=2>What kind of FC storage are you using?</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=406242917-19122007><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=406242917-19122007><FONT face=Arial
color=#0000ff size=2>The chassis syslog messages show that the host was
unresponsive (the OUT_OF_SERVICE and IN_SERVICE traps). Does the timing of these
messages match the ib_srp messages on the host?</FONT></SPAN></DIV>
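<DIV dir=ltr align=left><SPAN class=406242917-19122007><FONT face=Arial
color=#0000ff size=2>&nbsp;</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=406242917-19122007><FONT face=Arial
color=#0000ff size=2>As a rough sketch of that comparison (Python, using a few
lines copied from the logs quoted below), the offsets between the switch traps
and the first host errors could be computed as follows. This assumes the switch
and host clocks are synchronized, which is not confirmed here, and that the
year is 2007; the helper names stamp, first_match and gap_seconds are made up
for illustration.</FONT></SPAN></DIV>
<PRE>
```python
from datetime import datetime

# Sample lines copied from the switch log (ts_log) quoted in this thread.
SWITCH_LOG = [
    "Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM "
    "OUT_OF_SERVICE trap for "
    "GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21",
    "Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM "
    "IN_SERVICE trap for "
    "GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21",
]

# Sample lines copied from the host syslog quoted in this thread.
HOST_LOG = [
    "Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called",
    "Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12",
    "Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5",
]


def stamp(line, year=2007):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp.

    Syslog lines carry no year, so one is assumed (2007 here).
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle):
    """Return the timestamp of the first line containing needle, or None."""
    for line in lines:
        if needle in line:
            return stamp(line)
    return None


def gap_seconds(switch_lines, switch_needle, host_lines, host_needle):
    """Seconds from the first matching switch line to the first host line.

    Only meaningful if both clocks are synchronized (an assumption).
    Returns None if either pattern is not found.
    """
    start = first_match(switch_lines, switch_needle)
    end = first_match(host_lines, host_needle)
    if start is None or end is None:
        return None
    return int((end - start).total_seconds())


if __name__ == "__main__":
    for trap in ("OUT_OF_SERVICE", "SM IN_SERVICE"):
        for error in ("SRP abort", "ib_srp: failed"):
            gap = gap_seconds(SWITCH_LOG, trap, HOST_LOG, error)
            print(f"{trap} to first '{error}': {gap} s")
```
</PRE>
<DIV dir=ltr align=left><SPAN class=406242917-19122007><FONT face=Arial
color=#0000ff size=2>With the sample lines above, the first "SRP abort called"
on the host is logged 55 seconds after the OUT_OF_SERVICE trap and 24 seconds
after the IN_SERVICE trap, if the two clocks agree.</FONT></SPAN></DIV>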
<DIV><FONT color=#0000ff></FONT> </DIV>
<DIV align=left><FONT face=Arial size=2><FONT color=#0000ff size=2>
<P align=left>Scott Weitzenkamp<BR>SQA and Release Manager<BR>Server
Virtualization Business Unit<BR>Cisco Systems<BR></P></FONT></FONT></DIV>
<DIV> </DIV><BR>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> general-bounces@lists.openfabrics.org
[mailto:general-bounces@lists.openfabrics.org] <B>On Behalf Of </B>Jeroen Van
Aken<BR><B>Sent:</B> Wednesday, December 19, 2007 6:54 AM<BR><B>To:</B>
general@lists.openfabrics.org<BR><B>Subject:</B> [ofa-general] ***SPAM*** SFS
3012 SRP problem<BR></FONT><BR></DIV>
<DIV></DIV>
<DIV class=Section1>
<P class=MsoNormal>Hello<o:p></o:p></P>
<P class=MsoNormal><o:p> </o:p></P>
<P class=MsoNormal>We are running some SRP tests with the Cisco SFS 3012
gateway. We connected 4 hosts to the SFS 3012, each with 2 InfiniBand cables
from one dual-port InfiniBand card. The gateway is also connected to our
Fibre Channel storage. Each host runs OFED-1.3-beta2. The InfiniBand cards
are Mellanox Technologies MT25208 InfiniHost III Ex (rev a0) and Mellanox
Technologies MT23108 InfiniHost (rev a1).<o:p></o:p></P>
<P class=MsoNormal>When generating heavy load through the switch (by reading
from all the LUNs on our FC storage simultaneously), we sometimes get the
following errors:<o:p></o:p></P>
<P class=MsoNormal>On the hosts: <o:p></o:p></P>
<P class=MsoNormal><o:p> </o:p></P>
<P class=MsoNormal>Dec 13 13:07:54 gpfs4n1 syslog-ng[8212]: STATS: dropped
0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 13:20:26 gpfs4n1 run_srp_daemon[8422]: failed
srp_daemon: [HCA=mthca0] [port=1] [exit status=110]. Will try to restart
srp_daemon periodically. No more warnings will be issued in the next 7200
seconds if the same problem repeats<o:p></o:p></P>
<P class=MsoNormal>Dec 13 13:20:27 gpfs4n1 run_srp_daemon[8428]: starting
srp_daemon: [HCA=mthca0] [port=1]<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:01:20 gpfs4n1 sshd[8539]: Accepted
keyboard-interactive/pam for root from 172.16.0.18 port 3545
ssh2<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:07:55 gpfs4n1 syslog-ng[8212]: STATS: dropped
0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing
permissions on special file /dev/xconsole<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing
permissions on special file /dev/tty10<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 kernel: SRP abort
called<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 kernel: SRP abort
called<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 kernel: SRP abort
called<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 kernel: SRP abort
called<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 kernel: SRP abort
called<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 kernel: SRP abort
called<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 kernel: SRP abort
called<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:01 gpfs4n1 kernel: SRP abort
called<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status
12<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status
12<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive
status 5<o:p></o:p></P>
<P class=MsoNormal><o:p> </o:p></P>
<P class=MsoNormal>On the switch, in ts_log:<o:p></o:p></P>
<P class=MsoNormal>**************************************SWITCH
LOG***************************************************************<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:04:30 topspin-cc ib_sm.x[1357]: [INFO]:
Configuration caused by multicast membership change<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:05:49 topspin-cc ib_sm.x[1383]: [INFO]: Session
not initiated: Cold Sync Limit exceeded for Standby SM guid
00:05:ad:00:00:08:94:5d<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:07:49 topspin-cc ib_sm.x[1383]: [INFO]:
Initialize a backup session with Standby SM guid
00:05:ad:00:00:08:94:5d<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:07:59 topspin-cc ib_sm.x[1383]: [INFO]: Session
initialization failed with Standby SM guid
00:05:ad:00:00:08:94:5d<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:09:59 topspin-cc ib_sm.x[1383]: [INFO]:
Initialize a backup session with Standby SM guid
00:05:ad:00:00:08:94:5d<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:10:09 topspin-cc ib_sm.x[1383]: [INFO]: Session
initialization failed with Standby SM guid
00:05:ad:00:00:08:94:5d<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate
SM OUT_OF_SERVICE trap for
GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate
SM OUT_OF_SERVICE trap for
GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:22<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]:
Configuration caused by discovering removed ports<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:07 topspin-cc ib_sm.x[1357]: [INFO]:
Configuration caused by multicast membership change<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:09 topspin-cc ib_sm.x[1383]: [INFO]: Session
not initiated: Cold Sync Limit exceeded for Standby SM guid
00:05:ad:00:00:08:94:5d<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:18 topspin-cc chassis_mgr.x[1084]: [INFO]:
ipc: select(fd=28) failed for read, err=11, t1=1, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:22 topspin-cc last message repeated 4
times<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:36 topspin-cc ib_sm.x[1357]: [INFO]:
Configuration caused by discovering new ports<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate
SM IN_SERVICE trap for
GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:37 topspin-cc ib_sm.x[1357]: [INFO]: Generate
SM IN_SERVICE trap for
GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:22<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:12:38 topspin-cc ib_sm.x[1357]: [INFO]:
Configuration caused by multicast membership change<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:3<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>13 14:12:4<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:28 topspin-cc chassis_mgr.x[1084]: [WARN]:
tsIpcMessageSend failed, fd=28, vp=2, err=104, Connection reset by
peer<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:39 topspin-cc chassis_mgr.x[1084]: [INFO]:
ipc: select(fd=28) failed for write, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:39 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:40 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:46 topspin-cc web_agent.x[1370]: [INFO]: ipc:
select(fd=3) failed for read, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:50 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:50 topspin-cc chassis_mgr.x[1084]: [INFO]:
ipc: select(fd=28) failed for write, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:13:50 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:14:18 topspin-cc ib_sm.x[1383]: [INFO]: Session
not initiated: Cold Sync Limit exceeded for Standby SM guid
00:05:ad:00:00:08:94:5d<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:14:38 topspin-cc chassis_mgr.x[1084]: [INFO]:
ipc: select(fd=28) failed for write, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:14:38 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:14:40 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:14:49 topspin-cc chassis_mgr.x[1084]: [INFO]:
ipc: select(fd=28) failed for write, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:14:50 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:14:50 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:15:00 topspin-cc web_agent.x[1370]: [INFO]: ipc:
select(fd=3) failed for read, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:15:00 topspin-cc chassis_mgr.x[1084]: [INFO]:
ipc: select(fd=28) failed for write, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:15:00 topspin-cc chassis_mgr.x[1084]: [INFO]:
tsIpcMessageSend failed, fd=28, vp=2, err=11, Resource temporarily
unavailable<o:p></o:p></P>
<P class=MsoNormal>Dec 13 14:15:00 topspin-cc snmp_agent.x[1208]: [INFO]: ipc:
select(fd=5) failed for read, err=11, t1=10, t2=0<o:p></o:p></P>
<P class=MsoNormal><o:p> </o:p></P>
<P class=MsoNormal>It looks like some of the log entries are
incomplete.<o:p></o:p></P>
<P class=MsoNormal>I think it is a switch-related issue: first, because of
the strange format of the logs, and second, because when this error occurs
on the switch, no SRP communication is possible on any of the IB hosts. I
already tried increasing the node timeout and setting RENICE_IB_MAD to yes, as
described in this thread: <A
href="http://lists.openfabrics.org/pipermail/general/2007-May/036465.html">http://lists.openfabrics.org/pipermail/general/2007-May/036465.html</A>,
but this didn’t help.<o:p></o:p></P>
<P class=MsoNormal>This issue occurs randomly, so it isn’t easily
reproduced.<o:p></o:p></P>
<P class=MsoNormal>Does anybody have an idea what went wrong?<o:p></o:p></P>
<P class=MsoNormal><o:p> </o:p></P>
<P class=MsoNormal>Thanks in advance!<o:p></o:p></P>
<P class=MsoNormal><o:p> </o:p></P>
<P class=MsoNormal>Jeroen Van Aken<o:p></o:p></P>
<P class=MsoNormal><o:p> </o:p></P></DIV></BLOCKQUOTE></BODY></HTML>