<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE></TITLE>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.3395" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left><SPAN class=294154711-17092008><FONT face=Arial
color=#0000ff size=2>Please see
my questions inline.</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=294154711-17092008></SPAN><SPAN
class=294154711-17092008><FONT face=Arial color=#0000ff
size=2>Thanks,</FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=294154711-17092008><FONT face=Arial
color=#0000ff size=2>Alex.</FONT></SPAN></DIV><BR>
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> ofw-bounces@lists.openfabrics.org
[mailto:ofw-bounces@lists.openfabrics.org] <B>On Behalf Of </B>Alex
Naslednikov<BR><B>Sent:</B> Wednesday, September 17, 2008 4:47
AM<BR><B>To:</B> ofw@lists.openfabrics.org; Fab Tillier; Leonid
Keller<BR><B>Subject:</B> [ofw] [IPoIB] Problem with "Avoid the SM" patch
<BR></FONT><BR></DIV>
<DIV></DIV><!-- Converted from text/plain format -->
<P><FONT size=2><FONT size=3>Reposting this issue to the whole
community.<BR><U>Current Problem:<BR></U>The "Avoid the SM" patch causes IPoIB
to clear its endpoint cache incorrectly. That is, after opensm is restarted,
IPoIB communication stops working (including pings).</FONT></FONT><FONT
size=2><FONT size=3><BR><U>Detailed Description:</U><BR>1.1 After restarting
opensm, __endpt_mgr_reset_all() sets dlid==0 for the endpoints.<BR>Thus, an ARP
should be sent in order to resume normal communication.<BR>1.2 The ARP was indeed
sent, and even received by the remote side.<BR>But note that we
send the ARP request by broadcast, while the ARP response is always unicast, with an INVALID
DLID.<BR>Thus, normal communication can't be resumed without the ARP response, and the
ARP response can't be sent without a valid dlid.</FONT></FONT></P>
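<DIV>A minimal sketch of the deadlock described above, for illustration only. The type endpt_t and the helpers reset_all(), can_send_broadcast() and can_send_unicast() are hypothetical and are not part of ipoib_port.c; the sketch only assumes that a dlid of zero means "unresolved" and that broadcast traffic does not depend on a per-endpoint dlid.</DIV>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified endpoint: only the field relevant here. */
typedef struct {
    uint16_t dlid; /* destination LID; 0 is treated as "not resolved" */
} endpt_t;

/* Models the effect attributed to __endpt_mgr_reset_all():
 * every cached endpoint ends up with dlid == 0. */
static void reset_all(endpt_t *endpts, int count)
{
    for (int i = 0; i < count; i++)
        endpts[i].dlid = 0;
}

/* A broadcast ARP request goes to the broadcast group, so in this
 * model it never needs a per-endpoint dlid. */
static bool can_send_broadcast(void)
{
    return true;
}

/* A unicast ARP response needs a valid destination LID. */
static bool can_send_unicast(const endpt_t *endpt)
{
    return endpt->dlid != 0;
}
```

<DIV>In this model, after reset_all() the broadcast request can still be sent, but the unicast response cannot, which is the circular dependency from 1.2.</DIV>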
<DIV><FONT size=2><FONT size=3><U>Proposed Solution:</U></FONT></FONT></DIV>
<DIV><FONT size=2><FONT size=3>2.1. When an ARP request is received and the endpoint's dlid is
equal to zero, delete this endpoint and recreate it.<BR>2.2 In order to
reinitialize the ARP table (and thus generate ARP requests), notify NDIS of link
down/link up.</FONT></FONT><SPAN class=294154711-17092008><FONT face=Arial
color=#0000ff size=2> </FONT></SPAN></DIV>
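<DIV>Step 2.1 can be sketched roughly as follows. This is a simplified user-mode model, not the driver code: endpt_t and refresh_on_arp() are hypothetical names, and taking the new dlid from the source LID of the received ARP request is an assumption made for the example.</DIV>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified endpoint record. */
typedef struct endpt {
    uint64_t mac;  /* key used to look the endpoint up */
    uint16_t dlid; /* destination LID; 0 is treated as stale */
} endpt_t;

/* Called when an ARP request arrives from (mac, src_lid).
 * If the cached endpoint is stale (dlid == 0), it is destroyed and a
 * fresh one is created; a valid cached endpoint is returned unchanged.
 * Returns NULL only if allocation fails. */
static endpt_t *refresh_on_arp(endpt_t *cached, uint64_t mac, uint16_t src_lid)
{
    if (cached != NULL && cached->dlid == 0) {
        /* Out of date: destroy the endpoint so that it gets replaced. */
        free(cached);
        cached = NULL;
    }
    if (cached == NULL) {
        cached = malloc(sizeof(*cached));
        if (cached == NULL)
            return NULL;
        cached->mac = mac;
        cached->dlid = src_lid; /* assumed source of the new LID */
    }
    return cached;
}
```

<DIV>With a valid dlid restored this way, the unicast ARP response can be sent again.</DIV>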
<DIV><SPAN class=294154711-17092008> </SPAN></DIV>
<DIV><FONT face=Arial><FONT color=#0000ff><FONT size=2><SPAN
class=294154711-17092008><SPAN class=294154711-17092008>Does the driver
receive the PnP events IB_PNP_SM_CHANGE or IB_PNP_LID_CHANGE?</SPAN>
</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT color=#0000ff><FONT size=2>Isn't
__endpt_mgr_reset_all() called in the context of ipoib_port_down(),
which <SPAN class=294154711-17092008>is activated </SPAN>after
the OS <SPAN class=294154711-17092008>is notified </SPAN>of
MEDIA_DISCONNECT<SPAN
class=294154711-17092008>,</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT color=#0000ff><FONT size=2><SPAN
class=294154711-17092008>so that the ARP table will
be reset?</SPAN></FONT></FONT></FONT></DIV>
<DIV><SPAN class=294154711-17092008><FONT face=Arial color=#0000ff
size=2> </FONT></SPAN></DIV>
<DIV><SPAN class=294154711-17092008> </SPAN><U>Checklist (we executed
these checks on an 8-node cluster)</U></DIV>
<DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>1. Run opensm and
validate that ping works</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>2. Kill opensm.
Ping should still work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>3. Restart opensm
on the same node. Ping should work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>4. Rerun
#2</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>5. Restart opensm
on another node. Ping should work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>6. Run another
instance of opensm, such that the previous instance will switch to "standby
mode". Ping should work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial
size=2>OR:</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>6A. Run another
instance of opensm, such that the previous instance remains in "active
mode". </FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>Then kill the active
instance. The "standby" instance should enter active mode, and ping should
keep working</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>7. Run 2 different
instances of opensm. During the run, clear the guid2lid file and kill the active
instance.</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>The passive instance
will become active and ping should still work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>8. Change the guid2lid
file (change LIDs only) and restart opensm. Ping should
work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>IMPORTANT!
Validate here that the IPoIB addresses didn't change but the LIDs did, so that pings
are sent to the right host</FONT></SPAN></DIV></DIV>
<DIV><FONT face=Arial color=#0000ff size=2></FONT> </DIV>
<DIV><FONT face=Arial color=#0000ff size=2></FONT> </DIV>
<DIV>Fix to "Avoid the SM" </DIV>
<DIV>Signed-off-by: Alexander Naslednikov (<A
href="mailto:xalex@mellanox.co.il">xalex at mellanox.co.il</A>)</DIV>
<DIV>===================================================================<BR>---
ipoib_port.c (revision 3149)<BR>+++
ipoib_port.c (working copy)<BR>@@
-2357,6 +2357,11 @@<BR>
/* Out of date! Destroy the
endpoint and replace it. */<BR>
__endpt_mgr_remove( p_port, *pp_src
);<BR>
*pp_src =
NULL;<BR>+
}<BR>+
else if ( ! ((*pp_src)->dlid))
{<BR>+
/* Out of date! Destroy the
endpoint and replace it. */<BR>+
__endpt_mgr_remove( p_port, *pp_src
);<BR>+
*pp_src =
NULL;<BR>
}<BR>
else if(
ipoib_is_voltaire_router_gid( &(*pp_src)->dgid )
)<BR>
{<BR>@@ -4153,10 +4158,25
@@<BR> cl_qlist_init( &mc_list
);<BR> <BR>
cl_obj_lock( &p_port->obj
);<BR>+<BR> /* Wait for all readers
to complete. */<BR> while(
p_port->endpt_rdr )<BR>
;<BR>+#if
0<BR>+
__endpt_mgr_remove_all(p_port);<BR>+#else<SPAN class=294154711-17092008><FONT
face=Arial color=#0000ff size=2> </FONT></SPAN></DIV>
<DIV><SPAN class=294154711-17092008> </SPAN><BR><SPAN
class=294154711-17092008><FONT face=Arial color=#0000ff
size=2>Confusing. Why is this call commented
out?</FONT></SPAN></DIV>
<DIV><SPAN class=294154711-17092008></SPAN><SPAN
class=294154711-17092008><FONT face=Arial color=#0000ff size=2> Isn't it
called in the context of port destruction rather than endpoint
destruction?</FONT></SPAN></DIV>
<DIV><SPAN class=294154711-17092008><FONT face=Arial color=#0000ff
size=2></FONT></SPAN><FONT face=Arial color=#0000ff
size=2></FONT><BR>+ NdisMIndicateStatus(
p_port->p_adapter->h_adapter,<BR>+
NDIS_STATUS_MEDIA_DISCONNECT, NULL,
0 );<BR>+ NdisMIndicateStatusComplete(
p_port->p_adapter->h_adapter
);<BR>+ <BR>+
NdisMIndicateStatus(
p_port->p_adapter->h_adapter,<BR>+
NDIS_STATUS_MEDIA_CONNECT, NULL, 0
);<BR>+ NdisMIndicateStatusComplete(
p_port->p_adapter->h_adapter
);<BR>+ <BR>+<BR>
if( p_port->p_local_endpt )<BR>
{<BR>
cl_qmap_remove_item(
&p_port->endpt_mgr.mac_endpts,<BR></DIV>
<DIV><FONT size=2><FONT face=Arial color=#0000ff
size=2></FONT></FONT> </DIV>
<DIV><FONT size=2><FONT size=3></FONT> </DIV>
<P><BR><BR>-----Original Message-----<BR>From: Alex
Naslednikov<BR>Sent: Sunday, September 14, 2008 5:32 PM<BR>To: 'Fab Tillier';
Tzachi Dar; Reuven Amitai; Leonid Keller<BR>Cc: Ishai Rabinovitz<BR>Subject:
RE: [ofw] Problem with "Avoid the SM" patch<BR><BR>I'd just like to summarize
everything we said before and to propose a temporary solution.<BR><BR>1. The
Problem<BR>1.1 After restarting opensm, __endpt_mgr_reset_all() sets
dlid==0 for the endpoints.<BR>Thus, an ARP should be sent in order to resume
normal communication.<BR>1.2 The ARP was indeed sent, and even received by the
remote side.<BR>But note that we send the ARP request by broadcast, while the ARP
response is always unicast, with an INVALID DLID.<BR>Thus, normal communication
can't be resumed without the ARP response, and the ARP response can't be sent without
a valid dlid.<BR><BR>So, in order to resolve it, here is our proposal. It's a
temporary solution only.<BR>Of course, it should be investigated on a large
cluster.<BR><BR>2. The solution<BR>2.1. When an ARP request is received and the endpoint's dlid is
equal to zero, delete this endpoint and recreate it.<BR>2.2 In order to
reinitialize the ARP table (and thus generate ARP requests), notify NDIS of link
down/link
up<BR><BR><BR>===================================================================<BR>---
ipoib_port.c (revision 3149)<BR>+++
ipoib_port.c (working copy)<BR>@@
-2357,6 +2357,11 @@<BR>
/* Out of date! Destroy the
endpoint and replace it. */<BR>
__endpt_mgr_remove( p_port, *pp_src
);<BR>
*pp_src =
NULL;<BR>+
}<BR>+
else if ( ! ((*pp_src)->dlid))
{<BR>+
/* Out of date! Destroy the
endpoint and replace it. */<BR>+
__endpt_mgr_remove( p_port, *pp_src
);<BR>+
*pp_src =
NULL;<BR>
}<BR>
else if(
ipoib_is_voltaire_router_gid( &(*pp_src)->dgid )
)<BR>
{<BR>@@ -4153,10 +4158,25
@@<BR> cl_qlist_init( &mc_list
);<BR> <BR>
cl_obj_lock( &p_port->obj
);<BR>+<BR> /* Wait for all readers
to complete. */<BR> while(
p_port->endpt_rdr )<BR>
;<BR>+#if
0<BR>+
__endpt_mgr_remove_all(p_port);<BR>+#else<BR><BR>+
NdisMIndicateStatus(
p_port->p_adapter->h_adapter,<BR>+
NDIS_STATUS_MEDIA_DISCONNECT, NULL,
0 );<BR>+ NdisMIndicateStatusComplete(
p_port->p_adapter->h_adapter
);<BR>+ <BR>+
NdisMIndicateStatus(
p_port->p_adapter->h_adapter,<BR>+
NDIS_STATUS_MEDIA_CONNECT, NULL, 0
);<BR>+ NdisMIndicateStatusComplete(
p_port->p_adapter->h_adapter
);<BR>+ <BR>+
//
IPOIB_PRINT( TRACE_LEVEL_INFORMATION,
IPOIB_DBG_INIT,<BR>+
//
("Link DOWN!\n") );<BR>+<BR> if(
p_port->p_local_endpt )<BR>
{<BR>
cl_qmap_remove_item(
&p_port->endpt_mgr.mac_endpts,<BR><BR><BR>XaleX<BR>-----Original
Message-----<BR>From: Fab Tillier [<A
href="mailto:ftillier@windows.microsoft.com">mailto:ftillier@windows.microsoft.com</A>]<BR>Sent:
Friday, September 12, 2008 6:00 PM<BR>To: Tzachi Dar; Alex Naslednikov; Reuven
Amitai; Leonid Keller<BR>Cc: ofw@lists.openfabrics.org<BR>Subject: RE: [ofw]
Problem with "Avoid the SM" patch<BR><BR>> Hi Fab,<BR>><BR>> Here is
some more information about the issue and one question.<BR>> There are
currently two problems that we see. Both problems start<BR>> after we
restart opensm.<BR>><BR>> 1) After we restart opensm, ARP messages don't
pass. The main reason we<BR>> saw so far is that they are sent with the
wrong addresses. Although we<BR>> still haven't found exactly why that is,
we will soon find that and<BR>> fix it.<BR><BR>Is it just a problem with
the ARP responses, or the requests too? The requests should be getting
sent to the broadcast group, so they should work. The response is a
unicast packet, so could be getting lost due to the dlid == 0
issue.<BR><BR>> 2) This is the more problematic issue: After we restart
opensm<BR>> __endpt_mgr_reset_all is being called. As a result all our
endpoint<BR>> cache is cleared. Please note that Windows is not aware of
what<BR>> happened and therefore it doesn't generate ARPs but rather sends
unicast packets.<BR>> For these packets we don't have enough information in
the endpoint and<BR>> therefore we can't send them correctly. In the past
for these packets<BR>> we used to do a query on the SM, but we don't want to
do that anymore.<BR>> So my question is this: how do we want to solve this
issue?<BR>> 1) Wait for the Windows ARP table to flush? Probably too
long.<BR>> 2) Send queries to the SM? We wanted to avoid that.<BR>> 3)
Don't clear the endpoints when opensm is being
restarted?<BR>> Seems that we might use old data.<BR>>
4) Send ARPs ourselves? Probably the best solution but
requires<BR>> some more work.<BR>><BR>> What do you
think?<BR><BR>I think the key here might be to keep *some* of the SM
interaction - effectively put a path record cache in IPoIB. If we kept
the existing path record query logic in IPoIB the issues with SM restart go
away. We would then need to change how the MAC_TO_PATH IOCTL behaved,
allowing requests to be queued and completed asynchronously. The IOCTL
handler would look up the endpoint, and if no path was resolved would issue
the path query if it wasn't in progress already. This would require
queueing the IRPs and tracking them so that a path query completion would
complete any pending IRPs.<BR><BR>Probably the simplest way to handle this
would be to queue the IRPs in the IBAT layer when they come in, and then try
to flush as many IRPs from the queue (look to see if the endpoints have valid
paths). Any endpoint that needs a path would have a query issued, and a
path query completion would again try to flush as many IRPs from the IBAT
queue as possible.<BR><BR>The main advantage of this is that real path
records would be used for unicast traffic as well as IBAT clients, so that the
packet rate, MTU, and so forth are set optimally, but the cache would be
updated whenever an ARP response is received, remaining in sync with the
network stack.<BR><BR>I hope the SM would not have a problem with path queries
like this - the query load would grow as the square of the number of nodes, rather
than the square of the number of
cores.<BR><BR>-Fab<BR></P></BLOCKQUOTE></FONT></BODY></HTML>