<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE></TITLE>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.6000.16640" name=GENERATOR></HEAD>
<BODY><!-- Converted from text/plain format -->
<P><FONT size=2><FONT size=3>Reposting this issue to the whole
community.<BR><U>Current Problem:<BR></U>The "Avoid the SM" patch causes IPoIB
to invalidate its endpoint cache incorrectly. That is, after opensm is restarted,
IPoIB communication stops working (including pings).</FONT></FONT><FONT
size=2><FONT size=3><BR><U>Detailed Description:</U><BR>1.1 After opensm is
restarted, __endpt_mgr_reset_all() sets dlid==0 for the endpoints.<BR>Thus, an ARP
exchange is needed in order to resume normal communication.<BR>1.2 The ARP request
was indeed sent, and even received by the remote side.<BR>But note that the ARP
request is sent by broadcast, while the ARP response is always unicast, and here
it is sent with an INVALID DLID.<BR>Thus, normal communication can't be resumed
without an ARP response, and the ARP response can't be sent without a valid
dlid.</FONT></FONT></P>
<DIV><FONT size=2><FONT size=3><U>Proposed Solution:</U></FONT></FONT></DIV>
<DIV><FONT size=2><FONT size=3>2.1 When an ARP request is received from an
endpoint whose dlid is equal to zero, delete this endpoint and recreate it.<BR>2.2
In order to reinitialize the ARP table (and thus generate ARP requests), indicate
link down followed by link up to NDIS.</FONT></FONT></DIV>
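<DIV><FONT size=2><FONT size=3>To make the deadlock in 1.2 and the fix in 2.1
concrete, here is a minimal user-space sketch. It is a toy model only: every
toy_* name is hypothetical and none of it is the ipoib_port.c code. It also
assumes that the sender's LID can be taken from the received ARP request, which
the description above does not state.</FONT></FONT></DIV>
<PRE>
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: hypothetical types, not the ipoib_port.c structures. */
#define TOY_MAX_ENDPTS 8

typedef struct {
    bool     in_use;
    uint8_t  mac[6];
    uint16_t dlid;   /* 0 means "no valid destination LID" */
} toy_endpt_t;

typedef struct {
    toy_endpt_t endpts[TOY_MAX_ENDPTS];
} toy_port_t;

/* Look up an endpoint by MAC; returns NULL if it is not cached. */
toy_endpt_t *toy_find(toy_port_t *port, const uint8_t mac[6])
{
    for (size_t i = 0; i < TOY_MAX_ENDPTS; i++) {
        toy_endpt_t *e = &port->endpts[i];
        if (e->in_use && memcmp(e->mac, mac, 6) == 0)
            return e;
    }
    return NULL;
}

/* Add an endpoint in the first free slot; returns NULL if the table is full. */
toy_endpt_t *toy_insert(toy_port_t *port, const uint8_t mac[6], uint16_t dlid)
{
    for (size_t i = 0; i < TOY_MAX_ENDPTS; i++) {
        toy_endpt_t *e = &port->endpts[i];
        if (!e->in_use) {
            e->in_use = true;
            memcpy(e->mac, mac, 6);
            e->dlid = dlid;
            return e;
        }
    }
    return NULL;
}

/* Models the effect described in 1.1: entries stay cached, but dlid becomes 0. */
void toy_reset_all(toy_port_t *port)
{
    for (size_t i = 0; i < TOY_MAX_ENDPTS; i++)
        if (port->endpts[i].in_use)
            port->endpts[i].dlid = 0;
}

/* A unicast send (such as an ARP response) needs a cached, non-zero dlid. */
bool toy_can_send_unicast(toy_port_t *port, const uint8_t mac[6])
{
    toy_endpt_t *e = toy_find(port, mac);
    return e != NULL && e->dlid != 0;
}

/* Models receipt of a broadcast ARP request from `mac`.
 * ASSUMPTION: src_lid is the sender's LID as seen on the received packet.
 * With apply_fix, a cached endpoint whose dlid is 0 is removed and recreated. */
void toy_recv_arp_request(toy_port_t *port, const uint8_t mac[6],
                          uint16_t src_lid, bool apply_fix)
{
    assert(src_lid != 0);
    toy_endpt_t *e = toy_find(port, mac);
    if (e != NULL && e->dlid == 0 && apply_fix) {
        e->in_use = false;   /* out of date: destroy the endpoint */
        e = NULL;
    }
    if (e == NULL)
        toy_insert(port, mac, src_lid);
}
```
</PRE>
<DIV><FONT size=2><FONT size=3>In this model, after toy_reset_all() a call to
toy_recv_arp_request() with apply_fix set to false leaves the stale entry in
place, so toy_can_send_unicast() stays false. With apply_fix set to true the
entry is recreated with the new LID and the unicast reply becomes
possible.</FONT></FONT></DIV>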
<DIV> </DIV>
<DIV><U>Checklist (we executed these checks on an 8-node cluster)</U></DIV>
<DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>1. Run opensm and
validate that ping works</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>2. Kill opensm. Ping
should still work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>3. Restart opensm on
the same node. Ping should work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>4. Rerun
#2</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>5. Restart opensm on
another node. Ping should work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>6. Run another
instance of opensm, such that the previous instance will switch to "standby
mode". Ping should work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial
size=2>OR:</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>6A Run another
instance of opensm, such that the previous instance will remain in "active
mode". </FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>Then kill the active
instance. The "standby" instance should enter active mode, and ping should
keep working.</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>7. Run 2 different
instances of opensm. While they are running, clear the guid2lid file and kill the
active instance.</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>The passive instance
will become active and ping should still work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>8. Change the guid2lid
file (change lids only) and restart opensm. Ping should work</FONT></SPAN></DIV>
<DIV><SPAN class=984052407-17092008><FONT face=Arial size=2>IMPORTANT! Validate
here that the IPoIB addresses didn't change but the lids did, so that pings are
sent to the right host</FONT></SPAN></DIV></DIV>
<DIV><FONT face=Arial color=#0000ff size=2></FONT> </DIV>
<DIV><FONT face=Arial color=#0000ff size=2></FONT> </DIV>
<DIV>Fix to "Avoid the SM" </DIV>
<DIV>Signed-off-by: Alexander Naslednikov (<A
href="mailto:xalex@mellanox.co.il">xalex at mellanox.co.il</A>)</DIV>
<DIV>===================================================================<BR>
--- ipoib_port.c (revision 3149)<BR>
+++ ipoib_port.c (working copy)<BR>
@@ -2357,6 +2357,11 @@<BR>
 /* Out of date! Destroy the endpoint and replace it. */<BR>
 __endpt_mgr_remove( p_port, *pp_src );<BR>
 *pp_src = NULL;<BR>
+ }<BR>
+ else if ( ! ((*pp_src)->dlid)) {<BR>
+ /* Out of date! Destroy the endpoint and replace it. */<BR>
+ __endpt_mgr_remove( p_port, *pp_src );<BR>
+ *pp_src = NULL;<BR>
 }<BR>
 else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )<BR>
 {<BR>
@@ -4153,10 +4158,25 @@<BR>
 cl_qlist_init( &mc_list );<BR>
 <BR>
 cl_obj_lock( &p_port->obj );<BR>
+<BR>
 /* Wait for all readers to complete. */<BR>
 while( p_port->endpt_rdr )<BR>
 ;<BR>
+#if 0<BR>
+ __endpt_mgr_remove_all(p_port);<BR>
+#else<BR>
<BR>
+ NdisMIndicateStatus( p_port->p_adapter->h_adapter,<BR>
+ NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );<BR>
+ NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );<BR>
+ <BR>
+ NdisMIndicateStatus( p_port->p_adapter->h_adapter,<BR>
+ NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );<BR>
+ NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );<BR>
+ <BR>
+<BR>
 if( p_port->p_local_endpt )<BR>
 {<BR>
 cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,<BR></DIV>
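<DIV>The second hunk relies on point 2.2: indicating media disconnect and then
media connect is expected to make the network stack drop its ARP entries, so
that the next send triggers a broadcast ARP request instead of a unicast packet.
The sketch below is a toy model of that expectation only. The toy_host_* names
are hypothetical, and the assumption that the stack discards its ARP entries on
link down is the proposal's, not something verified here.</DIV>
<PRE>
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: a hypothetical host-side ARP cache, not the Windows stack. */
#define TOY_ARP_SLOTS 4

typedef struct {
    bool     valid;
    uint32_t ip;
} toy_arp_entry_t;

typedef struct {
    toy_arp_entry_t cache[TOY_ARP_SLOTS];
    unsigned        arp_requests_sent;  /* broadcast ARP requests emitted */
    unsigned        unicasts_sent;      /* packets sent straight from the cache */
} toy_host_t;

/* True if the host already has an ARP entry for ip. */
bool toy_host_knows(const toy_host_t *h, uint32_t ip)
{
    for (size_t i = 0; i < TOY_ARP_SLOTS; i++)
        if (h->cache[i].valid && h->cache[i].ip == ip)
            return true;
    return false;
}

/* Record an ARP entry for ip (as if an ARP response had arrived). */
void toy_host_learn(toy_host_t *h, uint32_t ip)
{
    if (toy_host_knows(h, ip))
        return;
    for (size_t i = 0; i < TOY_ARP_SLOTS; i++) {
        if (!h->cache[i].valid) {
            h->cache[i].valid = true;
            h->cache[i].ip = ip;
            return;
        }
    }
}

/* Send to ip: a cached entry means a unicast packet, otherwise the host
 * first has to broadcast an ARP request. */
void toy_host_send(toy_host_t *h, uint32_t ip)
{
    if (toy_host_knows(h, ip))
        h->unicasts_sent++;
    else
        h->arp_requests_sent++;
}

/* ASSUMPTION: link down followed by link up discards every ARP entry. */
void toy_host_link_bounce(toy_host_t *h)
{
    for (size_t i = 0; i < TOY_ARP_SLOTS; i++)
        h->cache[i].valid = false;
}
```
</PRE>
<DIV>Without toy_host_link_bounce() the host keeps sending unicast packets to an
address it believes it has resolved, which matches the behaviour described in
the quoted thread. After the bounce, the first toy_host_send() produces an ARP
request.</DIV>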
<DIV><FONT size=2><FONT size=3></FONT></FONT> </DIV>
<DIV><FONT size=2><FONT size=3></FONT> </DIV>
<P><BR><BR>-----Original
Message-----<BR>From: Alex Naslednikov<BR>Sent: Sunday, September 14, 2008 5:32
PM<BR>To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller<BR>Cc: Ishai
Rabinovitz<BR>Subject: RE: [ofw] Problem with "Avoid the SM" patch<BR><BR>I'd
just like to summarize all we said before and to propose a temporary
solution.<BR><BR>1. The Problem<BR>1.1 After opensm is restarted,
__endpt_mgr_reset_all() sets dlid==0 for the endpoints.<BR>Thus, an ARP exchange
is needed in order to resume normal communication.<BR>1.2 The ARP request was
indeed sent, and even received by the remote side.<BR>But note that the ARP
request is sent by broadcast, while the ARP response is always unicast, and here
it is sent with an INVALID DLID.<BR>Thus,
normal communication can't be resumed without an ARP response, and the ARP
response can't be sent without a valid dlid.<BR><BR>So, in order to resolve it,
here is our proposal. It's a temporary solution only.<BR>Of course, it should be
investigated on a large cluster.<BR><BR>2. The solution<BR>2.1 When an ARP
request is received from an endpoint whose dlid is equal to zero, delete this
endpoint and recreate it.<BR>2.2
In order to reinitialize the ARP table (and thus generate ARP requests), indicate
link down followed by link up to
NDIS.<BR><BR><BR>===================================================================<BR>---
ipoib_port.c (revision 3149)<BR>
+++ ipoib_port.c (working copy)<BR>
@@ -2357,6 +2357,11 @@<BR>
 /* Out of date! Destroy the endpoint and replace it. */<BR>
 __endpt_mgr_remove( p_port, *pp_src );<BR>
 *pp_src = NULL;<BR>
+ }<BR>
+ else if ( ! ((*pp_src)->dlid)) {<BR>
+ /* Out of date! Destroy the endpoint and replace it. */<BR>
+ __endpt_mgr_remove( p_port, *pp_src );<BR>
+ *pp_src = NULL;<BR>
 }<BR>
 else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )<BR>
 {<BR>
@@ -4153,10 +4158,25 @@<BR>
 cl_qlist_init( &mc_list );<BR>
 <BR>
 cl_obj_lock( &p_port->obj );<BR>
+<BR>
 /* Wait for all readers to complete. */<BR>
 while( p_port->endpt_rdr )<BR>
 ;<BR>
+#if 0<BR>
+ __endpt_mgr_remove_all(p_port);<BR>
+#else<BR>
<BR>
+ NdisMIndicateStatus( p_port->p_adapter->h_adapter,<BR>
+ NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );<BR>
+ NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );<BR>
+ <BR>
+ NdisMIndicateStatus( p_port->p_adapter->h_adapter,<BR>
+ NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );<BR>
+ NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );<BR>
+ <BR>
+ // IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,<BR>
+ // ("Link DOWN!\n") );<BR>
+<BR>
 if( p_port->p_local_endpt )<BR>
 {<BR>
 cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,<BR>
<BR>
<BR>
XaleX<BR>-----Original
Message-----<BR>From: Fab Tillier [<A
href="mailto:ftillier@windows.microsoft.com">mailto:ftillier@windows.microsoft.com</A>]<BR>Sent:
Friday, September 12, 2008 6:00 PM<BR>To: Tzachi Dar; Alex Naslednikov; Reuven
Amitai; Leonid Keller<BR>Cc: ofw@lists.openfabrics.org<BR>Subject: RE: [ofw]
Problem with "Avoid the SM" patch<BR><BR>> Hi Fab,<BR>><BR>> Here is
some more information about the issue and one question.<BR>> There are
currently two problems that we see. Both problems start<BR>> after we restart
opensm.<BR>><BR>> 1) After we restart opensm arp messages don't pass. The
main reason we<BR>> saw so far is that they are sent with the wrong
addresses. Although we<BR>> haven't still found exactly why that is, we will
soon find that and<BR>> fix it.<BR><BR>Is it just a problem with the ARP
responses, or the requests too? The requests should be getting sent to the
broadcast group, so they should work. The response is a unicast packet, so
could be getting lost due to the dlid == 0 issue.<BR><BR>> 2) This is the
more problematic issue: After we restart opensm<BR>> __endpt_mgr_reset_all is
being called. As a result all our endpoint<BR>> cache is cleared. Please note
that windows is not aware of what<BR>> happened and therefore it doesn't
generate arps but rather sends unicast packets.<BR>> For these packets we
don't have enough information in the endpoint and<BR>> therefore we can't
send them correctly. In the past for these packets<BR>> we used to do a query
on the SM, but we don't want to do that anymore.<BR>> So my question is this,
how do we want to solve this issue:<BR>> 1) Wait for the windows arp table to
flush? Probably too long.<BR>> 2) Send queries to the SM? We wanted to avoid
that.<BR>> 3) Don't clear the endpoints when opensm is being
restarted?<BR>> Seems that we might use old data.<BR>>
4) Send arps by ourselves? Probably the best solution but
requires<BR>> some more work.<BR>><BR>> What do you
think?<BR><BR>I think the key here might be to keep *some* of the SM interaction
- effectively put a path record cache in IPoIB. If we kept the existing
path record query logic in IPoIB the issues with SM restart go away. We
would then need to change how the MAC_TO_PATH IOCTL behaved, allowing requests
to be queued and completed asynchronously. The IOCTL handler would look up
the endpoint, and if no path was resolved would issue the path query if it
wasn't in progress already. This would require queueing the IRPs and
tracking them so that a path query completion would complete any pending
IRPs.<BR><BR>Probably the simplest way to handle this would be to queue the IRPs
in the IBAT layer when they come in, and then try to flush as many IRPs from the
queue (look to see if the endpoints have valid paths). Any endpoint that
needs a path would have a query issued, and a path query completion would again
try to flush as many IRPs from the IBAT queue as possible.<BR><BR>The main
advantage of this is that real path records would be used for unicast traffic
as well as IBAT clients, so that the packet rate, MTU, and so forth are set
optimally, but the cache would be updated whenever an ARP response is received,
remaining in sync with the network stack.<BR><BR>I hope the SM would not have a
problem with path queries like this - the query load would grow as the square of
the number of nodes, rather than the square of the number of
cores.<BR><BR>-Fab<BR></P></FONT></BODY></HTML>