[ofw] [IPoIB] Problem with "Avoid the SM" patch

Alex Estrin alex.estrin at qlogic.com
Wed Sep 17 05:13:46 PDT 2008


Please see my questions inline.
Thanks,
Alex.


________________________________

	From: ofw-bounces at lists.openfabrics.org
[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Alex Naslednikov
	Sent: Wednesday, September 17, 2008 4:47 AM
	To: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
	Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch 
	
	

	Reposting this issue to the whole community.
	Current Problem:
	The "Avoid the SM" patch causes IPoIB to clean its endpoint cache incorrectly.
	As a result, IPoIB communication (including pings) stops working after opensm is restarted.
	Detailed Description:
	1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid to 0 for the endpoints.
	Thus, an ARP must be sent in order to resume normal communication.
	1.2 The ARP request was indeed sent, and was even received by the remote side.
	Note, however, that the ARP request is sent by broadcast, while the ARP response is always unicast, and here it is sent with an INVALID DLID.
	Thus, normal communication can't be resumed without an ARP response, and the ARP response can't be sent without a valid dlid.

	Proposed Solution:
	2.1 When an ARP request is received and the endpoint's dlid is equal to zero, delete this endpoint and recreate it.
	2.2 In order to initialize the ARP table (and thus generate ARP requests), indicate link down followed by link up to NDIS.
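
A minimal user-space sketch of step 2.1, for illustration only. The types and helpers here (endpt_t, endpt_cache_t, endpt_find, endpt_insert, endpt_reset_all, on_arp_received) are hypothetical stand-ins and not the driver's real structures, and the sketch assumes the sender's LID can be learned from the received ARP packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an IPoIB endpoint cache entry. */
typedef struct endpt {
    uint64_t mac;    /* key: remote MAC address */
    uint16_t dlid;   /* destination LID; 0 means "reset, not yet re-resolved" */
    int      in_use; /* non-zero when the slot is occupied */
} endpt_t;

#define MAX_ENDPTS 8

typedef struct endpt_cache {
    endpt_t slots[MAX_ENDPTS];
} endpt_cache_t;

/* Find an endpoint by MAC, or return NULL if none is cached. */
static endpt_t *endpt_find(endpt_cache_t *c, uint64_t mac)
{
    assert(c != NULL);
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        if (c->slots[i].in_use && c->slots[i].mac == mac)
            return &c->slots[i];
    return NULL;
}

/* Insert a fresh endpoint; returns NULL when the cache is full. */
static endpt_t *endpt_insert(endpt_cache_t *c, uint64_t mac, uint16_t dlid)
{
    assert(c != NULL);
    for (size_t i = 0; i < MAX_ENDPTS; i++) {
        if (!c->slots[i].in_use) {
            c->slots[i] = (endpt_t){ .mac = mac, .dlid = dlid, .in_use = 1 };
            return &c->slots[i];
        }
    }
    return NULL;
}

/* Models the effect described for __endpt_mgr_reset_all(): zero every dlid. */
static void endpt_reset_all(endpt_cache_t *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        c->slots[i].dlid = 0;
}

/*
 * Step 2.1: on an incoming ARP from src_mac, drop a cached endpoint whose
 * dlid is 0 and recreate it with the LID learned from the packet (src_lid).
 */
static endpt_t *on_arp_received(endpt_cache_t *c, uint64_t src_mac,
                                uint16_t src_lid)
{
    endpt_t *e = endpt_find(c, src_mac);
    if (e != NULL && e->dlid == 0) {
        e->in_use = 0;  /* stale entry: remove it ... */
        e = NULL;
    }
    if (e == NULL)
        e = endpt_insert(c, src_mac, src_lid);  /* ... and recreate it */
    return e;
}
```

In this model, an endpoint that was zeroed by endpt_reset_all() gets a valid dlid again as soon as an ARP from that peer arrives, which is what allows the unicast ARP response to be sent.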
	 
	Does the driver receive the PnP events IB_PNP_SM_CHANGE or IB_PNP_LID_CHANGE?
	Isn't __endpt_mgr_reset_all() called in the context of ipoib_port_down(), which runs after the OS has been notified of MEDIA_DISCONNECT,
	so that the ARP table will be reset?
	 
	Checklist (we executed these checks on an 8-node cluster)
	1. Run opensm and validate that ping works.
	2. Kill opensm. Ping should still work.
	3. Restart opensm on the same node. Ping should work.
	4. Rerun #2.
	5. Restart opensm on another node. Ping should work.
	6. Run another instance of opensm, such that the previous instance switches to "standby mode". Ping should work.
	OR:
	6A. Run another instance of opensm, such that the previous instance remains in "active mode".
	Then kill the active instance. The "standby" instance should enter active mode, and ping should keep working.
	7. Run 2 different instances of opensm. During the run, clear the guid2lid file and kill the active instance.
	The passive instance will become active, and ping should still work.
	8. Change the guid2lid file (change lids only) and restart opensm. Ping should work.
	IMPORTANT! Validate here that the IPoIB addresses didn't change but the lids did, so that pings are sent to the right host.
	 
	 
	Fix to "Avoid the SM" 
	signed-off by: Alexander Naslednikov (xalex at mellanox.co.il)
	
===================================================================
	--- ipoib_port.c        (revision 3149)
	+++ ipoib_port.c        (working copy)
	@@ -2357,6 +2357,11 @@
	                        /* Out of date!  Destroy the endpoint and replace it. */
	                        __endpt_mgr_remove( p_port, *pp_src );
	                        *pp_src = NULL;
	+               }
	+               else if ( ! ((*pp_src)->dlid)) {
	+                       /* Out of date!  Destroy the endpoint and replace it. */
	+                       __endpt_mgr_remove( p_port, *pp_src );
	+                       *pp_src = NULL;
	                }
	                else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
	                {
	@@ -4153,10 +4158,25 @@
	        cl_qlist_init( &mc_list );
	       
	        cl_obj_lock( &p_port->obj );
	+
	        /* Wait for all readers to complete. */
	        while( p_port->endpt_rdr )
	                ;
	+#if 0
	+               __endpt_mgr_remove_all(p_port);
	+#else 
	 
	Confusing. Why is this call commented out?
	Isn't it called in the port-destroy context rather than the endpoint-destroy context?
	
	+       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
	+               NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
	+       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
	+      
	+       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
	+               NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
	+       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
	+      
	+
	        if( p_port->p_local_endpt )
	        {
	                cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
	
	 
	 

	
	
	-----Original Message-----
	From: Alex Naslednikov
	Sent: Sunday, September 14, 2008 5:32 PM
	To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
	Cc: Ishai Rabinovitz
	Subject: RE: [ofw] Problem with "Avoid the SM" patch
	
	I'd just like to summarize everything we said before and to propose a temporary solution.
	
	1. The Problem
	1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid to 0 for the endpoints.
	Thus, an ARP must be sent in order to resume normal communication.
	1.2 The ARP request was indeed sent, and was even received by the remote side.
	Note, however, that the ARP request is sent by broadcast, while the ARP response is always unicast, and here it is sent with an INVALID DLID.
	Thus, normal communication can't be resumed without an ARP response, and the ARP response can't be sent without a valid dlid.
	
	To resolve this, here is our proposal. It is a temporary solution only.
	Of course, it should still be tested on a large cluster.
	
	2. The solution
	2.1 When an ARP request is received and the endpoint's dlid is equal to zero, delete this endpoint and recreate it.
	2.2 In order to initialize the ARP table (and thus generate ARP requests), indicate link down followed by link up to NDIS.
	
	
	
===================================================================
	--- ipoib_port.c        (revision 3149)
	+++ ipoib_port.c        (working copy)
	@@ -2357,6 +2357,11 @@
	                        /* Out of date!  Destroy the endpoint and replace it. */
	                        __endpt_mgr_remove( p_port, *pp_src );
	                        *pp_src = NULL;
	+               }
	+               else if ( ! ((*pp_src)->dlid)) {
	+                       /* Out of date!  Destroy the endpoint and replace it. */
	+                       __endpt_mgr_remove( p_port, *pp_src );
	+                       *pp_src = NULL;
	                }
	                else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
	                {
	@@ -4153,10 +4158,25 @@
	        cl_qlist_init( &mc_list );
	       
	        cl_obj_lock( &p_port->obj );
	+
	        /* Wait for all readers to complete. */
	        while( p_port->endpt_rdr )
	                ;
	+#if 0
	+               __endpt_mgr_remove_all(p_port);
	+#else
	
	+       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
	+               NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
	+       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
	+      
	+       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
	+               NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
	+       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
	+      
	+                       //      IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
	+                               //      ("Link DOWN!\n") );
	+
	        if( p_port->p_local_endpt )
	        {
	                cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
	
	
	XaleX
	-----Original Message-----
	From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
	Sent: Friday, September 12, 2008 6:00 PM
	To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
	Cc: ofw at lists.openfabrics.org
	Subject: RE: [ofw] Problem with "Avoid the SM" patch
	
	> Hi Fab,
	>
	> Here is some more information about the issue and one
	> question.
	> There are currently two problems that we see. Both problems start
	> after we restart opensm.
	>
	> 1) After we restart opensm, ARP messages don't pass. The main reason
	> we have seen so far is that they are sent with the wrong addresses.
	> Although we still haven't found exactly why that is, we will soon
	> find that and fix it.
	
	Is it just a problem with the ARP responses, or the requests
too?  The requests should be getting sent to the broadcast group, so
they should work.  The response is a unicast packet, so could be getting
lost due to the dlid == 0 issue.
	
	> 2) This is the more problematic issue: After we restart opensm
	> __endpt_mgr_reset_all is being called. As a result all our
	> endpoint cache is cleared. Please note that Windows is not aware of
	> what happened, and therefore it doesn't generate ARPs but rather
	> sends unicast packets.
	> For these packets we don't have enough information in the endpoint,
	> and therefore we can't send them correctly. In the past, for these
	> packets we used to do a query on the SM, but we don't want to do
	> that anymore.
	> So my question is this: how do we want to solve this issue?
	> 1) Wait for the Windows ARP table to flush? Probably too long.
	> 2) Send queries to the SM? We wanted to avoid that.
	> 3) Don't clear the endpoints when opensm is being restarted?
	>    Seems that we might use old data.
	> 4) Send ARPs ourselves? Probably the best solution, but it requires
	>    some more work.
	>
	> What do you think?
	
	I think the key here might be to keep *some* of the SM
interaction - effectively put a path record cache in IPoIB.  If we kept
the existing path record query logic in IPoIB the issues with SM restart
go away.  We would then need to change how the MAC_TO_PATH IOCTL
behaved, allowing requests to be queued and completed asynchronously.
The IOCTL handler would look up the endpoint, and if no path was
resolved would issue the path query if it wasn't in progress already.
This would require queueing the IRPs and tracking them so that a path
query completion would complete any pending IRPs.
	
	Probably the simplest way to handle this would be to queue the
IRPs in the IBAT layer when they come in, and then try to flush as many
IRPs from the queue (look to see if the endpoints have valid paths).
Any endpoint that needs a path would have a query issued, and a path
query completion would again try to flush as many IRPs from the IBAT
queue as possible.
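
A rough user-space sketch of the queue-and-flush idea described above, under stated assumptions. All names here (request_t standing in for an IRP, ibat_queue_t, flush_pending, submit_request, on_path_query_done) are hypothetical, the path query is modeled by a counter only, and locking and cancellation are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16
#define MAX_ENDPTS  8

/* Hypothetical endpoint: tracks whether a path is resolved or being queried. */
typedef struct endpt {
    uint64_t mac;
    int path_valid;
    int query_in_progress;
} endpt_t;

/* Hypothetical stand-in for a queued MAC_TO_PATH IRP. */
typedef struct request {
    uint64_t mac;
    int completed;
} request_t;

typedef struct ibat_queue {
    request_t *pending[MAX_PENDING];
    size_t     count;
    endpt_t    endpts[MAX_ENDPTS];
    size_t     n_endpts;
    unsigned   queries_issued;  /* models path queries sent to the SM */
} ibat_queue_t;

static endpt_t *find_endpt(ibat_queue_t *q, uint64_t mac)
{
    for (size_t i = 0; i < q->n_endpts; i++)
        if (q->endpts[i].mac == mac)
            return &q->endpts[i];
    return NULL;
}

/*
 * Complete every queued request whose endpoint has a valid path. For the
 * rest, issue one path query per endpoint unless one is already in flight.
 * Requests for unknown endpoints simply stay queued in this sketch.
 */
static void flush_pending(ibat_queue_t *q)
{
    size_t kept = 0;
    for (size_t i = 0; i < q->count; i++) {
        request_t *r = q->pending[i];
        endpt_t *e = find_endpt(q, r->mac);
        if (e != NULL && e->path_valid) {
            r->completed = 1;
            continue;
        }
        if (e != NULL && !e->query_in_progress) {
            e->query_in_progress = 1;
            q->queries_issued++;
        }
        q->pending[kept++] = r;  /* still waiting: keep it, in order */
    }
    q->count = kept;
}

/* Queue an incoming request, then try to flush. Returns -1 if the queue is full. */
static int submit_request(ibat_queue_t *q, request_t *r)
{
    assert(q != NULL && r != NULL);
    if (q->count == MAX_PENDING)
        return -1;
    r->completed = 0;
    q->pending[q->count++] = r;
    flush_pending(q);
    return 0;
}

/* Path query completion: mark the path valid and flush again. */
static void on_path_query_done(ibat_queue_t *q, uint64_t mac)
{
    endpt_t *e = find_endpt(q, mac);
    if (e != NULL) {
        e->path_valid = 1;
        e->query_in_progress = 0;
    }
    flush_pending(q);
}
```

In this model, two requests for the same unresolved endpoint trigger a single query, and both complete when that query finishes.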
	
	The main advantage of this is that real path records would be
used for unicast traffic as well as IBAT clients, so that the packet
rate, MTU, and so forth are set optimally, but the cache would be
updated whenever an ARP response is received, remaining in sync with the
network stack.
	
	I hope the SM would not have a problem with path queries like
this - the query load would grow as the square of number of nodes,
rather than the square of the number of cores.
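
To put rough numbers on that scaling, here is a small sketch. It assumes a simplified all-to-all model that is not stated in the thread: every querying endpoint resolves a path to every other one exactly once, and without a shared cache each core runs one process that issues its own queries. The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Ordered pairs among n endpoints: each one resolves a path to every other. */
static uint64_t all_to_all_queries(uint64_t n)
{
    assert(n <= UINT32_MAX);  /* keeps n * (n - 1) within 64 bits */
    return n == 0 ? 0 : n * (n - 1);
}

/* One shared path cache per node: the querying endpoints are the nodes. */
static uint64_t queries_with_node_cache(uint64_t nodes)
{
    return all_to_all_queries(nodes);
}

/* No shared cache: every core (one process per core) queries on its own. */
static uint64_t queries_per_core(uint64_t nodes, uint64_t cores_per_node)
{
    return all_to_all_queries(nodes * cores_per_node);
}
```

Under this model, an 8-node cluster with a per-node cache needs 8 * 7 = 56 queries, while the same cluster with an assumed 4 cores per node and no shared cache needs 32 * 31 = 992.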
	
	-Fab
	
