[ofw] [IPoIB] Problem with "Avoid the SM" patch

Alex Naslednikov xalex at mellanox.co.il
Wed Sep 17 02:00:29 PDT 2008


I spoke with Evgeny, a Mellanox opensm owner.
He says there was a similar attempt in Linux to avoid subnet manager
communication, but that feature still has unresolved problems and is
therefore disabled by default.
Also, according to Evgeny, the current problem, that opensm does not
scale (starting from 64x8 MPI jobs), arises because we try to connect
to opensm after the "PORT UP" event and not after the "IPoIB UP" event.
 
Fab, can you modify your patch to let the user select between the old
and the new solutions (i.e. with or without the "Avoid the SM" patch)?
 
Thanks,
XaleX
 

________________________________

From: ofw-bounces at lists.openfabrics.org
[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Alex Naslednikov
Sent: Wednesday, September 17, 2008 11:47 AM
To: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch 



Reposting this issue to the whole community.
Current Problem:
The "Avoid the SM" patch causes IPoIB to clean its cache incorrectly.
That is, after opensm is restarted, IPoIB communication stops working
(including pings).
Detailed Description:
1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid == 0
for the endpoints.
Thus, an ARP should be sent in order to resume normal communication.
1.2 The ARP request was indeed sent, and even received by the remote
side.
But (note this) the ARP request is sent by broadcast, while the ARP
response is always unicast, and here it carries an INVALID DLID.
Thus, normal communication can't be resumed without the ARP response,
and the ARP response can't be sent without a valid dlid.

Proposed Solution:
2.1 When an ARP request is received and the endpoint's dlid is equal to
zero, delete this endpoint and recreate it.
2.2 In order to reinitialize the ARP table (and thus generate ARP
requests), indicate link down / link up to NDIS.
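
As an illustration of step 2.1 only, here is a minimal user-mode sketch in C. It is not the driver code: the endpt_t and endpt_cache_t types and the cache_* helpers are hypothetical stand-ins for the real endpoint manager, and it assumes the sender's LID is available when the ARP request arrives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENDPTS 8

/* Hypothetical, simplified endpoint record (the real one has more fields). */
typedef struct endpt {
    uint64_t guid;    /* identifies the remote port */
    uint16_t dlid;    /* destination LID; 0 means "reset, not yet re-learned" */
    int      in_use;
} endpt_t;

typedef struct endpt_cache {
    endpt_t slots[MAX_ENDPTS];
} endpt_cache_t;

/* Look up an endpoint by GUID; returns NULL when it is not cached. */
static endpt_t *cache_find(endpt_cache_t *c, uint64_t guid)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        if (c->slots[i].in_use && c->slots[i].guid == guid)
            return &c->slots[i];
    return NULL;
}

/* Models the effect of __endpt_mgr_reset_all(): every cached dlid becomes 0. */
static void cache_reset_all(endpt_cache_t *c)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        if (c->slots[i].in_use)
            c->slots[i].dlid = 0;
}

/* Step 2.1: on an ARP request, an endpoint whose dlid is 0 is treated as
 * out of date, removed, and recreated from the LID the request came from.
 * Returns the usable endpoint, or NULL if the cache is full. */
static endpt_t *cache_on_arp_request(endpt_cache_t *c, uint64_t guid,
                                     uint16_t src_lid)
{
    assert(src_lid != 0);               /* a valid unicast LID is never 0 */

    endpt_t *e = cache_find(c, guid);
    if (e != NULL && e->dlid == 0) {
        e->in_use = 0;                  /* destroy the stale endpoint */
        e = NULL;
    }
    if (e == NULL) {
        for (size_t i = 0; i < MAX_ENDPTS; i++) {
            if (!c->slots[i].in_use) {
                c->slots[i].guid = guid;
                c->slots[i].dlid = src_lid;
                c->slots[i].in_use = 1;
                return &c->slots[i];
            }
        }
    }
    return e;
}
```

With this shape, the unicast ARP response has a valid dlid to use as soon as the broadcast request from the peer has been processed, which is the point of 2.1.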
 
Checklist (we executed these checks on an 8-node cluster):
1. Run opensm and validate that ping works.
2. Kill opensm. Ping should still work.
3. Restart opensm on the same node. Ping should work.
4. Rerun #2.
5. Restart opensm on another node. Ping should work.
6. Run another instance of opensm, such that the previous instance
switches to "standby mode". Ping should work.
OR:
6A. Run another instance of opensm, such that the previous instance
remains in "active mode".
Then kill the active instance. The "standby" instance should enter
active mode, and ping should keep working.
7. Run 2 different instances of opensm. During the run, clear the
guid2lid file and kill the active instance.
The passive instance will become active and ping should still work.
8. Change the guid2lid file (change lids only) and restart opensm. Ping
should work.
IMPORTANT! Validate here that the IPoIB addresses didn't change but the
lids did, so that pings are sent to the right host.
 
 
Fix to "Avoid the SM" 
Signed-off-by: Alexander Naslednikov (xalex at mellanox.co.il)
===================================================================
--- ipoib_port.c        (revision 3149)
+++ ipoib_port.c        (working copy)
@@ -2357,6 +2357,11 @@
                        /* Out of date!  Destroy the endpoint and replace it. */
                        __endpt_mgr_remove( p_port, *pp_src );
                        *pp_src = NULL;
+               }
+               else if ( ! ((*pp_src)->dlid)) {
+                       /* Out of date!  Destroy the endpoint and replace it. */
+                       __endpt_mgr_remove( p_port, *pp_src );
+                       *pp_src = NULL;
                }
                else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
                {
@@ -4153,10 +4158,25 @@
        cl_qlist_init( &mc_list );
       
        cl_obj_lock( &p_port->obj );
+
        /* Wait for all readers to complete. */
        while( p_port->endpt_rdr )
                ;
+#if 0
+               __endpt_mgr_remove_all(p_port);
+#else

+       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
+                                       NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
+       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
+      
+       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
+                                       NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
+       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
+      
+
        if( p_port->p_local_endpt )
        {
                cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,

 
 



-----Original Message-----
From: Alex Naslednikov
Sent: Sunday, September 14, 2008 5:32 PM
To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
Cc: Ishai Rabinovitz
Subject: RE: [ofw] Problem with "Avoid the SM" patch

I'd just like to summarize everything we have said so far and to
propose a temporary solution.

1. The Problem
1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid == 0
for the endpoints.
Thus, an ARP should be sent in order to resume normal communication.
1.2 The ARP request was indeed sent, and even received by the remote
side.
But (note this) the ARP request is sent by broadcast, while the ARP
response is always unicast, and here it carries an INVALID DLID.
Thus, normal communication can't be resumed without the ARP response,
and the ARP response can't be sent without a valid dlid.

So, in order to resolve this, here is our proposal. It is only a
temporary solution.
Of course, it should be investigated on a large cluster.

2. The solution
2.1 When an ARP request is received and the endpoint's dlid is equal to
zero, delete this endpoint and recreate it.
2.2 In order to reinitialize the ARP table (and thus generate ARP
requests), indicate link down / link up to NDIS.


===================================================================
--- ipoib_port.c        (revision 3149)
+++ ipoib_port.c        (working copy)
@@ -2357,6 +2357,11 @@
                        /* Out of date!  Destroy the endpoint and replace it. */
                        __endpt_mgr_remove( p_port, *pp_src );
                        *pp_src = NULL;
+               }
+               else if ( ! ((*pp_src)->dlid)) {
+                       /* Out of date!  Destroy the endpoint and replace it. */
+                       __endpt_mgr_remove( p_port, *pp_src );
+                       *pp_src = NULL;
                }
                else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
                {
@@ -4153,10 +4158,25 @@
        cl_qlist_init( &mc_list );
       
        cl_obj_lock( &p_port->obj );
+
        /* Wait for all readers to complete. */
        while( p_port->endpt_rdr )
                ;
+#if 0
+               __endpt_mgr_remove_all(p_port);
+#else

+       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
+                                       NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
+       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
+      
+       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
+                                       NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
+       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
+      
+                       //      IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
+                               //      ("Link DOWN!\n") );
+
        if( p_port->p_local_endpt )
        {
                cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,


XaleX
-----Original Message-----
From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
Sent: Friday, September 12, 2008 6:00 PM
To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
Cc: ofw at lists.openfabrics.org
Subject: RE: [ofw] Problem with "Avoid the SM" patch

> Hi Fab,
>
> Here is some more information about the issue and one question.
> There are currently two problems that we see. Both problems start
> after we restart opensm.
>
> 1) After we restart opensm, ARP messages don't pass. The main reason
> we have seen so far is that they are sent with the wrong addresses.
> Although we haven't yet found exactly why that is, we will soon find
> that and fix it.

Is it just a problem with the ARP responses, or the requests too?  The
requests should be getting sent to the broadcast group, so they should
work.  The response is a unicast packet, so could be getting lost due to
the dlid == 0 issue.

> 2) This is the more problematic issue: After we restart opensm,
> __endpt_mgr_reset_all is called. As a result our whole endpoint
> cache is cleared. Please note that Windows is not aware of what
> happened and therefore it doesn't generate ARPs but rather sends
> unicast packets.
> For these packets we don't have enough information in the endpoint and
> therefore we can't send them correctly. In the past, for these packets
> we used to query the SM, but we don't want to do that anymore.
> So my question is this: how do we want to solve this issue?
> 1) Wait for the Windows ARP table to flush? Probably too long.
> 2) Send queries to the SM? We wanted to avoid that.
> 3) Don't clear the endpoints when opensm is restarted?
>    It seems that we might use old data.
> 4) Send ARPs ourselves? Probably the best solution, but it requires
>    some more work.
>
> What do you think?

I think the key here might be to keep *some* of the SM interaction -
effectively put a path record cache in IPoIB.  If we kept the existing
path record query logic in IPoIB the issues with SM restart go away.  We
would then need to change how the MAC_TO_PATH IOCTL behaved, allowing
requests to be queued and completed asynchronously.  The IOCTL handler
would look up the endpoint, and if no path was resolved would issue the
path query if it wasn't in progress already.  This would require
queueing the IRPs and tracking them so that a path query completion
would complete any pending IRPs.

Probably the simplest way to handle this would be to queue the IRPs in
the IBAT layer when they come in, and then try to flush as many IRPs
from the queue as possible (checking whether the endpoints have valid
paths).  Any endpoint that needs a path would have a query issued, and
a path query completion would again try to flush as many IRPs from the
IBAT queue as possible.
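
The queue-and-flush idea can be sketched in plain C roughly as follows. This is only a user-mode model under stated assumptions: request_t stands in for a pended IRP, ibat_model_t and its helpers are hypothetical names, and "issuing a query" is reduced to bumping a counter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENDPTS  4
#define MAX_PENDING 8

/* Hypothetical per-endpoint state. */
typedef struct {
    int path_valid;          /* a path record has been resolved */
    int query_in_progress;   /* a path query is outstanding */
} ep_state_t;

/* Stand-in for a pended MAC_TO_PATH request (an IRP in the real driver). */
typedef struct {
    int endpt;               /* index of the endpoint being resolved */
    int completed;           /* set once the request has been completed */
} request_t;

typedef struct {
    ep_state_t ep[MAX_ENDPTS];
    request_t *pending[MAX_PENDING];
    size_t     n_pending;
    int        queries_issued;   /* how many path queries were started */
} ibat_model_t;

/* Complete every queued request whose endpoint has a valid path; for the
 * rest, start at most one query per endpoint and keep them queued. */
static void ibat_flush(ibat_model_t *m)
{
    size_t kept = 0;
    for (size_t i = 0; i < m->n_pending; i++) {
        request_t *r = m->pending[i];
        ep_state_t *e = &m->ep[r->endpt];
        if (e->path_valid) {
            r->completed = 1;
            continue;
        }
        if (!e->query_in_progress) {
            e->query_in_progress = 1;
            m->queries_issued++;
        }
        m->pending[kept++] = r;
    }
    m->n_pending = kept;
}

/* Queue an incoming request, then try to flush. Returns 0 on success,
 * -1 if the endpoint index is invalid or the queue is full. */
static int ibat_submit(ibat_model_t *m, request_t *r)
{
    if (r->endpt < 0 || r->endpt >= MAX_ENDPTS || m->n_pending == MAX_PENDING)
        return -1;
    r->completed = 0;
    m->pending[m->n_pending++] = r;
    ibat_flush(m);
    return 0;
}

/* Path query completion: record the path and flush again. */
static void ibat_on_path_query_done(ibat_model_t *m, int endpt)
{
    assert(endpt >= 0 && endpt < MAX_ENDPTS);
    m->ep[endpt].path_valid = 1;
    m->ep[endpt].query_in_progress = 0;
    ibat_flush(m);
}
```

In this model, two requests for the same unresolved endpoint trigger a single query, and both complete when that query finishes. A real implementation would also need cancellation handling and locking, which the sketch leaves out.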

The main advantage of this is that real path records would be used for
unicast traffic as well as IBAT clients, so that the packet rate, MTU,
and so forth are set optimally, while the cache would still be updated
whenever an ARP response is received, remaining in sync with the
network stack.

I hope the SM would not have a problem with path queries like this - the
query load would grow as the square of number of nodes, rather than the
square of the number of cores.
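
To put rough numbers on that, here is a small C sketch. It assumes the "64x8" jobs mentioned in this thread mean 64 nodes with 8 cores each, and it uses a simple worst-case model in which every requester resolves a path to every target; the function names are made up for the example.

```c
#include <assert.h>

/* Worst-case number of (source, destination) path queries when each of
 * n requesters resolves a path to each of n targets. */
static unsigned long pairwise_queries(unsigned long n)
{
    return n * n;
}

/* Factor by which the query load shrinks when queries are issued per
 * node rather than per core. */
static unsigned long load_reduction(unsigned long nodes,
                                    unsigned long cores_per_node)
{
    assert(nodes > 0 && cores_per_node > 0);
    return pairwise_queries(nodes * cores_per_node) / pairwise_queries(nodes);
}
```

Under these assumptions, per-node queries number 64 * 64 = 4096, per-core queries number 512 * 512 = 262144, and the reduction factor is 8 * 8 = 64.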

-Fab

