Re: [ofw] [IPoIB] Problem with "Avoid the SM" patch

Hal Rosenstock hal.rosenstock at gmail.com
Thu Sep 18 05:20:25 PDT 2008


Alex,

On Wed, Sep 17, 2008 at 5:00 AM, Alex Naslednikov <xalex at mellanox.co.il> wrote:
> I spoke with Evgeny, a Mellanox opensm owner.
> He says there was a similar attempt in Linux to avoid subnet manager
> communication, but the feature still has unresolved problems and is
> therefore disabled by default.

Are you referring to the SA cache ?

> Also, according to Evgeny, the current problem, that opensm does not scale
> (starting from 64x8 MPI jobs), arises because we try to connect opensm after
> the "PORT UP" event and not after the "IPoIB UP" event.

What is meant by "IPoIB UP" ? Does this mean "operationally up" ? What
is used to determine this ?

I also don't understand what you mean by "connect opensm". Do you mean
query the SM/SA ?

-- Hal

> Fab, can you modify your patch to let the user select between the old
> and the new solutions? (i.e. with/without the "avoid the sm" patch)
>
> Thanks,
> XaleX
>
> ________________________________
> From: ofw-bounces at lists.openfabrics.org
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Alex Naslednikov
> Sent: Wednesday, September 17, 2008 11:47 AM
> To: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
> Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>
> Reposting this issue to the whole community.
> Current Problem:
> The "Avoid the SM" patch causes IPoIB to clean its cache incorrectly. That
> is, when opensm is restarted, IPoIB communication stops working (including
> pings).
> Detailed Description:
> 1.1 After restarting opensm, __endpt_mgr_reset_all() sets dlid==0 for the
> endpoints.
> Thus, an ARP should be sent in order to resume normal communication.
> 1.2 The ARP was indeed sent, and even received by the remote side.
> But note that we send the ARP request by broadcast, while the ARP response
> is always unicast, here with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response, and
> the ARP response can't be sent without a valid dlid.
>
> Proposed Solution:
> 2.1 When an ARP request is received and the dlid of the sender's endpoint
> is equal to zero, delete this endpoint and recreate it.
> 2.2 In order to reinitialize the ARP table (and thus generate ARP requests),
> indicate link down/link up to NDIS.
> Checklist (we executed these checks on an 8-node cluster)
> 1. Run opensm and validate that ping works
> 2. Kill opensm. Ping should still work
> 3. Restart opensm on the same node. Ping should work
> 4. Rerun #2
> 5. Restart opensm on another node. Ping should work
> 6. Run another instance of opensm, such that the previous instance will
> switch to "standby mode". Ping should work
> OR:
> 6A. Run another instance of opensm, such that the previous instance will
> remain in "active mode".
> Then kill the active instance. The "standby" instance should enter active
> mode, and ping should keep working.
> 7. Run 2 different instances of opensm. During the run, clear the guid2lid
> file and kill the active instance.
> The passive instance will become active and ping should still work
> 8. Change the guid2lid file (change lids only) and restart opensm. Ping
> should work
> IMPORTANT! Validate here that the IPoIB addresses didn't change but the
> lids did, so that pings are sent to the right host
>
>
> Fix to "Avoid the SM"
> signed-off by: Alexander Naslednikov (xalex at mellanox.co.il)
> ===================================================================
> --- ipoib_port.c        (revision 3149)
> +++ ipoib_port.c        (working copy)
> @@ -2357,6 +2357,11 @@
>                         /* Out of date!  Destroy the endpoint and replace it. */
>                         __endpt_mgr_remove( p_port, *pp_src );
>                         *pp_src = NULL;
> +               }
> +               else if ( ! ((*pp_src)->dlid)) {
> +                       /* Out of date!  Destroy the endpoint and replace it. */
> +                       __endpt_mgr_remove( p_port, *pp_src );
> +                       *pp_src = NULL;
>                 }
>                 else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>                 {
> @@ -4153,10 +4158,25 @@
>         cl_qlist_init( &mc_list );
>
>         cl_obj_lock( &p_port->obj );
> +
>         /* Wait for all readers to complete. */
>         while( p_port->endpt_rdr )
>                 ;
> +#if 0
> +               __endpt_mgr_remove_all(p_port);
> +#else
>
> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                                       NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                                       NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +
>         if( p_port->p_local_endpt )
>         {
>                 cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
>
> -----Original Message-----
> From: Alex Naslednikov
> Sent: Sunday, September 14, 2008 5:32 PM
> To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
> Cc: Ishai Rabinovitz
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
> I'd just like to summarize all we said before and to propose a temporary
> solution.
>
> 1. The Problem
> 1.1 After restarting opensm, __endpt_mgr_reset_all() sets dlid==0 for the
> endpoints.
> Thus, an ARP should be sent in order to resume normal communication.
> 1.2 The ARP was indeed sent, and even received by the remote side.
> But note that we send the ARP request by broadcast, while the ARP response
> is always unicast, here with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response, and
> the ARP response can't be sent without a valid dlid.
>
> So, in order to resolve it, here is our proposal. It's a temporary solution
> only.
> Of course, it should be investigated on a large cluster.
>
> 2. The solution
> 2.1 When an ARP request is received and the dlid of the sender's endpoint
> is equal to zero, delete this endpoint and recreate it.
> 2.2 In order to reinitialize the ARP table (and thus generate ARP requests),
> indicate link down/link up to NDIS.
>
>
> ===================================================================
> --- ipoib_port.c        (revision 3149)
> +++ ipoib_port.c        (working copy)
> @@ -2357,6 +2357,11 @@
>                         /* Out of date!  Destroy the endpoint and replace it. */
>                         __endpt_mgr_remove( p_port, *pp_src );
>                         *pp_src = NULL;
> +               }
> +               else if ( ! ((*pp_src)->dlid)) {
> +                       /* Out of date!  Destroy the endpoint and replace it. */
> +                       __endpt_mgr_remove( p_port, *pp_src );
> +                       *pp_src = NULL;
>                 }
>                 else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>                 {
> @@ -4153,10 +4158,25 @@
>         cl_qlist_init( &mc_list );
>
>         cl_obj_lock( &p_port->obj );
> +
>         /* Wait for all readers to complete. */
>         while( p_port->endpt_rdr )
>                 ;
> +#if 0
> +               __endpt_mgr_remove_all(p_port);
> +#else
>
> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                                       NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                                       NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +                       //      IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
> +                               //      ("Link DOWN!\n") );
> +
>         if( p_port->p_local_endpt )
>         {
>                 cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
> XaleX
> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
> Sent: Friday, September 12, 2008 6:00 PM
> To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
>> Hi Fab,
>>
>> Here is some more information about the issue and one question.
>> There are currently two problems that we see. Both problems start
>> after we restart opensm.
>>
>> 1) After we restart opensm, arp messages don't pass. The main reason we
>> saw so far is that they are sent with the wrong addresses. Although we
>> still haven't found exactly why that is, we will soon find that and
>> fix it.
>
> Is it just a problem with the ARP responses, or the requests too?  The
> requests should be getting sent to the broadcast group, so they should
> work.  The response is a unicast packet, so could be getting lost due to the
> dlid == 0 issue.
>
>> 2) This is the more problematic issue: After we restart opensm
>> __endpt_mgr_reset_all is being called. As a result all our endpoint
>> cache is cleared. Please note that windows is not aware of what
>> happened and therefore it doesn't generate arps but rather sends unicast
>> packets.
>> For these packets we don't have enough information in the endpoint and
>> therefore we can't send them correctly. In the past, for these packets
>> we used to do a query on the SM, but we don't want to do that anymore.
>> So my question is this, how do we want to solve this issue:
>> 1) Wait for the windows arp table to flush? Probably too long.
>> 2) Send queries to the SM? We wanted to avoid that.
>> 3) Don't clear the endpoints when opensm is being restarted?
>>    Seems that we might use old data.
>> 4) Send arps by ourselves? Probably the best solution but requires
>>    some more work.
>>
>> What do you think?
>
> I think the key here might be to keep *some* of the SM interaction -
> effectively put a path record cache in IPoIB.  If we kept the existing path
> record query logic in IPoIB the issues with SM restart go away.  We would
> then need to change how the MAC_TO_PATH IOCTL behaved, allowing requests to
> be queued and completed asynchronously.  The IOCTL handler would look up the
> endpoint, and if no path was resolved would issue the path query if it
> wasn't in progress already.  This would require queueing the IRPs and
> tracking them so that a path query completion would complete any pending
> IRPs.
>
> Probably the simplest way to handle this would be to queue the IRPs in the
> IBAT layer when they come in, and then try to flush as many IRPs from the
> queue as possible (look to see if the endpoints have valid paths).  Any
> endpoint that needs a path would have a query issued, and a path query
> completion would again try to flush as many IRPs from the IBAT queue as
> possible.
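
The queue-and-flush scheme described above can be sketched in plain C as
follows. This is only an illustration under stated assumptions: ibat_t,
ibat_submit(), ibat_flush() and ibat_path_query_done() are hypothetical
names, a pended IRP is reduced to an endpoint index, and the locking and IRP
cancellation a real driver needs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENDPTS  4
#define MAX_PENDING 8

/* Hypothetical per-endpoint path state. */
typedef struct {
    bool path_valid;       /* a path record has been resolved */
    bool query_in_flight;  /* a path query was issued and has not completed */
} endpt_t;

/* Hypothetical IBAT-level queue; each pending entry stands in for one IRP. */
typedef struct {
    endpt_t endpts[MAX_ENDPTS];
    int     pending[MAX_PENDING];  /* endpoint index per queued request */
    size_t  n_pending;
    size_t  n_completed;           /* requests completed so far */
    size_t  n_queries;             /* path queries issued so far */
} ibat_t;

/*
 * Complete every queued request whose endpoint has a valid path. For the
 * rest, issue a path query unless one is already in flight, and keep the
 * request queued in its original order. Returns the number completed.
 */
size_t ibat_flush(ibat_t *q)
{
    size_t kept = 0, done = 0;

    for (size_t i = 0; i < q->n_pending; i++) {
        int idx = q->pending[i];
        endpt_t *e = &q->endpts[idx];

        if (e->path_valid) {
            done++;                 /* the IRP would be completed here */
            continue;
        }
        if (!e->query_in_flight) {
            e->query_in_flight = true;
            q->n_queries++;         /* the path query would be sent here */
        }
        q->pending[kept++] = idx;
    }
    q->n_pending = kept;
    q->n_completed += done;
    return done;
}

/* Queue a new request, then flush. Returns false if the queue is full. */
bool ibat_submit(ibat_t *q, int endpt)
{
    assert(endpt >= 0 && endpt < MAX_ENDPTS);
    if (q->n_pending == MAX_PENDING)
        return false;
    q->pending[q->n_pending++] = endpt;
    ibat_flush(q);
    return true;
}

/* Path query completion: record the path, then flush again. */
void ibat_path_query_done(ibat_t *q, int endpt)
{
    assert(endpt >= 0 && endpt < MAX_ENDPTS);
    q->endpts[endpt].path_valid = true;
    q->endpts[endpt].query_in_flight = false;
    ibat_flush(q);
}
```

Two requests for the same unresolved endpoint trigger a single query, and
both complete when that query finishes; later requests for the endpoint
complete immediately from the cached path.
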
>
> The main advantage of this is that real path records would be used for
> unicast traffic as well as IBAT clients, so that the packet rate, MTU, and
> so forth are set optimally, while the cache would still be updated whenever
> an ARP response is received, remaining in sync with the network stack.
>
> I hope the SM would not have a problem with path queries like this - the
> query load would grow as the square of the number of nodes, rather than the
> square of the number of cores.
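
As a rough worked example of that scaling, the sketch below counts path
queries for an all-to-all job. It rests on two assumptions that are not from
the thread: that every endpoint resolves a path to every other endpoint once
(n * (n - 1) queries), and that the "64x8" figure mentioned earlier means 64
nodes with 8 cores each. pair_queries() and per_core_queries() are
hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Queries needed if each of n endpoints resolves a path to every other one. */
uint64_t pair_queries(uint64_t n)
{
    return n == 0 ? 0 : n * (n - 1);
}

/* Same count when every core (MPI rank) queries on its own behalf. */
uint64_t per_core_queries(uint64_t nodes, uint64_t cores_per_node)
{
    assert(cores_per_node >= 1);
    return pair_queries(nodes * cores_per_node);
}
```

Under these assumptions, pair_queries(64) gives 64 * 63 = 4032 queries with
a per-node cache, while per_core_queries(64, 8) gives 512 * 511 = 261632,
roughly 65 times as many.
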
>
> -Fab
>