Re: [ofw] [IPoIB] Problem with "Avoid the SM" patch

Hal Rosenstock hal.rosenstock at gmail.com
Thu Sep 18 08:44:49 PDT 2008


Alex,

On Thu, Sep 18, 2008 at 11:29 AM, Alex Naslednikov <xalex at mellanox.co.il> wrote:
> Hal,
> Below are the answers.
>
> Q1. Are you referring to the SA cache ?
> Yes, definitely.
>
> Q2. I also don't understand what you mean by "connect opensm". Do you
> mean query the SM/SA ?
> Yes.
>
> Q3.What is meant by "IPoIB UP" ? Does this mean "operationally up" ?
> What is used to determine this ?
> Currently, ipoib_port_up() function immediately starts sending SA
> queries to a broadcast group.

I don't understand what you mean by this. SA queries do not go on the
broadcast group; ARPs might.

At this point, has the broadcast group been successfully joined ?

> Evgeny meant here some additional delay to allow opensm to start.

I think you mean respond (and it's any SM)...

> As I understand it, Fab tested this (10 retries with a 1-second timeout)
> and found that it didn't help.
> Another option is to increase the timeout to 5 seconds and send fewer
> retries.

I think some exponential backoff strategy with some randomization
might be better.
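
A rough sketch of that idea in C, purely illustrative: the function name,
parameters, and the 1 s / 32 s values in the note below are hypothetical
and are not taken from the IPoIB code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of ipoib_port.c: returns how long to wait
 * (in ms) before SA query retry number `attempt` (0-based).
 * The upper bound doubles on every attempt, up to cap_ms, and the actual
 * wait is picked in [bound/2, bound] so that nodes that saw the same
 * port-up event do not all retry at the same moment.
 * `rnd` is any random number supplied by the caller. */
uint32_t sa_retry_timeout_ms( uint32_t attempt, uint32_t base_ms,
                              uint32_t cap_ms, uint32_t rnd )
{
    uint32_t bound = base_ms;
    uint32_t half;

    assert( base_ms > 0 && base_ms <= cap_ms );

    /* Double the bound once per attempt, never going past the cap. */
    while( attempt-- > 0 && bound < cap_ms )
        bound = ( bound > cap_ms / 2 ) ? cap_ms : bound * 2;

    /* Randomize within the upper half of the current bound. */
    half = bound / 2;
    return half + rnd % ( bound - half + 1 );
}
```

With a 1000 ms base and a 32000 ms cap, the first retry would wait between
500 and 1000 ms, the fourth between 4000 and 8000 ms, and later retries
level off at 16000 to 32000 ms.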

-- Hal

> XaleX
>
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: Thursday, September 18, 2008 3:20 PM
> To: Alex Naslednikov
> Cc: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
> Subject: Re: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>
> Alex,
>
> On Wed, Sep 17, 2008 at 5:00 AM, Alex Naslednikov <xalex at mellanox.co.il>
> wrote:
>> I spoke with Evgeny, a Mellanox opensm owner.
>> He says there was a similar attempt in Linux to avoid subnet manager
>> communication, but that feature still has unresolved problems and is
>> therefore disabled by default.
>
> Are you referring to the SA cache ?
>
>
>> Also, according to Evgeny, the reason opensm currently does not scale
>> (starting from 64x8 MPI jobs) is that we try to connect opensm after
>> the "PORT UP" event and not after the "IPoIB UP" event.
>
> What is meant by "IPoIB UP" ? Does this mean "operationally up" ? What
> is used to determine this ?
>
> I also don't understand what you mean by "connect opensm". Do you mean
> query the SM/SA ?
>
>
> -- Hal
>
>> Fab, can you modify your patch to let the user select between
>> the old and the new solutions? (i.e. with/without the "Avoid the SM"
>> patch)
>>
>> Thanks,
>> XaleX
>>
>> ________________________________
>> From: ofw-bounces at lists.openfabrics.org
>> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Alex
>> Naslednikov
>> Sent: Wednesday, September 17, 2008 11:47 AM
>> To: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
>> Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>>
>> Reposting this issue to the whole community.
>> Current Problem:
>> The "Avoid the SM" patch causes IPoIB to clean its cache incorrectly.
>> That is, when opensm is restarted, IPoIB communication stops working
>> (including pings).
>> Detailed Description:
>> 1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid==0
>> for the endpoints.
>> Thus, an ARP should be sent in order to resume normal communication.
>> 1.2 The ARP was indeed sent, and even received by the remote side.
>> But note: the ARP request is sent by broadcast, while the ARP response
>> is always unicast, here with an INVALID DLID.
>> Thus, normal communication can't be resumed without an ARP response, and
>> an ARP response can't be sent without a valid dlid.
>>
>> Proposed Solution:
>> 2.1 When an ARP request is received and the endpoint's dlid is equal
>> to zero, delete this endpoint and recreate it.
>> 2.2 In order to initialize the ARP table (and thus generate ARP
>> requests), indicate link down/link up to NDIS.
>>
>> Checklist (we executed these checks on an 8-node cluster):
>> 1. Run opensm and validate that ping works.
>> 2. Kill opensm. Ping should still work.
>> 3. Restart opensm on the same node. Ping should work.
>> 4. Rerun #2.
>> 5. Restart opensm on another node. Ping should work.
>> 6. Run another instance of opensm, such that the previous instance
>>    switches to "standby mode". Ping should work.
>> OR:
>> 6A. Run another instance of opensm, such that the previous instance
>>    remains in "active mode".
>>    Then kill the active instance. The "standby" instance should enter
>>    active mode. Ping should work.
>> 7. Run 2 different instances of opensm. During the run, clear the
>>    guid2lid file and kill the active instance.
>>    The passive instance will become active and ping should still work.
>> 8. Change the guid2lid file (change lids only) and restart opensm. Ping
>>    should work. IMPORTANT! Validate here that the IPoIB addresses didn't
>>    change but the lids did, so that pings are sent to the right host.
>>
>> Fix to "Avoid the SM"
>> signed-off by: Alexander Naslednikov (xalex at mellanox.co.il)
>> ===================================================================
>> --- ipoib_port.c        (revision 3149)
>> +++ ipoib_port.c        (working copy)
>> @@ -2357,6 +2357,11 @@
>>                         /* Out of date!  Destroy the endpoint and replace it. */
>>                         __endpt_mgr_remove( p_port, *pp_src );
>>                         *pp_src = NULL;
>> +               }
>> +               else if ( ! ((*pp_src)->dlid)) {
>> +                       /* Out of date!  Destroy the endpoint and replace it. */
>> +                       __endpt_mgr_remove( p_port, *pp_src );
>> +                       *pp_src = NULL;
>>                 }
>>                 else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>>                 {
>> @@ -4153,10 +4158,25 @@
>>         cl_qlist_init( &mc_list );
>>
>>         cl_obj_lock( &p_port->obj );
>> +
>>         /* Wait for all readers to complete. */
>>         while( p_port->endpt_rdr )
>>                 ;
>> +#if 0
>> +               __endpt_mgr_remove_all(p_port);
>> +#else
>>
>> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
>> +                                       NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
>> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
>> +
>> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
>> +                                       NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
>> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
>> +
>> +
>>         if( p_port->p_local_endpt )
>>         {
>>                 cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>>
>>
>>
>> -----Original Message-----
>> From: Alex Naslednikov
>> Sent: Sunday, September 14, 2008 5:32 PM
>> To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
>> Cc: Ishai Rabinovitz
>> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>>
>> I'd just like to summarize everything we said before and to propose a
>> temporary solution.
>>
>> 1. The Problem
>> 1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid==0
>> for the endpoints.
>> Thus, an ARP should be sent in order to resume normal communication.
>> 1.2 The ARP was indeed sent, and even received by the remote side.
>> But note: the ARP request is sent by broadcast, while the ARP response
>> is always unicast, here with an INVALID DLID.
>> Thus, normal communication can't be resumed without an ARP response, and
>> an ARP response can't be sent without a valid dlid.
>>
>> To resolve this, here is our proposal. It's a temporary
>> solution only.
>> Of course, it should be investigated on a large cluster.
>>
>> 2. The solution
>> 2.1 When an ARP request is received and the endpoint's dlid is equal
>> to zero, delete this endpoint and recreate it.
>> 2.2 In order to initialize the ARP table (and thus generate ARP
>> requests), indicate link down/link up to NDIS.
>>
>>
>> ===================================================================
>> --- ipoib_port.c        (revision 3149)
>> +++ ipoib_port.c        (working copy)
>> @@ -2357,6 +2357,11 @@
>>                         /* Out of date!  Destroy the endpoint and replace it. */
>>                         __endpt_mgr_remove( p_port, *pp_src );
>>                         *pp_src = NULL;
>> +               }
>> +               else if ( ! ((*pp_src)->dlid)) {
>> +                       /* Out of date!  Destroy the endpoint and replace it. */
>> +                       __endpt_mgr_remove( p_port, *pp_src );
>> +                       *pp_src = NULL;
>>                 }
>>                 else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>>                 {
>> @@ -4153,10 +4158,25 @@
>>         cl_qlist_init( &mc_list );
>>
>>         cl_obj_lock( &p_port->obj );
>> +
>>         /* Wait for all readers to complete. */
>>         while( p_port->endpt_rdr )
>>                 ;
>> +#if 0
>> +               __endpt_mgr_remove_all(p_port);
>> +#else
>>
>> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
>> +                                       NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
>> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
>> +
>> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
>> +                                       NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
>> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
>> +
>> +                       //      IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
>> +                               //      ("Link DOWN!\n") );
>> +
>>         if( p_port->p_local_endpt )
>>         {
>>                 cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>>
>>
>> XaleX
>> -----Original Message-----
>> From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
>> Sent: Friday, September 12, 2008 6:00 PM
>> To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
>> Cc: ofw at lists.openfabrics.org
>> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>>
>>> Hi Fab,
>>>
>>> Here is some more information about the issue and one question.
>>> There are currently two problems that we see. Both problems start
>>> after we restart opensm.
>>>
>>> 1) After we restart opensm arp messages don't pass. The main reason
>>> we saw so far is that they are sent with the wrong addresses.
>>> Although we still haven't found exactly why that is, we will soon
>>> find that and fix it.
>>
>> Is it just a problem with the ARP responses, or the requests too?  The
>> requests should be getting sent to the broadcast group, so they should
>> work.  The response is a unicast packet, so could be getting lost due
>> to the dlid == 0 issue.
>>
>>> 2) This is the more problematic issue: After we restart opensm
>>> __endpt_mgr_reset_all is being called. As a result all our endpoint
>>> cache is cleared. Please note that windows is not aware of what
>>> happened and therefore it doesn't generate arps but rather sends
>>> unicast packets.
>>> For these packets we don't have enough information in the endpoint
>>> and therefore we can't send them correctly. In the past, for these
>>> packets we used to do a query on the SM, but we don't want to do that
>>> anymore.
>>> So my question is this, how do we want to solve this issue:
>>> 1) Wait for the windows arp table to flush? Probably too long.
>>> 2) Send queries to the SM? We wanted to avoid that.
>>> 3) Don't clear the endpoints when opensm is being restarted?
>>>    Seems that we might use old data.
>>> 4) Send arps by ourselves? Probably the best solution but requires
>>>    some more work.
>>>
>>> What do you think?
>>
>> I think the key here might be to keep *some* of the SM interaction -
>> effectively put a path record cache in IPoIB.  If we kept the existing
>> path record query logic in IPoIB the issues with SM restart go away.
>> We would then need to change how the MAC_TO_PATH IOCTL behaved,
>> allowing requests to be queued and completed asynchronously.  The
>> IOCTL handler would look up the endpoint, and if no path was resolved
>> would issue the path query if it wasn't in progress already.  This
>> would require queueing the IRPs and tracking them so that a path query
>> completion would complete any pending IRPs.
>>
>> Probably the simplest way to handle this would be to queue the IRPs in
>> the IBAT layer when they come in, and then try to flush as many IRPs
>> from the queue (look to see if the endpoints have valid paths).  Any
>> endpoint that needs a path would have a query issued, and a path query
>> completion would again try to flush as many IRPs from the IBAT queue
>> as possible.
>>
>> The main advantage of this is that real path records would be used
>> for unicast traffic as well as IBAT clients, so that the packet rate,
>> MTU, and so forth are set optimally, but the cache would be updated
>> whenever an ARP response is received, remaining in sync with the
>> network stack.
>>
>> I hope the SM would not have a problem with path queries like this -
>> the query load would grow as the square of number of nodes, rather
>> than the square of the number of cores.
>>
>> -Fab
>>
>> _______________________________________________
>> ofw mailing list
>> ofw at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
>>
>


