[ofw] [IPoIB] Problem with "Avoid the SM" patch

Thu Sep 18 08:29:27 PDT 2008

Hal,
Below are the answers.

Q1. Are you referring to the SA cache ?
Yes, definitely.

Q2. I also don't understand what you mean by "connect opensm". Do you
mean query the SM/SA ?
Yes.

Q3. What is meant by "IPoIB UP" ? Does this mean "operationally up" ?
What is used to determine this ?
Currently, the ipoib_port_up() function immediately starts sending SA
queries to a broadcast group.
What Evgeny meant here is adding some delay to allow opensm to start.
As I understand it, Fab tested this (10 retries with a 1-second timeout)
and found that it didn't help.
Another thing to try would be increasing the timeout to 5 seconds and
sending fewer retries.
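
To make the trade-off concrete, here is a rough sketch in C. The struct,
the function names and the count of 2 retries for the 5-second case are
assumptions made for illustration only; they are not taken from the
IPoIB sources. It compares the worst-case wait and the number of queries
that would reach the SM if every attempt timed out:

```c
/* Hypothetical retry policy for an SA query; not the actual IPoIB code. */
typedef struct {
    unsigned retries;     /* number of attempts before giving up */
    unsigned timeout_ms;  /* timeout of each attempt, in milliseconds */
} sa_retry_policy_t;

/* Worst-case time spent waiting before giving up, in milliseconds. */
unsigned total_wait_ms(sa_retry_policy_t p)
{
    return p.retries * p.timeout_ms;
}

/* Queries that reach the SM if every attempt from every port times out. */
unsigned long queries_sent(sa_retry_policy_t p, unsigned long ports)
{
    return (unsigned long)p.retries * ports;
}
```

With these assumed numbers, both policies wait up to 10 seconds in
total, but the 5-second policy sends a fifth of the queries while opensm
is still starting.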

XaleX

-----Original Message-----
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] 
Sent: Thursday, September 18, 2008 3:20 PM
To: Alex Naslednikov
Cc: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
Subject: Re: [ofw] [IPoIB] Problem with "Avoid the SM" patch

Alex,

On Wed, Sep 17, 2008 at 5:00 AM, Alex Naslednikov <xalex at mellanox.co.il>
wrote:
> I spoke with Evgeny, a Mellanox opensm owner.
> He claims that there was a similar attempt in Linux to avoid the
> subnet manager communication, but currently this feature still has
> unresolved problems and is therefore disabled by default.

Are you referring to the SA cache ?

> Also, according to Evgeny, the current problem, that opensm is not
> scalable (starting from 64x8 MPI jobs), is because we try to connect
> opensm after the "PORT UP" event and not after the "IPoIB UP" event.

What is meant by "IPoIB UP" ? Does this mean "operationally up" ? What
is used to determine this ?

I also don't understand what you mean by "connect opensm". Do you mean
query the SM/SA ?

-- Hal

> Fab, can you modify your patch to allow the user to select between
> the old and the new solutions (i.e. with/without the "Avoid the SM"
> patch)?
>
> Thanks,
> XaleX
>
> ________________________________
> From: ofw-bounces at lists.openfabrics.org 
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Alex 
> Naslednikov
> Sent: Wednesday, September 17, 2008 11:47 AM
> To: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
> Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>
> Reposting this issue to the whole community.
> Current Problem:
> The "Avoid the SM" patch causes invalid cache cleaning in IPoIB. That
> is, after opensm is restarted, IPoIB communication stops working
> (including pings).
> Detailed Description:
> 1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid==0
> for the endpoints.
> Thus, an ARP should be sent in order to resume normal communication.
> 1.2 The ARP was indeed sent, and even received by the remote side.
> But (note this) we send the ARP request by broadcast, while the ARP
> response is always unicast, with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response,
> and an ARP response can't be sent without a valid dlid.
>
> Proposed Solution:
> 2.1 When an ARP request is received and the sender's endpoint has
> dlid equal to zero, delete this endpoint and recreate it.
> 2.2 In order to reinitialize the ARP table (and thus generate ARP
> requests), indicate link down/link up to NDIS.
>
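
The deadlock in 1.1/1.2 and the fix in 2.1 can be modelled with a small
self-contained sketch. This is a toy model, not the driver code: the
table layout, the function names and the idea that the fresh LID is
taken from the received ARP packet are all assumptions made for
illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENDPTS 8

/* Toy endpoint: a MAC key and the destination LID used for unicast. */
typedef struct {
    bool     in_use;
    uint64_t mac;
    uint16_t dlid;   /* 0 means "invalid, cannot send unicast" */
} endpt_t;

typedef struct {
    endpt_t endpts[MAX_ENDPTS];
} endpt_mgr_t;

endpt_t *endpt_find(endpt_mgr_t *mgr, uint64_t mac)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        if (mgr->endpts[i].in_use && mgr->endpts[i].mac == mac)
            return &mgr->endpts[i];
    return NULL;
}

endpt_t *endpt_insert(endpt_mgr_t *mgr, uint64_t mac, uint16_t dlid)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++) {
        if (!mgr->endpts[i].in_use) {
            mgr->endpts[i].in_use = true;
            mgr->endpts[i].mac = mac;
            mgr->endpts[i].dlid = dlid;
            return &mgr->endpts[i];
        }
    }
    return NULL;   /* table full */
}

/* Models the effect of __endpt_mgr_reset_all(): every dlid becomes 0. */
void endpt_reset_all(endpt_mgr_t *mgr)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        mgr->endpts[i].dlid = 0;
}

/* Models fix 2.1: a stale endpoint (dlid == 0) seen on ARP receive is
 * dropped and recreated from the received packet (assumed to carry a
 * valid source LID). */
endpt_t *endpt_on_arp(endpt_mgr_t *mgr, uint64_t src_mac, uint16_t src_lid)
{
    endpt_t *e = endpt_find(mgr, src_mac);
    if (e != NULL && e->dlid == 0) {
        e->in_use = false;   /* remove the out-of-date endpoint */
        e = NULL;
    }
    if (e == NULL)
        e = endpt_insert(mgr, src_mac, src_lid);
    return e;
}
```

After endpt_reset_all(), a unicast ARP response to that endpoint has no
usable dlid; once endpt_on_arp() has run for the sender, the endpoint
carries a valid dlid again and the response can go out.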
> Checklist (we executed these checks on an 8-node cluster):
> 1. Run opensm and validate that ping works.
> 2. Kill opensm. Ping should still work.
> 3. Restart opensm on the same node. Ping should work.
> 4. Rerun #2.
> 5. Restart opensm on another node. Ping should work.
> 6. Run another instance of opensm, such that the previous instance
>    switches to "standby mode". Ping should work.
>    OR:
> 6A. Run another instance of opensm, such that the previous instance
>    remains in "active mode". Then kill the active instance. The
>    "standby" instance should enter active mode, and ping should
>    keep working.
> 7. Run 2 different instances of opensm. During the run, clear the
>    guid2lid file and kill the active instance. The passive instance
>    will become active and ping should still work.
> 8. Change the guid2lid file (change LIDs only) and restart opensm.
>    Ping should work. IMPORTANT! Validate here that the IPoIB
>    addresses didn't change but the LIDs did, so that pings are sent
>    to the right host.
>
>
> Fix to "Avoid the SM"
> signed-off by: Alexander Naslednikov (xalex at mellanox.co.il) 
> ===================================================================
> --- ipoib_port.c        (revision 3149)
> +++ ipoib_port.c        (working copy)
> @@ -2357,6 +2357,11 @@
>                         /* Out of date!  Destroy the endpoint and replace it. */
>                         __endpt_mgr_remove( p_port, *pp_src );
>                         *pp_src = NULL;
> +               }
> +               else if ( ! ((*pp_src)->dlid)) {
> +                       /* Out of date!  Destroy the endpoint and replace it. */
> +                       __endpt_mgr_remove( p_port, *pp_src );
> +                       *pp_src = NULL;
>                 }
>                 else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>                 {
> @@ -4153,10 +4158,25 @@
>         cl_qlist_init( &mc_list );
>
>         cl_obj_lock( &p_port->obj );
> +
>         /* Wait for all readers to complete. */
>         while( p_port->endpt_rdr )
>                 ;
> +#if 0
> +               __endpt_mgr_remove_all(p_port);
> +#else
>
> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                                       NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                                       NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +
>         if( p_port->p_local_endpt )
>         {
>                 cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
>
> -----Original Message-----
> From: Alex Naslednikov
> Sent: Sunday, September 14, 2008 5:32 PM
> To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
> Cc: Ishai Rabinovitz
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
> I'd just like to summarize everything we said before and to propose a
> temporary solution.
>
> 1. The Problem
> 1.1 After opensm is restarted, __endpt_mgr_reset_all() sets dlid==0
> for the endpoints.
> Thus, an ARP should be sent in order to resume normal communication.
> 1.2 The ARP was indeed sent, and even received by the remote side.
> But (note this) we send the ARP request by broadcast, while the ARP
> response is always unicast, with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response,
> and an ARP response can't be sent without a valid dlid.
>
> To resolve this, here is our proposal. It is only a temporary
> solution and, of course, it should be tested on a large cluster.
>
> 2. The solution
> 2.1 When an ARP request is received and the sender's endpoint has
> dlid equal to zero, delete this endpoint and recreate it.
> 2.2 In order to reinitialize the ARP table (and thus generate ARP
> requests), indicate link down/link up to NDIS.
>
>
> ===================================================================
> --- ipoib_port.c        (revision 3149)
> +++ ipoib_port.c        (working copy)
> @@ -2357,6 +2357,11 @@
>                         /* Out of date!  Destroy the endpoint and replace it. */
>                         __endpt_mgr_remove( p_port, *pp_src );
>                         *pp_src = NULL;
> +               }
> +               else if ( ! ((*pp_src)->dlid)) {
> +                       /* Out of date!  Destroy the endpoint and replace it. */
> +                       __endpt_mgr_remove( p_port, *pp_src );
> +                       *pp_src = NULL;
>                 }
>                 else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
>                 {
> @@ -4153,10 +4158,25 @@
>         cl_qlist_init( &mc_list );
>
>         cl_obj_lock( &p_port->obj );
> +
>         /* Wait for all readers to complete. */
>         while( p_port->endpt_rdr )
>                 ;
> +#if 0
> +               __endpt_mgr_remove_all(p_port);
> +#else
>
> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                                       NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +       NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> +                                       NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> +       NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +                       //      IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
> +                               //      ("Link DOWN!\n") );
> +
>         if( p_port->p_local_endpt )
>         {
>                 cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
> XaleX
> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
> Sent: Friday, September 12, 2008 6:00 PM
> To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
>> Hi Fab,
>>
>> Here is some more information about the issue and one question.
>> There are currently two problems that we see. Both problems start 
>> after we restart opensm.
>>
>> 1) After we restart opensm, ARP messages don't pass. The main reason
>> we have seen so far is that they are sent with the wrong addresses.
>> Although we still haven't found exactly why that is, we expect to
>> find and fix it soon.
>
> Is it just a problem with the ARP responses, or the requests too?
> The requests should be getting sent to the broadcast group, so they
> should work.  The response is a unicast packet, so could be getting
> lost due to the dlid == 0 issue.
>
>> 2) This is the more problematic issue: after we restart opensm,
>> __endpt_mgr_reset_all is called. As a result, our entire endpoint
>> cache is cleared. Please note that Windows is not aware of what
>> happened and therefore it doesn't generate ARPs but rather sends
>> unicast packets.
>> For these packets we don't have enough information in the endpoint
>> and therefore we can't send them correctly. In the past we used to
>> query the SM for these packets, but we don't want to do that anymore.
>> So my question is this, how do we want to solve this issue:
>> 1) Wait for the Windows ARP table to flush? Probably too long.
>> 2) Send queries to the SM? We wanted to avoid that.
>> 3) Don't clear the endpoints when opensm is being restarted?
>>    Seems that we might use old data.
>> 4) Send ARPs by ourselves? Probably the best solution but requires
>>    some more work.
>>
>> What do you think?
>
> I think the key here might be to keep *some* of the SM interaction -
> effectively put a path record cache in IPoIB.  If we kept the
> existing path record query logic in IPoIB the issues with SM restart
> go away.  We would then need to change how the MAC_TO_PATH IOCTL
> behaved, allowing requests to be queued and completed asynchronously.
> The IOCTL handler would look up the endpoint, and if no path was
> resolved would issue the path query if it wasn't in progress already.
> This would require queueing the IRPs and tracking them so that a path
> query completion would complete any pending IRPs.
>
> Probably the simplest way to handle this would be to queue the IRPs
> in the IBAT layer when they come in, and then try to flush as many
> IRPs from the queue as possible (look to see if the endpoints have
> valid paths).  Any endpoint that needs a path would have a query
> issued, and a path query completion would again try to flush as many
> IRPs from the IBAT queue as possible.
>
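
The queue-and-flush idea above can be sketched roughly as follows. This
is a plain C model rather than driver code: request_t stands in for an
IRP, and the ibat_* names, the fixed-size arrays and the per-endpoint
flags are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENDPTS  8
#define MAX_PENDING 16

typedef struct {
    bool path_valid;         /* a path record is cached for this endpoint */
    bool query_in_progress;  /* a path query has been issued, no reply yet */
} endpt_t;

typedef struct {
    int  endpt;      /* index of the endpoint the request asks about */
    bool completed;  /* set when the request is completed */
} request_t;         /* stands in for a MAC_TO_PATH IRP */

typedef struct {
    endpt_t    endpts[MAX_ENDPTS];
    request_t *pending[MAX_PENDING];
    size_t     n_pending;
    unsigned   queries_issued;  /* counts path queries sent to the SA */
} ibat_t;

/* Complete every queued request whose endpoint has a valid path; for the
 * rest, issue at most one query per endpoint. Returns requests completed. */
size_t ibat_flush(ibat_t *q)
{
    size_t kept = 0, done = 0;
    for (size_t i = 0; i < q->n_pending; i++) {
        request_t *r = q->pending[i];
        endpt_t *e = &q->endpts[r->endpt];
        if (e->path_valid) {
            r->completed = true;
            done++;
        } else {
            if (!e->query_in_progress) {
                e->query_in_progress = true;
                q->queries_issued++;
            }
            q->pending[kept++] = r;  /* stays queued */
        }
    }
    q->n_pending = kept;
    return done;
}

/* Queue an incoming request, then flush whatever can be completed now. */
bool ibat_submit(ibat_t *q, request_t *r)
{
    if (q->n_pending == MAX_PENDING || r->endpt < 0 || r->endpt >= MAX_ENDPTS)
        return false;
    r->completed = false;
    q->pending[q->n_pending++] = r;
    ibat_flush(q);
    return true;
}

/* A path query completion marks the path valid and flushes again. */
size_t ibat_path_query_done(ibat_t *q, int endpt)
{
    q->endpts[endpt].path_valid = true;
    q->endpts[endpt].query_in_progress = false;
    return ibat_flush(q);
}
```

Two requests for the same unresolved endpoint trigger a single query,
and both complete when that one query returns. A real implementation
would also need locking and IRP cancellation, which this model leaves
out.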
> The main advantage of this is that real path records would be used
> for unicast traffic as well as IBAT clients, so that the packet rate,
> MTU, and so forth are set optimally, but the cache would be updated
> whenever an ARP response is received, remaining in sync with the
> network stack.
>
> I hope the SM would not have a problem with path queries like this -
> the query load would grow as the square of the number of nodes,
> rather than the square of the number of cores.
>
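
As a back-of-the-envelope illustration of that scaling, the sketch below
assumes the "64x8" job mentioned earlier in the thread means 64 nodes
with 8 cores each, and simply squares the endpoint count as the
paragraph above does. The function name is made up for this example.

```c
/* Upper bound on path queries when every endpoint resolves every other
 * one: the load grows as the square of the endpoint count. */
unsigned long long pairwise_queries(unsigned long long endpoints)
{
    return endpoints * endpoints;
}
```

Under that assumption, per-node caching gives pairwise_queries(64),
i.e. 4096 queries, against pairwise_queries(64 * 8), i.e. 262144, when
every core queries on its own: a factor of 64 fewer.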
> -Fab
>
> _______________________________________________
> ofw mailing list
> ofw at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
>