[ofw] [IPoIB] Problem with "Avoid the SM" patch
Alex Naslednikov
xalex at mellanox.co.il
Thu Sep 18 08:29:27 PDT 2008
Hal,
Below are the answers.
Q1. Are you referring to the SA cache ?
Yes, definitely.
Q2. I also don't understand what you mean by "connect opensm". Do you
mean query the SM/SA ?
Yes.
Q3. What is meant by "IPoIB UP" ? Does this mean "operationally up" ?
What is used to determine this ?
Currently, the ipoib_port_up() function immediately starts sending SA
queries to a broadcast group.
Evgeny meant adding some delay here to allow opensm to start.
I understand that Fab checked this (with 10 retries and a 1-second
timeout) and found that it didn't help.
Another option is to increase the timeout to 5 seconds and send fewer
retries.
XaleX
-----Original Message-----
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
Sent: Thursday, September 18, 2008 3:20 PM
To: Alex Naslednikov
Cc: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
Subject: Re: [ofw] [IPoIB] Problem with "Avoid the SM" patch
Alex,
On Wed, Sep 17, 2008 at 5:00 AM, Alex Naslednikov <xalex at mellanox.co.il>
wrote:
> I spoke with Evgeny, a Mellanox opensm owner.
> He claims that there was a similar attempt in Linux to avoid subnet
> manager communication, but that feature still has unresolved
> problems and is therefore disabled by default.
Are you referring to the SA cache ?
> Also, according to Evgeny, the current problem that opensm is not
> scalable (starting from 64x8 MPI jobs) is because we try to connect
> opensm after "PORT UP" event and not after "IPoIB UP" event.
What is meant by "IPoIB UP" ? Does this mean "operationally up" ? What
is used to determine this ?
I also don't understand what you mean by "connect opensm". Do you mean
query the SM/SA ?
-- Hal
> Fab, can you modify your patch to allow the user to select between
> the old and the new solutions ? (i.e. with/without the "Avoid the SM"
> patch)
>
> Thanks,
> XaleX
>
> ________________________________
> From: ofw-bounces at lists.openfabrics.org
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Alex
> Naslednikov
> Sent: Wednesday, September 17, 2008 11:47 AM
> To: ofw at lists.openfabrics.org; Fab Tillier; Leonid Keller
> Subject: [ofw] [IPoIB] Problem with "Avoid the SM" patch
>
> Reposting this issue to the whole community.
> Current Problem:
> The "Avoid the SM" patch causes invalid cache cleaning in IPoIB. That
> is, after opensm is restarted, IPoIB communication stops working
> (including pings).
> Detailed Description:
> 1.1 After restarting opensm, __endpt_mgr_reset_all() sets dlid==0
> for the endpoints.
> Thus, an ARP should be sent in order to resume normal communication.
> 1.2 The ARP was indeed sent, and even received by the remote side.
> But note that the ARP request is sent by broadcast, while the ARP
> response is always unicast, here with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response, and
> the ARP response can't be sent without a valid dlid.
>
> Proposed Solution:
> 2.1. When an ARP request is received and the endpoint's dlid is zero,
> delete the endpoint and recreate it.
> 2.2 In order to reinitialize the ARP table (and thus generate ARP
> requests), indicate link down/link up to NDIS.
>
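The deadlock in 1.1-1.2 and the fix in 2.1 can be modelled outside the driver. This is only a minimal sketch in plain C, not the ipoib_port.c code: endpt_t, endpt_cache_t, cache_reset_all() and on_arp_request() are hypothetical names, and it assumes the receive path can learn the sender's LID (src_lid) from the incoming ARP request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENDPTS 16

/* Hypothetical, simplified endpoint; the real driver keeps far more state. */
typedef struct {
    uint64_t mac;    /* key used by the network stack */
    uint16_t dlid;   /* destination LID; 0 means "reset / unknown" */
    int      in_use;
} endpt_t;

typedef struct {
    endpt_t endpts[MAX_ENDPTS];
} endpt_cache_t;

static endpt_t *cache_find(endpt_cache_t *c, uint64_t mac)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        if (c->endpts[i].in_use && c->endpts[i].mac == mac)
            return &c->endpts[i];
    return NULL;
}

static endpt_t *cache_insert(endpt_cache_t *c, uint64_t mac, uint16_t dlid)
{
    assert(dlid != 0); /* a fresh endpoint must carry a usable LID */
    for (size_t i = 0; i < MAX_ENDPTS; i++) {
        if (!c->endpts[i].in_use) {
            c->endpts[i] = (endpt_t){ .mac = mac, .dlid = dlid, .in_use = 1 };
            return &c->endpts[i];
        }
    }
    return NULL; /* cache full */
}

/* Models the effect of an SM restart: every cached dlid is zeroed. */
static void cache_reset_all(endpt_cache_t *c)
{
    for (size_t i = 0; i < MAX_ENDPTS; i++)
        if (c->endpts[i].in_use)
            c->endpts[i].dlid = 0;
}

/*
 * Handle a broadcast ARP request from src_mac that arrived from src_lid.
 * Returns the dlid the unicast ARP response would be sent to.
 * apply_fix selects between the old behaviour (0) and proposal 2.1 (1).
 */
static uint16_t on_arp_request(endpt_cache_t *c, uint64_t src_mac,
                               uint16_t src_lid, int apply_fix)
{
    endpt_t *e = cache_find(c, src_mac);

    if (e && e->dlid == 0 && apply_fix) {
        e->in_use = 0; /* out of date: destroy the endpoint */
        e = NULL;
    }
    if (!e)
        e = cache_insert(c, src_mac, src_lid); /* recreate from the packet */

    return e ? e->dlid : 0;
}
```

With apply_fix set to 0, the response to a reset endpoint would go to dlid 0 and be lost; with apply_fix set to 1, the stale entry is replaced and the response uses the sender's current LID.
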
> Checklist (we ran these checks on an 8-node cluster):
> 1. Run opensm and validate that ping works.
> 2. Kill opensm. Ping should still work.
> 3. Restart opensm on the same node. Ping should work.
> 4. Rerun #2.
> 5. Restart opensm on another node. Ping should work.
> 6. Run another instance of opensm, such that the previous instance
> switches to "standby mode". Ping should work.
> OR:
> 6A. Run another instance of opensm, such that the previous instance
> remains in "active mode". Then kill the active instance. The "standby"
> instance should enter active mode, and ping should keep working.
> 7. Run 2 different instances of opensm. During the run, clear the
> guid2lid file and kill the active instance. The passive instance will
> become active and ping should still work.
> 8. Change the guid2lid file (change lids only) and restart opensm. Ping
> should work. IMPORTANT! Validate here that the IPoIB addresses didn't
> change but the lids did, so that pings are sent to the right host.
>
>
> Fix to "Avoid the SM"
> signed-off by: Alexander Naslednikov (xalex at mellanox.co.il)
> ===================================================================
> --- ipoib_port.c (revision 3149)
> +++ ipoib_port.c (working copy)
> @@ -2357,6 +2357,11 @@
> /* Out of date! Destroy the endpoint and replace it. */
> __endpt_mgr_remove( p_port, *pp_src );
> *pp_src = NULL;
> + }
> + else if ( ! ((*pp_src)->dlid)) {
> + /* Out of date! Destroy the endpoint and replace it. */
> + __endpt_mgr_remove( p_port, *pp_src );
> + *pp_src = NULL;
> }
> else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
> {
> @@ -4153,10 +4158,25 @@
> cl_qlist_init( &mc_list );
>
> cl_obj_lock( &p_port->obj );
> +
> /* Wait for all readers to complete. */
> while( p_port->endpt_rdr )
> ;
> +#if 0
> + __endpt_mgr_remove_all(p_port);
> +#else
>
> + NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> + NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> + NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> + NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> + NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> + NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> +
> if( p_port->p_local_endpt )
> {
> cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
>
> -----Original Message-----
> From: Alex Naslednikov
> Sent: Sunday, September 14, 2008 5:32 PM
> To: 'Fab Tillier'; Tzachi Dar; Reuven Amitai; Leonid Keller
> Cc: Ishai Rabinovitz
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
> I'd like to summarize everything we said before and to propose a
> temporary solution.
>
> 1. The Problem
> 1.1 After restarting opensm, __endpt_mgr_reset_all() sets dlid==0
> for the endpoints.
> Thus, an ARP should be sent in order to resume normal communication.
> 1.2 The ARP was indeed sent, and even received by the remote side.
> But note that the ARP request is sent by broadcast, while the ARP
> response is always unicast, here with an INVALID DLID.
> Thus, normal communication can't be resumed without an ARP response, and
> the ARP response can't be sent without a valid dlid.
>
> So, in order to resolve it, here is our proposal. It's a temporary
> solution only.
> Of course, it should be investigated on a large cluster.
>
> 2. The solution
> 2.1. When an ARP request is received and the endpoint's dlid is zero,
> delete the endpoint and recreate it.
> 2.2 In order to reinitialize the ARP table (and thus generate ARP
> requests), indicate link down/link up to NDIS.
>
>
> ===================================================================
> --- ipoib_port.c (revision 3149)
> +++ ipoib_port.c (working copy)
> @@ -2357,6 +2357,11 @@
> /* Out of date! Destroy the endpoint and replace it. */
> __endpt_mgr_remove( p_port, *pp_src );
> *pp_src = NULL;
> + }
> + else if ( ! ((*pp_src)->dlid)) {
> + /* Out of date! Destroy the endpoint and replace it. */
> + __endpt_mgr_remove( p_port, *pp_src );
> + *pp_src = NULL;
> }
> else if( ipoib_is_voltaire_router_gid( &(*pp_src)->dgid ) )
> {
> @@ -4153,10 +4158,25 @@
> cl_qlist_init( &mc_list );
>
> cl_obj_lock( &p_port->obj );
> +
> /* Wait for all readers to complete. */
> while( p_port->endpt_rdr )
> ;
> +#if 0
> + __endpt_mgr_remove_all(p_port);
> +#else
>
> + NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> + NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0 );
> + NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> + NdisMIndicateStatus( p_port->p_adapter->h_adapter,
> + NDIS_STATUS_MEDIA_CONNECT, NULL, 0 );
> + NdisMIndicateStatusComplete( p_port->p_adapter->h_adapter );
> +
> + // IPOIB_PRINT( TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
> + // ("Link DOWN!\n") );
> +
> if( p_port->p_local_endpt )
> {
> cl_qmap_remove_item( &p_port->endpt_mgr.mac_endpts,
>
>
> XaleX
> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
> Sent: Friday, September 12, 2008 6:00 PM
> To: Tzachi Dar; Alex Naslednikov; Reuven Amitai; Leonid Keller
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [ofw] Problem with "Avoid the SM" patch
>
>> Hi Fab,
>>
>> Here is some more information about the issue and one question.
>> There are currently two problems that we see. Both problems start
>> after we restart opensm.
>>
>> 1) After we restart opensm, ARP messages don't pass. The main reason
>> we have seen so far is that they are sent with the wrong addresses.
>> Although we still haven't found exactly why that is, we expect to
>> find and fix it soon.
>
> Is it just a problem with the ARP responses, or the requests too? The
> requests should be getting sent to the broadcast group, so they should
> work. The response is a unicast packet, so could be getting lost due
> to the dlid == 0 issue.
>
>> 2) This is the more problematic issue: After we restart opensm
>> __endpt_mgr_reset_all is called. As a result our entire endpoint
>> cache is cleared. Please note that Windows is not aware of what
>> happened and therefore it doesn't generate ARPs but rather sends
>> unicast packets.
>> For these packets we don't have enough information in the endpoint
>> and therefore we can't send them correctly. In the past we used to
>> query the SM for these packets, but we don't want to do that anymore.
>> So my question is this, how do we want to solve this issue:
>> 1) Wait for the Windows ARP table to flush? Probably too long.
>> 2) Send queries to the SM? We wanted to avoid that.
>> 3) Don't clear the endpoints when opensm is restarted?
>> Seems that we might use old data.
>> 4) Send ARPs ourselves? Probably the best solution but requires
>> some more work.
>>
>> What do you think?
>
> I think the key here might be to keep *some* of the SM interaction -
> effectively put a path record cache in IPoIB. If we kept the existing
> path record query logic in IPoIB the issues with SM restart go away.
> We would then need to change how the MAC_TO_PATH IOCTL behaved,
> allowing requests to be queued and completed asynchronously. The
> IOCTL handler would look up the endpoint, and if no path was resolved
> would issue the path query if it wasn't in progress already. This
> would require queueing the IRPs and tracking them so that a path query
> completion would complete any pending IRPs.
>
> Probably the simplest way to handle this would be to queue the IRPs in
> the IBAT layer when they come in, and then try to flush as many IRPs
> from the queue as possible (look to see if the endpoints have valid
> paths). Any endpoint that needs a path would have a query issued, and a
> path query completion would again try to flush as many IRPs from the
> IBAT queue as possible.
>
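The queue-and-flush scheme described above can be sketched as a small state model. This is an illustration in plain C under stated assumptions, not IBAT or IPoIB code: ibat_model_t, submit_request(), flush_pending() and on_path_query_done() are hypothetical names, requests are reduced to endpoint indices, and issuing a path query is modelled as setting a flag and bumping a counter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENDPTS  8
#define MAX_PENDING 32

typedef struct {
    int path_valid;        /* a path record has been resolved */
    int query_in_progress; /* a path query is outstanding */
} endpt_state_t;

typedef struct {
    endpt_state_t endpts[MAX_ENDPTS];
    int    pending[MAX_PENDING]; /* endpoint index of each queued request */
    size_t n_pending;
    size_t n_completed;          /* requests completed so far */
    size_t n_queries;            /* path queries issued so far */
} ibat_model_t;

/*
 * Complete every queued request whose endpoint already has a path.
 * For the others, issue one query per endpoint (if none is running)
 * and keep the request queued.
 */
static void flush_pending(ibat_model_t *m)
{
    size_t keep = 0;

    for (size_t i = 0; i < m->n_pending; i++) {
        int idx = m->pending[i];
        endpt_state_t *e = &m->endpts[idx];

        if (e->path_valid) {
            m->n_completed++;      /* stands in for completing the IRP */
            continue;
        }
        if (!e->query_in_progress) {
            e->query_in_progress = 1;
            m->n_queries++;        /* stands in for sending a path query */
        }
        m->pending[keep++] = idx;  /* still waiting for a path */
    }
    m->n_pending = keep;
}

/* A new request arrives: queue it, then flush whatever can complete. */
static void submit_request(ibat_model_t *m, int endpt_idx)
{
    assert(endpt_idx >= 0 && endpt_idx < MAX_ENDPTS);
    assert(m->n_pending < MAX_PENDING);
    m->pending[m->n_pending++] = endpt_idx;
    flush_pending(m);
}

/* A path query finished: record the path and flush again. */
static void on_path_query_done(ibat_model_t *m, int endpt_idx)
{
    assert(endpt_idx >= 0 && endpt_idx < MAX_ENDPTS);
    m->endpts[endpt_idx].path_valid = 1;
    m->endpts[endpt_idx].query_in_progress = 0;
    flush_pending(m);
}
```

In this model, two requests for the same unresolved endpoint trigger a single query, and both complete when on_path_query_done() runs. A real implementation would also need locking, cancellation and query-failure handling, which the sketch leaves out.
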
> The main advantage of this is that real path records would be used
> for unicast traffic as well as IBAT clients, so that the packet rate,
> MTU, and so forth are set optimally, but the cache would be updated
> whenever an ARP response is received, remaining in sync with the
> network stack.
>
> I hope the SM would not have a problem with path queries like this -
> the query load would grow as the square of number of nodes, rather
> than the square of the number of cores.
>
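To put rough numbers on that scaling, here is a small worst-case sketch. It assumes that the "64x8" job mentioned earlier in the thread means 64 nodes with 8 cores each, and that every ordered pair of endpoints needs one path query; pair_count() is a hypothetical helper, not part of any driver.

```c
#include <assert.h>
#include <stdint.h>

/* Number of ordered (source, destination) pairs among n distinct endpoints. */
static uint64_t pair_count(uint64_t n)
{
    assert(n > 0);
    return n * (n - 1);
}
```

Under these assumptions, pair_count(64) gives 4032 queries when paths are resolved per node, against pair_count(512) = 261632 when they are resolved per core, a difference of roughly 65x.
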
> -Fab
>