[ofw] RE: [IPoIB CM] [Patch] ARP REP should be send in UD mode (connectivity issue)

Alex Estrin alex.estrin at qlogic.com
Thu Jan 22 08:17:50 PST 2009


Please see below 

> -----Original Message-----
> From: Alex Naslednikov [mailto:xalex at mellanox.co.il] 
> Sent: Thursday, January 22, 2009 10:35 AM
> To: Alex Estrin; ofw at lists.openfabrics.org
> Cc: Leonid Keller
> Subject: RE: [IPoIB CM] [Patch] ARP REP should be send in UD 
> mode (connectivity issue)
> 
> Please, see inline 
> 
> -----Original Message-----
> From: Alex Estrin [mailto:alex.estrin at qlogic.com] 
> Sent: Thursday, January 22, 2009 4:44 PM
> To: Alex Naslednikov; ofw at lists.openfabrics.org
> Subject: RE: [IPoIB CM] [Patch] ARP REP should be send in UD mode
> (connectivity issue)
> 
> > I believe that reducing connection timeout will partially resolve the
> > issue.
> > But RFC defines ARP REP to be sent through UD QP, and it solves the 
> > problem.
> 
> All ARPs are sent through UD QP. Please see __send_mgr_filter() and
> look for:
> 
> 	case ETH_PROT_TYPE_ARP:
> 		cl_perf_start( FilterArp );
> 		status = __send_mgr_filter_arp(
> 			p_port, p_eth_hdr, p_buf, buf_len, p_desc );
> 		p_desc->send_dir = SEND_UD_QP;
> 		cl_perf_stop( &p_port->p_adapter->perf, FilterArp );
> 		break;
> 
> Then, later in __build_send_desc()
> 
> 		if( p_desc->send_dir == SEND_UD_QP )
> 		{
> 			p_desc->send_qp = p_port->ib_mgr.h_qp; // UD QP
> ...
> 
> [XaleX] Yes, you are right, queued ARP REPs will also be sent via UD QP.
> 	But my point was not to delay such a packet until the connection
> time-out expires.

The delay was implemented to ensure that all TCP traffic that may immediately follow the ARP REP
goes through the connected RC QP and does not fail going through UD because of packet size.
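
To illustrate the deferral decision, here is a minimal sketch, not the driver code. It uses C11 atomics in place of InterlockedCompareExchange(), and the names (struct endpoint, arp_rep_decide, the CM_* states and ARP_REP_* actions) are hypothetical stand-ins for the real ipoib types.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the endpoint connection states. */
enum cm_state { CM_DISCONNECTED, CM_CONNECT, CM_CONNECTED };

enum arp_rep_action {
    ARP_REP_START_CONNECT_AND_QUEUE, /* first ARP REP: start RC connect, queue packet */
    ARP_REP_QUEUE,                   /* connect already in flight: queue packet only */
    ARP_REP_SEND_UD                  /* connected, or CM not in use: send on UD QP now */
};

struct endpoint {
    _Atomic int state;      /* one of enum cm_state */
    bool peer_supports_cm;  /* assumption: set from the peer's ARP REQ */
};

static enum arp_rep_action arp_rep_decide(struct endpoint *ep, bool cm_enabled)
{
    assert(ep);

    if (!cm_enabled || !ep->peer_supports_cm)
        return ARP_REP_SEND_UD;

    /* Atomically move DISCONNECTED -> CONNECT; only one caller wins. */
    int observed = CM_DISCONNECTED;
    if (atomic_compare_exchange_strong(&ep->state, &observed, CM_CONNECT))
        return ARP_REP_START_CONNECT_AND_QUEUE;

    /* The exchange failed, so 'observed' holds the state that was seen. */
    if (observed == CM_CONNECT)
        return ARP_REP_QUEUE;

    return ARP_REP_SEND_UD;
}
```

In this model the ARP REP always leaves on the UD QP; the only question is whether it leaves now or after the RC connection attempt completes.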

> And I can't understand how we can get a connect
> reply back in this case.

Please see __conn_reply_cb() in ipoib_cm.c.
This callback is fired by the CEP manager when a CREP is received.
Here we send the RTU (which allows the CEP manager to move the RC QP to RTS) and set the endpoint to the IPOIB_CM_CONNECTED state.
Then we can finally resume sending our queued ARP by calling ipoib_port_resume();
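
The sequence in that callback can be modelled roughly as follows. This is only a sketch under assumptions: struct port_model, queue_arp_rep(), resume_pending() and on_conn_reply() are hypothetical names, and the real driver works with NDIS packets and IBAL handles rather than integer ids.

```c
#include <assert.h>
#include <stddef.h>

enum cm_state { CM_DISCONNECTED, CM_CONNECT, CM_CONNECTED };

#define MAX_PENDING 8

/* Toy model of one port with a single CM endpoint. */
struct port_model {
    enum cm_state ep_state;
    int pending[MAX_PENDING];     /* ids of queued ARP REP packets */
    size_t n_pending;
    int rtu_sent;                 /* set once the RTU has gone out */
    int sent_on_ud[MAX_PENDING];  /* ids actually sent on the UD QP */
    size_t n_sent;
};

/* Queue an ARP REP while the connection is being set up. */
static int queue_arp_rep(struct port_model *p, int pkt_id)
{
    assert(p);
    if (p->n_pending == MAX_PENDING)
        return -1;
    p->pending[p->n_pending++] = pkt_id;
    return 0;
}

/* Stand-in for the resume step: flush queued packets, in order, on UD. */
static void resume_pending(struct port_model *p)
{
    for (size_t i = 0; i < p->n_pending && p->n_sent < MAX_PENDING; i++)
        p->sent_on_ud[p->n_sent++] = p->pending[i];
    p->n_pending = 0;
}

/* Stand-in for the connect-reply callback: RTU, CONNECTED, then resume. */
static void on_conn_reply(struct port_model *p)
{
    assert(p);
    if (p->ep_state != CM_CONNECT)
        return;                   /* unexpected reply: ignore in this model */
    p->rtu_sent = 1;
    p->ep_state = CM_CONNECTED;
    resume_pending(p);
}
```

The ordering is the point of the sketch: the queued ARP REP is released only after the endpoint is marked connected.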

> 	If I understood you right, you postpone the ARP response to be
> sure that all TCP applications will work through RC QP.

Correct. Please see above.

> 	So, we can do the following: destroy the appropriate CEP objects and
> recreate them again.
> 	What's your opinion regarding this?

The CEP is destroyed when the endpoint object is destroyed.
In case of a connect timeout this is expected to happen in __conn_rej_cb().
A new endpoint will be created on the next ARP REPLY attempt.
Also note that the RC QP and connect REQ during ARP REPLY will be initiated only for an endpoint
that indicates CM capabilities in its ARP REQ packet, so this endpoint will be treated accordingly.
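
As a rough model of that lifecycle (a sketch only; struct endpoint, on_conn_reject() and on_arp_reply_attempt() are hypothetical names, not the driver's API):

```c
#include <assert.h>
#include <string.h>

struct endpoint {
    int exists;           /* endpoint object (and its CEP) is allocated */
    int peer_cm_capable;  /* peer advertised CM support in its ARP REQ */
    int connecting;       /* a connect REQ is outstanding */
};

/* Reject or connect timeout: tear down the endpoint; the CEP goes with it. */
static void on_conn_reject(struct endpoint *ep)
{
    assert(ep != NULL);
    memset(ep, 0, sizeof *ep);
}

/* Next ARP REPLY attempt: recreate the endpoint if needed.
 * Returns 1 when a new connect REQ is initiated, 0 otherwise. */
static int on_arp_reply_attempt(struct endpoint *ep, int peer_cm_capable)
{
    assert(ep != NULL);
    if (!ep->exists) {
        ep->exists = 1;
        ep->peer_cm_capable = peer_cm_capable;
        ep->connecting = 0;
    }
    if (ep->peer_cm_capable && !ep->connecting) {
        ep->connecting = 1;
        return 1;
    }
    return 0;
}
```

A peer that never advertised CM capability never triggers a connect REQ in this model, so its ARP REPLY is not held back.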

> > So why do we try to debug RC flow ?
> 
> The connection request is issued in the context of sending the ARP REPLY,
> while processing of the ARP REPLY packet itself is delayed until the
> connection succeeds or times out.
> 
> Thanks,
> Alex.
> 
> > -----Original Message-----
> > From: Alex Estrin [mailto:alex.estrin at qlogic.com]
> > Sent: Thursday, January 22, 2009 4:14 PM
> > To: Alex Naslednikov; ofw at lists.openfabrics.org
> > Subject: RE: [IPoIB CM] [Patch] ARP REP should be send in UD mode 
> > (connectivity issue)
> > 
> > Hello,
> > 
> > When Responder generates ARP REP packet and endpoint is in
> > IPOIB_CM_DISCONNECTED state, it will move endpoint to transition state
> > IPOIB_CM_CONNECT, initiate connect request for that endpoint and queue
> > ARP REP packet.
> > [Xalex]
> > When connection is established (endpoint in state IPOIB_CM_CONNECTED)
> > ARP REP will resume through UD QP.
> > TCP applications start sending TCP packets immediately after it 
> > received ARP REP, so delaying it would make sure all TCP packets go 
> > through connected QP.
> > In your case I think host didn't get connect reply back.
> > Not sure why, but it is likely something went wrong with a path record
> > either locally or on remote side.
> > We need to look deeper in this.
> > Also we could probably reduce connect timeout, or connect retries so
> > host can reinit endpoint (on connection timeout) and on next ARP REP
> > will retry connect again.
> > Please see more notes inline.
> > 
> > Thanks,
> > Alex.
> > 
> > > Hello, Alex,
> > > Recently, I found the following problem:
> > > 1. Connect 2 machines B2B, run opensm, set static IPoIB addresses,
> > > verify ping.
> > > 2. Then disconnect a cable for 10-15 seconds, and connect it back.
> > > 3. Wait for a couple of seconds for opensm to indicate that the link
> > > is UP, then try to ping again.
> > > 4. The ping now will not work.
> > >
> > > Why this happens:
> > > 1. On the sender side, a ping (ARP REQ) packet will be generated and
> > > sent to the responder side.
> > > 2. The responder will generate an ARP REP packet, but it will not be
> > > sent: in recv_mgr_filter_arp, when getting to IPOIB_CM_DISCONNECTED or
> > > IPOIB_CM_CONNECT, the code will return NDIS_STATUS_PENDING, and
> > > these ARP REPs will be queued.
> > > 3. Now, the CEP manager will not be able to restore the communication,
> > > because of no response for ARP packets :)
> > > 4. Sending ARP REP in UD mode will resolve this issue.
> > >
> > > Patch: ARP REP should be send in UD mode
> > > Signed-off by: Alexander Naslednikov (xalex at mellanox.co.il)
> > > Index: ipoib_port.c
> > > 
> ===================================================================
> > > --- ipoib_port.c      (revision 3775)
> > > +++ ipoib_port.c      (working copy)
> > > @@ -4098,70 +4101,12 @@
> > >                       return status;
> > >               }
> > >               ipoib_addr_set_qpn( &p_ib_arp->dst_hw, qpn );
> > > -
> > > -             if( p_arp->op == ARP_OP_REP &&
> > > -                     p_port->p_adapter->params.cm_enabled &&
> > > -                     p_desc->p_endpt->cm_flag == 
> IPOIB_CM_FLAG_RC )
> > > -             {
> > > -                     cm_state_t      cm_state;
> > > -                     cm_state =
> > > -                             ( cm_state_t
> > > )InterlockedCompareExchange( (volatile LONG 
> > > *)&p_desc->p_endpt->conn.state,
> > > -
> > > IPOIB_CM_CONNECT, IPOIB_CM_DISCONNECTED );
> > > -                     switch( cm_state )
> > > -                     {
> > > -                     case IPOIB_CM_DISCONNECTED:
> > > -                                     IPOIB_PRINT(
> > > TRACE_LEVEL_INFORMATION, IPOIB_DBG_INIT,
> > > -                                             ("ARP REPLY pending
> > > Endpt[%p] QPN %#x MAC %02x:%02x:%02x:%02x:%02x:%02x\n",
> > > -                                             p_desc->p_endpt,
> > > -                                             cl_ntoh32(
> > > ipoib_addr_get_qpn( &p_ib_arp->dst_hw )),
> > > -
> > > p_desc->p_endpt->mac.addr[0], p_desc->p_endpt->mac.addr[1],
> > > -
> > > p_desc->p_endpt->mac.addr[2], p_desc->p_endpt->mac.addr[3],
> > > -
> > > p_desc->p_endpt->mac.addr[4], p_desc->p_endpt->mac.addr[5] ) );
> > > -                                     ipoib_addr_set_sid(
> > > &p_desc->p_endpt->conn.service_id,
> > > -
> > > ipoib_addr_get_qpn( &p_ib_arp->dst_hw ) );
> > > -
> > > -                                     ExFreeToNPagedLookasideList(
> > > -
> > > &p_port->buf_mgr.send_buf_list, p_desc->p_buf );
> > > -                                     cl_qlist_insert_tail(
> > > &p_port->send_mgr.pending_list,
> > > -
> > > IPOIB_LIST_ITEM_FROM_PACKET( p_desc->p_pkt ) );
> > > -                                     
> NdisInterlockedInsertTailList(
> > > &p_port->endpt_mgr.pending_conns,
> > > -
> > > &p_desc->p_endpt->list_item,
> > > -
> > > &p_port->endpt_mgr.conn_lock );
> > > -                                     cl_event_signal(
> > > &p_port->endpt_mgr.event );
> > 
> > Here we add endpoint to the connecting queue and signal cm management
> > thread to process.
> > Please see __endpt_cm_mgr_thread().
> > 
> > > -                                     return NDIS_STATUS_PENDING;
> > > -                    
> > > -                     case IPOIB_CM_CONNECT:
> > > -                             /* queue ARP REP packet until 
> > connected
> > > */
> > > -                                     ExFreeToNPagedLookasideList(
> > > -                                     
> > &p_port->buf_mgr.send_buf_list,
> > > p_desc->p_buf );
> > > -                                     cl_qlist_insert_tail(
> > > &p_port->send_mgr.pending_list,
> > > -
> > > IPOIB_LIST_ITEM_FROM_PACKET( p_desc->p_pkt ) );
> > > -                                     return NDIS_STATUS_PENDING;
> > > -                     default:
> > > -                             break;
> > > -                     }
> > > -             }
> > >       }
> > >       else
> > >       {
> > >               cl_memclr( &p_ib_arp->dst_hw,
> > sizeof(ipoib_hw_addr_t) );
> > >       }
> > > -
> > > -#if DBG
> > > -     if( p_port->p_adapter->params.cm_enabled )
> > > -     {
> > > -             IPOIB_PRINT( TRACE_LEVEL_INFORMATION, 
> IPOIB_DBG_INIT,
> > > -             (" ARP SEND to ENDPT[%p] State: %d flag: 
> %#x, QPN: %#x
> > > MAC %02x:%02x:%02x:%02x:%02x:%02x\n",
> > > -                     p_desc->p_endpt,
> > > -                     endpt_cm_get_state( p_desc->p_endpt ),
> > > -                     p_desc->p_endpt->cm_flag,
> > > -                     cl_ntoh32( ipoib_addr_get_qpn( 
> > &p_ib_arp->dst_hw
> > > )),
> > > -                     p_desc->p_endpt->mac.addr[0],
> > > p_desc->p_endpt->mac.addr[1],
> > > -                     p_desc->p_endpt->mac.addr[2],
> > > p_desc->p_endpt->mac.addr[3],
> > > -                     p_desc->p_endpt->mac.addr[4],
> > > p_desc->p_endpt->mac.addr[5] ));
> > > -     }
> > > -#endif
> > > -
> > > +    
> > >       p_ib_arp->dst_ip = p_arp->dst_ip;
> > > 
> > >       p_desc->send_wr[0].local_ds[1].vaddr =
> > cl_get_physaddr( p_ib_arp
> > );
> > >
> > 
> > 
> > 
> 

