[ewg] Re: [OMPI devel] OMPI over ofed udapl - bugs opened

Donald Kerr Don.Kerr at Sun.COM
Wed May 9 13:20:23 PDT 2007


I am missing some context here. Where are you plugging iWARP and OMPI
together?

Steve Wise wrote:

>On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote:
>
>>I agree OMPI trac ticket #890 should cover this. I will test the 
>>suggested fix, just removing that one line from btl_udapl.c, on Solaris. 
>>I am still not set up on Linux so hopefully Steve can confirm there.
>>
>
>All,
>
>First, I haven't tested Arlin's dat_ep_query() fix yet, as we have
>determined it's not needed.  The OMPI udapl btl never calls
>dat_ep_query().
>
>So, even with the suggested fix (removing the overwriting of the
>hca_addr port field in btl_udapl.c), OMPI over OFED uDAPL on Chelsio's
>iWARP RNIC still doesn't work.
>
>There are two new issues so far:
>
>1) This has uncovered a connection migration issue in the Chelsio
>driver/firmware.  We are developing and testing a fix for this now.
>It should hopefully be ready tomorrow.
>
>2) OMPI is not adhering to the iWARP protocol requirement that the ULP
>(in this case OMPI) that initiates the iWARP connection (the side issuing
>dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA
>message.  So if an OMPI process _accepts_ an RDMA connection, it
>cannot send on that connection until it receives some RDMA
>operation from the client process.  The current OMPI connection
>setup model does not appear to enforce this.
>
>This, combined with the bug above, causes an immediate connection
>failure on Chelsio's RNIC.  After I fix #1 above, things might get
>slightly better, but my guess is we will still have connection setup
>problems if the server side sends before the client side finishes the
>streaming-to-RDMA mode transition.
>
>There has been a series of discussions on the OFA general list about
>this issue, and the conclusion so far is that it cannot be resolved in
>the rdma-cm or iwarp-cm code of the Linux RDMA stack, mainly because
>sending an RDMA message involves the ULP's work queue and completion
>queue, so the CM cannot do this under the covers in a manner that
>doesn't affect the application.  Thus, applications must deal with
>this themselves.
>
>
>Here is a possible solution: 
>
>I assume that in OMPI, connections are only initiated when the MPI
>application does a send operation.  Given that, the udapl btl must
>ensure that if a given rank accepts a connection, it does not send
>anything until the rank at the other end of the connection sends first.
>Since the other side initiated the connection, it will have pending data
>to send.
>
>I haven't looked into how painful this will be to implement.
>
>Thoughts?
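
A minimal sketch of the deferred-send rule proposed above, in C. All names
here (struct conn, conn_send(), conn_on_recv(), post_send()) are
hypothetical and are not the real btl_udapl data structures; post_send()
only stands in for a real post such as dat_ep_post_send(). It assumes, as
the proposal does, that the initiator always has data pending, and also
that the initiator's first message is a send that produces a receive
completion on the acceptor (an RDMA write from the initiator would not).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types; not the real btl_udapl endpoint structures. */
enum conn_role { CONN_INITIATOR, CONN_ACCEPTOR };

#define MAX_DEFERRED 16
#define MAX_LOG 64

struct conn {
    enum conn_role role;
    bool peer_spoke_first;       /* acceptor only: initiator's first message seen */
    int deferred[MAX_DEFERRED];  /* fragments queued until we may send */
    size_t n_deferred;
    int posted[MAX_LOG];         /* log of fragments handed to the RNIC, in order */
    size_t n_posted;
};

/* The connection initiator may always send; the acceptor must wait for the peer. */
static bool may_send(const struct conn *c)
{
    return c->role == CONN_INITIATOR || c->peer_spoke_first;
}

/* Stand-in for a real post such as dat_ep_post_send(); here it only logs. */
static void post_send(struct conn *c, int frag)
{
    assert(may_send(c));         /* never violate the initiator-sends-first rule */
    assert(c->n_posted < MAX_LOG);
    c->posted[c->n_posted++] = frag;
}

/* Post immediately if allowed, otherwise queue. Returns false if the queue is full. */
static bool conn_send(struct conn *c, int frag)
{
    if (may_send(c)) {
        post_send(c, frag);
        return true;
    }
    if (c->n_deferred == MAX_DEFERRED)
        return false;
    c->deferred[c->n_deferred++] = frag;
    return true;
}

/* Call on every receive completion; the acceptor's first one flushes the queue in FIFO order. */
static void conn_on_recv(struct conn *c)
{
    if (c->role != CONN_ACCEPTOR || c->peer_spoke_first)
        return;
    c->peer_spoke_first = true;
    for (size_t i = 0; i < c->n_deferred; i++)
        post_send(c, c->deferred[i]);
    c->n_deferred = 0;
}
```

Queuing instead of failing the send keeps the deferral invisible to the
layer above the btl; a real implementation would also have to bound or
grow the queue and drain it on connection teardown.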
>
>
>FYI:
>
>IETF Draft requiring this behavior:
>
>http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt
>
>See section 7 for specifics.
>
>Steve.
>
