[ewg] Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

Jeff Squyres jsquyres at cisco.com
Wed May 9 14:22:44 PDT 2007


I talked with Steve a bunch on the phone about this.

1. This "connector must RDMA first" issue is an iWARP restriction --  
it's not specific to udapl or verbs.  For example, if you try to use  
udapl with iWARP on Solaris, you'll have the same issue (I have no  
idea whether you have iWARP drivers in Solaris or not).

2. Per his prior e-mail (which I didn't fully grok until I talked to  
him), using the RDMA CM in the openib BTL will not magically fix this  
issue for us.

3. So for any of the BTLs to support iWARP -- regardless of  
underlying protocol or OS -- they are going to have to obey this  
restriction.

4. Luckily, in iWARP, the restriction can be met by either  
send/receive semantics *or* RDMA semantics.  You don't have to  
specifically use RDMA verbs semantics, for example.  This is good  
because of the way that OMPI works (the first fragment that will be  
transmitted is pretty much guaranteed to be a send/receive fragment,  
not an RDMA fragment) -- it makes the logistics slightly simpler.

Galen Shipman and I talked about this a bit and suggest the following:

- During the connection dance (probably for both the udapl and openib  
BTLs), whichever peer ends up being the connection initiator can send  
its pending fragment immediately.  Don't forget about the race  
condition where 2 peers may simultaneously decide to initiate; this  
case is handled properly in the OMPI code, but make sure you modify  
the side that ends up being the actual initiator.  Steve is right  
that there will always be a pending fragment, because OMPI doesn't  
make a connection until the first send.

- The other peer (the receiver of the connection) must wait to send  
its pending fragment(s) until it receives the first frag from the  
connection initiator.  This can be accomplished either with another  
flag on the OMPI module struct or by making it part of the  
connection protocol (i.e., don't transition the endpoint to  
CONNECTED until the first fragment is received).  Either approach can  
be used to queue up fragments on the receiver until the first  
fragment is received from the initiator.  I'd have to look deeper in  
the code, but I'm *guessing* that it might be best to use the  
already-existing state flag (i.e., checking for CONNECTED) because  
then you won't be introducing any more conditionals in the critical  
path.
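
To make the two bullets above concrete, here is a minimal sketch of the
"don't go CONNECTED until the first fragment arrives" variant.  This is
not OMPI code: every name in it (endpoint_t, ep_send,
ep_connection_established, ep_fragment_received, the EP_* states,
MAX_FRAGS) is hypothetical, fragments are reduced to plain ints, and
"the wire" is just an array.  It assumes the simultaneous-initiate race
has already been resolved, so exactly one side has is_initiator set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAGS 16  /* arbitrary capacity for this sketch */

typedef enum {
    EP_CONNECTING,       /* transport connection not yet established */
    EP_WAIT_FIRST_FRAG,  /* connected, but the initiator has not sent yet */
    EP_CONNECTED         /* free to send */
} ep_state_t;

typedef struct {
    ep_state_t state;
    bool is_initiator;       /* hypothetical: set once the race is resolved */
    int pending[MAX_FRAGS];  /* fragments queued until sending is allowed */
    size_t num_pending;
    int sent[MAX_FRAGS];     /* stand-in for the wire */
    size_t num_sent;
} endpoint_t;

void ep_init(endpoint_t *ep, bool is_initiator) {
    ep->state = EP_CONNECTING;
    ep->is_initiator = is_initiator;
    ep->num_pending = 0;
    ep->num_sent = 0;
}

/* Stand-in for actually posting a fragment to the network. */
int ep_post(endpoint_t *ep, int frag) {
    if (ep->num_sent == MAX_FRAGS) return -1;
    ep->sent[ep->num_sent++] = frag;
    return 0;
}

/* Post every queued fragment, in order. */
int ep_flush_pending(endpoint_t *ep) {
    for (size_t i = 0; i < ep->num_pending; i++) {
        if (ep_post(ep, ep->pending[i]) != 0) return -1;
    }
    ep->num_pending = 0;
    return 0;
}

/* Send path: the only check is the already-existing CONNECTED test. */
int ep_send(endpoint_t *ep, int frag) {
    if (ep->state == EP_CONNECTED) return ep_post(ep, frag);
    if (ep->num_pending == MAX_FRAGS) return -1;
    ep->pending[ep->num_pending++] = frag;
    return 0;
}

/* Called on both peers when the transport-level connection completes. */
int ep_connection_established(endpoint_t *ep) {
    assert(ep->state == EP_CONNECTING);
    if (ep->is_initiator) {
        /* The initiator may send right away. */
        ep->state = EP_CONNECTED;
        return ep_flush_pending(ep);
    }
    /* The receiver of the connection keeps queueing. */
    ep->state = EP_WAIT_FIRST_FRAG;
    return 0;
}

/* Called whenever a fragment arrives from the peer. */
int ep_fragment_received(endpoint_t *ep) {
    if (ep->state == EP_WAIT_FIRST_FRAG) {
        ep->state = EP_CONNECTED;
        return ep_flush_pending(ep);
    }
    return 0;
}
```

The point of the sketch is that ep_send() gains no new branch: the
receiver side simply stays out of EP_CONNECTED a little longer, and the
extra check lives in the receive path, which only matters until the
first fragment shows up.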




On May 9, 2007, at 4:45 PM, Donald Kerr wrote:

> I guess I have not read enough about iwarp yet but if iwarp is sitting
> below ib verbs or udapl in the stack and is trying to impose
> restrictions which ib verbs or udapl do not adhere to then maybe iwarp
> is in the wrong place in the ofed stack.
>
> Having said that I do agree the OMPI community needs to consider where
> iwarp plays in its own stack. If it has not already.
>
> Steve Wise wrote:
>
>> On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
>>
>>
>>> So then I agree with Andrew, I think you are trying to impose
>>> restrictions on uDAPL which are not part of the Spec.
>>>
>>>
>>>
>>
>> true, but if you want a single btl for IB and IW, then you'll need to
>> address this issue in some way...
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel at open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>


-- 
Jeff Squyres
Cisco Systems
