[ewg] Re: [OMPI devel] OMPI over ofed udapl - bugs opened

Steve Wise swise at opengridcomputing.com
Wed May 9 15:15:15 PDT 2007


On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote:
> 
> Steve Wise wrote:
> > On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
> >> Steve Wise wrote:
> >>> There have been a series of discussions on the ofa general list about
> >>> this issue, and the conclusion to date is that it cannot be resolved in
> >>> the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
> >>> sending an RDMA message involves the ULP's work queue and completion
> >>> queue, so the CM cannot do this under the covers in a manner that
> >>> doesn't affect the application.  Thus, the applications must deal with
> >>> this.
> >> Why can't uDAPL deal with this?  As a uDAPL user, I really don't care 
> >> what API uDAPL is using under the hood to move data from one place to 
> >> another, nor the quirks of that API.  The whole point of uDAPL is to 
> >> form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec 
> >> doesn't enforce any such requirement on RDMA communication either.  In 
> >> my opinion, exposing such behavior above uDAPL is incorrect and is part 
> >> of why uDAPL has seen limited adoption -- every single uDAPL 
> >> implementation behaves in different ways, making it extremely difficult 
> >> to write an application to work on any uDAPL implementation.  Sorry if 
> >> this sounds harsh, but this comes from many hours of banging my head on 
> >> the wall due to working around these sorts of problems :)
> >>
> > 
> > I understand your frustration.  I think the MPA protocol is deficient in
> > this respect and should have required the necessary "first FPDU" to be
> > sent under the covers by the RNICs. A RTR packet if you will.  To
> > resolve this issue "properly", in my opinion, would involve changing the
> > IETF MPA spec and also breaking all the existing iwarp HW.  We can't do
> > that.
> 
> Understood.
> 
> > The reason it is hard or impossible to solve this in the DAPL layer is
> > that any rdma operation on the QP affects the state of that QP and the
> > associated CQs.  In addition, if you use an RDMA send to enforce this you
> > impact the other side by consuming a RECV buffer. So it's hard if not
> > impossible to do this under the covers without affecting the
> > application's resources.
> 
> Is there no way to do this before passing connection established events 
> to the uDAPL consumer?  I need to go read up on the uDAPL API to really 
> understand why this wouldn't work.
> 

Perhaps the dapl or maybe even an OFA iWARP CM could defer passing up the
"established" event on the passive side until an incoming SEND is
detected.  I know we've discussed this before, but I'm not sure why this
was not a workable solution.  Perhaps Caitlin or some iwarp folks can
recall?  
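
To make the deferral idea concrete, here is a rough sketch as a toy state
machine.  It is not DAPL or rdma-cm code: the PassiveEndpoint class, its
method names and the event strings are all made up for illustration, and it
assumes the provider can observe the first incoming SEND before the consumer
does.

```python
from enum import Enum, auto


class State(Enum):
    ACCEPTED = auto()      # accept completed, but no inbound message seen yet
    ESTABLISHED = auto()   # first inbound message seen; event delivered upward


class PassiveEndpoint:
    """Toy passive-side endpoint (hypothetical; not a DAPL or rdma-cm API)."""

    def __init__(self):
        self.state = State.ACCEPTED
        self.events = []   # events delivered to the consumer, in order

    def on_incoming_send(self, payload):
        # The first inbound SEND from the active side releases the deferred
        # "established" event; later SENDs are delivered as plain receives.
        if self.state is State.ACCEPTED:
            self.state = State.ESTABLISHED
            self.events.append("ESTABLISHED")
        self.events.append(("RECV", payload))

    def can_send(self):
        # The consumer only learns of the connection once sending is safe.
        return self.state is State.ESTABLISHED
```

In this model the consumer on the passive side never sees "ESTABLISHED"
until the active side has sent something, so it cannot send too early.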

> > 
> > Also, the DAPL specification had a goal to not impose any additional
> > protocol on the wire.  If you add this under the covers, then you add
> > such a "protocol" and break interoperability between a connection
> > accessed via DAPL on one end and some other API on the other end.
> 
> So I guess there's no 'right' solution, at least at the uDAPL level. 
> With RDMACM/OFA verbs, there's at least the argument that you can design 
> the API/semantics however you please, while uDAPL is already standardized.

Yes, but it's still difficult to post a SEND under the covers because it
consumes the application resources in the form of QP and CQ space and a
RECV buffer.

So to date, we have...punted and pushed the problem to the ULP.
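
The resource cost can be shown with a toy accounting model.  This is only a
sketch: the Endpoint fields and the hidden_send() helper are hypothetical
names, not part of the verbs or DAPL APIs, and the counters simply stand in
for QP space, CQ space and posted RECV buffers.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Toy resource counters (hypothetical names, not a verbs or DAPL API)."""
    sq_free: int        # free send-queue slots on the QP
    cq_free: int        # free completion-queue entries
    recvs_posted: int   # RECV buffers the application has posted


def hidden_send(sender: Endpoint, receiver: Endpoint) -> None:
    """Model a SEND posted 'under the covers' by the CM or DAPL layer."""
    if sender.sq_free == 0 or sender.cq_free == 0:
        raise RuntimeError("sender has no QP/CQ space left")
    if receiver.recvs_posted == 0:
        raise RuntimeError("receiver has no RECV buffer posted")
    sender.sq_free -= 1          # the work request occupies a send-queue slot
    sender.cq_free -= 1          # its completion occupies a CQ entry
    receiver.recvs_posted -= 1   # the message consumes one of the peer's RECVs
```

Every counter that hidden_send() decrements is one the application sized
for its own traffic, which is why the hidden message is not really hidden.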

> 
> I hope you guys are documenting this in a way that makes this issue 
> extremely clear to both uDAPL and OFA verbs (is this the right naming?) 
> users.  Maybe it's been done already, but is it possible to emit some 
> sort of loud warning/error when the accept()'ing side tries to send 
> before a receive?
> 

The connection comes tumbling down.  How's that for loud? :)

Seriously though, it isn't documented well enough.  But we're bleeding
edge here. And I'm still hoping somebody will come up with an elegant
solution that doesn't break interoperability, applications and/or iwarp
hw (I'm a dreamer :).


Steve.