[dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

Mon Feb 6 16:49:47 PST 2006

I am not clear on what you are proposing.
A transport-specific API?

The current proposal provides, on the sending side,
a single post and a single completion in the error-free case.
This commonality simplifies the ULP.

Arkady

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.               phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
Waltham, MA 02451                   central phone: 781-768-5300

> -----Original Message-----
> From: Larsen, Roy K [mailto:roy.k.larsen at intel.com] 
> Sent: Monday, February 06, 2006 6:50 PM
> To: Kanevsky, Arkady; Caitlin Bestler; 
> dat-discussions at yahoogroups.com; Sean Hefty
> Cc: openib-general at openib.org
> Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 
> immediatedataproposal
> 
> 
> 
> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
> >Sent: Monday, February 06, 2006 2:27 PM
> >
> >Roy,
> >comments inline.
> >
> 
> Mine too....
> 
> >>
> >> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
> >> >Roy,
> >> >Can you explain, please?
> >> >
> >> >For IB the operation will be layered properly on the Transport primitive.
> >> >And on the Recv side it will indicate in the completion event DTO that it
> >> >matches RDMA Write with Immediate and that Immediate Data is in the event.
> >> >
> >> >For iWARP I expect that, initially, it will be layered on RDMA Write
> >> >followed by Send. The Provider can do the post more efficiently than
> >> >the Consumer and guarantee atomicity.
> >> >On the Recv side the Consumer will get a Recv DTO completion in the event
> >> >and Immediate Data inline as specified by the Provider Attribute.
> >> >
> >> >From the performance point of view, Consumers who program to IB only
> >> >will have no performance degradation at all. But this API also allows
> >> >Consumers to write a ULP to be transport independent with minimal
> >> >penalty: one binary comparison and an extra 4 bytes in the recv buffer.
> >>
> >> If the application could be written transport independently, I would
> >> have no objection at all.  Instead, it must be written in a
> >> transport-adaptive way, and to be able to adapt to all possible
> >> implementations, the application could not send arbitrary
> >> "immediate"-sized data as messages because there is no way to
> >> distinguish between them on the receiving side.  That is HUGE!  It is
> >> my experience that send/receive is generally used for small messages
> >> and to take away particular message sizes or to depend on the so the
> >> application can "adapt" to whatever the immediate size is for a
> >> particular transport, if even needed, is a very weak facility to
> >> offer.
> >
> >But the remote side does post a Recv. Since it anticipates that this Recv
> >will be matched against the RDMA Write with immediate, it posts a recv
> >buffer which fits. Yes, there is an issue for a Transport-independent ULP
> >in that it does need a buffer.
> >For IB it is possible to post a 0-size buffer. But if this is the case, the
> >Recv end Consumer DOES know that it will be matched against an RDMA Write,
> >so the ULP DOES know what it will be matched against.
> >So in the worst case the Consumer does have to pay the price of creating an
> >LMR to handle a 4 byte buffer to match the RDMA Write Immediate data.
> 
> I think you missed my larger point.  The point was that the
> application must be written in such a way that it could be
> inferred when immediate data arrived for a variety of
> immediate data sizes, and that places a constraint on the
> application with respect to data it may want to send/receive
> normally.  Whereas, if the application embraced the fact that it
> was responsible for sending a message to indicate a write
> completion, it is free to send whatever amount of data best
> meets its needs.
> 
> Transports that support true immediate data do not require 
> the ULP to perform buffer matching.  They can post a series 
> of receive buffers that may or may not indicate immediate 
> data.  The ULP does not have to know ahead of time when 
> immediate data will arrive **against other data receives**.  
> The fact that an IB oriented application never needs to back 
> a receive request with a buffer if they were only used to 
> indicate immediate data is orthogonal.
> 
> >
> >>
> >> It also affects interface resource allocation.  Send queue sizes will
> >> have to adapt to possibly twice their size.
> >>
> >
> >That is correct. We argued about it at the meeting.
> >One alternative is to have EP and EVD attr. But this will not be
> >efficient since it will double the queue size where a smaller increment
> >is possible due to the depth of the RDMA Write pipeline outstanding.
> >
> >> It just dawned on me that the immediate data must be in registered
> >> memory to be sent in a message.  This means the API must be amended
> >> to pass an LMR or, even worse, the provider would have to register
> >> memory in the speed path or create and manipulate its own queue of
> >> "immediate" data buffers/LMRs.  Of course, LMRs are not needed and are
> >> an overhead for transports that provide true immediate data.
> >
> >No registration on the speed path. It is the Consumer's responsibility to
> >provide a Recv Buffer of the right size.
> >Yes, for an IB-only ULP this can be avoided.
> >A ULP can be written to the proposed API to take full advantage of IB
> >performance, but that code will not be transport independent.
> 
> I was referring to the sending side.  Source data of a 
> message send must be from registered memory.  For transports 
> that will emulate this service with a write/send sequence, 
> user specified immediate data will need to be copied to a 
> provider managed pool of "immediate" data buffers/LMRs or the 
> interface changed to specify an LMR.
> 
> >
> >But this API allows writing transport-independent code, albeit with a
> >certain price attached.
> >
> >>
> >> Oh, and another thing.  InfiniBand indicates the size of the RDMA
> >> write in the receive completion.  That is something that will have to
> >> be addressed in a "transport independent" way or dropped as part of
> >> the service.
> >
> >Good point. I will augment Spec accordingly.
> >
> >>
> >> The bottom line here is that it is NOT transport independent.
> >
> >The implementation is not transport independent.
> >But the API allows writing a Transport-specific ULP with full performance
> >as well as a Transport-independent ULP with better performance than without
> >the proposed API and with a "minimal" performance penalty for Transports
> >that provide it.
> 
> Of course, you can make the application as transport service
> adaptive as you want but that is a weak argument and a
> slippery slope.  My point is that the operational semantics of
> non-native immediate data transports are identical to
> write/send in all respects.  So, embrace this and just give
> the ULP a simple interface that has broader applicability for
> all transports. Provide a thread atomic combined request
> capability which can be used for write completion
> notification (if not natively supported) or any other purpose
> an application may fancy.
> 
> >
> >>
> >> Now, the atomicity argument between write and send has some 
> >> credibility.
> >> If an application chooses to "adapt" to an explicit write/send
> >> semantic for write completion notification in environments that can't
> >> provide it natively, this could be addressed by a generalized
> >> combined request API that can guarantee thread-based atomicity to the
> >> send queue.  This seems much more straightforward to me since, in
> >> essence, to adapt to non-native immediate data services, they would
> >> have to allocate resources and behave in virtually the same way as if
> >> they did write/send explicitly.
> >>
> >> It is obvious that the proposed service is not one of immediate data
> >> in the sense defined by InfiniBand.  Since true immediate data is a
> >> transport specific speed path service, it needs to be implemented as
> >> a transport specific extension.  To allow an application to initiate
> >> multiple request sequences that must be queued sequentially to
> >> explicitly create a write completion notification or any other
> >> order-based sequence, a generalized combined request API should be
> >> defined.
> >
> >
> >No disagreement here. We were debating a generic way to combine multiple
> >DTOs into a single call for some time.
> >But how to define a generic way to do it and to have a single completion
> >on both ends of the connection in the successful case was always a problem.
> 
> I would think an array of pointers and a count to standard
> work requests would do it.  And of course, each work request
> can control whether it solicits a completion so a write/send
> sequence can generate a single completion event on both ends.
> Use the EVD lock to guard against other threads injecting
> requests on the queue during a combined request operation and
> the ULP has everything it needs.
> 
> Roy
> 
> >
> >>
> >> >
> >> >Arkady Kanevsky                       email: arkady at netapp.com
> >> >Network Appliance Inc.               phone: 781-768-5395
> >> >1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
> >> >Waltham, MA 02451                   central phone: 781-768-5300
> >> >
> >> >
> >>
>
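
For illustration, here is a rough sketch of the combined request idea
quoted above: an array of pointers plus a count, queued under one lock so
no other thread can inject a request in the middle of the sequence. All
names here (work_request, send_queue, post_combined, SQ_DEPTH) are
hypothetical and are not part of DAT or of any verbs interface. An
atomic_flag spinlock stands in for the EVD lock, and the queue is a
fixed-size array rather than a real send queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not DAT or verbs definitions. */
enum wr_opcode { WR_RDMA_WRITE, WR_SEND };

struct work_request {
    enum wr_opcode opcode;
    int signaled;                 /* nonzero: request solicits a completion */
};

#define SQ_DEPTH 8                /* assumed queue depth for the sketch */

struct send_queue {
    atomic_flag lock;             /* stands in for the EVD lock */
    size_t count;                 /* requests currently queued */
    struct work_request slots[SQ_DEPTH];
};

void sq_init(struct send_queue *q)
{
    atomic_flag_clear(&q->lock);
    q->count = 0;
}

/*
 * Queue all n requests back to back, or none of them.
 * Returns 0 on success, -1 if the sequence does not fit.
 */
int post_combined(struct send_queue *q,
                  const struct work_request *const *wrs, size_t n)
{
    assert(q != NULL && wrs != NULL);
    int rc = -1;

    /* Take the lock so no other thread can interleave a request. */
    while (atomic_flag_test_and_set(&q->lock))
        ;                         /* spin until the lock is free */

    if (n <= SQ_DEPTH - q->count) {
        for (size_t i = 0; i < n; i++)
            q->slots[q->count++] = *wrs[i];
        rc = 0;
    }

    atomic_flag_clear(&q->lock);
    return rc;
}

/* Count queued requests that would generate a completion. */
size_t signaled_count(const struct send_queue *q)
{
    size_t s = 0;
    for (size_t i = 0; i < q->count; i++)
        if (q->slots[i].signaled)
            s++;
    return s;
}
```

Under these assumptions, a ULP would post an RDMA Write with signaled = 0
followed by a Send with signaled = 1 in one post_combined() call, so the
pair occupies two adjacent send queue slots and solicits one completion.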