[dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

Kanevsky, Arkady Arkady.Kanevsky at netapp.com
Mon Feb 6 14:26:52 PST 2006


Roy,
comments inline.

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.               phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
Waltham, MA 02451                   central phone: 781-768-5300
 

> -----Original Message-----
> From: Larsen, Roy K [mailto:roy.k.larsen at intel.com] 
> Sent: Monday, February 06, 2006 4:25 PM
> To: Kanevsky, Arkady; Caitlin Bestler; 
> dat-discussions at yahoogroups.com; Sean Hefty
> Cc: openib-general at openib.org
> Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 
> immediatedataproposal
> 
> 
> 
> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
> >Roy,
> >Can you explain, please?
> >
> >For IB the operation will be layered properly on Transport primitive.
> >And on Recv side it will indicate in completion event DTO that it 
> >matches RDMA Write with Immediate and that Immediate Data is 
> in event.
> >
> >For iWARP I expect initially, it will be layered on RDMA 
> Write followed 
> >by Send. The Provider can do post more efficiently than Consumer and 
> >guarantee atomicity.
> >On Recv side Consumer will get Recv DTO completion in event and 
> >Immediate Data inline as specified by Provider Attribute.
> >
> >From the performance point of view Consumers who program to IB only 
> >will have no performance degradation at all. But this API 
> also allows 
> >Consumers to write ULP to be transport independent with minimal 
> >penalty: one binary comparison and extra 4 bytes in recv buffer.
> 
> If the application could be written transport independently, 
> I would have no objection at all.  Instead, it must be 
> written in a transport-adaptive way and to be able to adapt 
> to all possible implementations, the application could not 
> send arbitrary "immediate"-sized data as messages because 
> there is no way to distinguish between them on the receiving 
> side.  That is HUGE!  It is my experience that send/receive 
> is generally used for small messages and to take away 
> particular message sizes or to depend on the so the 
> application can "adapt" to whatever the immediate size is for 
> a particular transport, if even needed, is a very weak 
> facility to offer.

But the remote side does post a Recv. Since it anticipates that
this Recv will be matched against the RDMA Write with Immediate,
it posts a Recv buffer that fits. Yes, there is an issue
for a Transport-independent ULP: it does need a buffer.
For IB it is possible to post a 0-size buffer. But if this is the case,
the Recv-end Consumer DOES know that it will be matched against an RDMA
Write, so the ULP DOES know what it will be matched against.
So in the worst case the Consumer does have to pay the price of creating
an LMR to handle a 4-byte buffer to match the RDMA Write Immediate Data.
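
The receive-side handling described above can be modelled roughly as
follows. This is only a toy sketch, not DAT code: RecvCompletion,
immediate_data and the provider_delivers_in_event flag are hypothetical
names standing in for the completion event and the Provider Attribute,
and the network byte order of the 4 bytes is an assumption.

```python
import struct
from dataclasses import dataclass
from typing import Optional

IMMEDIATE_SIZE = 4  # bytes of immediate data, as discussed in this thread


@dataclass
class RecvCompletion:
    """Hypothetical stand-in for a Recv DTO completion event."""
    expects_immediate: bool          # the ULP's own knowledge of what this Recv was posted for
    immediate_in_event: Optional[int]  # filled in when the transport delivers it natively
    recv_buffer: bytes               # contents of the posted Recv buffer


def immediate_data(completion: RecvCompletion,
                   provider_delivers_in_event: bool) -> Optional[int]:
    """Return the immediate data, or None if this Recv matched an ordinary Send."""
    if not completion.expects_immediate:
        return None
    # The single comparison a transport-independent ULP pays for:
    # is the value in the event, or in the first 4 bytes of the buffer?
    if provider_delivers_in_event:
        return completion.immediate_in_event
    # Emulated path (RDMA Write followed by Send): assumed network byte order.
    (value,) = struct.unpack("!I", completion.recv_buffer[:IMMEDIATE_SIZE])
    return value
```

With a native transport the buffer can stay empty, e.g.
immediate_data(RecvCompletion(True, 0xCAFE, b""), True); on the emulated
path the same value would come from a 4-byte buffer.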

> 
> It also affects interface resource allocation.  Send queue 
> sizes will have to adapt to possibly twice their size.
> 

That is correct. We argued about it at the meeting.
One alternative is to have EP and EVD attributes. But this would not
be efficient, since it would double the queue size where
a smaller increment is possible, given the depth of the outstanding
RDMA Write pipeline.
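
To make the sizing argument concrete, here is a small arithmetic sketch.
It rests on one reading of the discussion: an emulated RDMA Write with
Immediate consumes two send queue slots (the Write plus the Send), so
only the requests that can be outstanding in that form need an extra
slot. The function and parameter names are hypothetical.

```python
def send_queue_depth_doubled(max_requests: int) -> int:
    """Worst-case sizing: assume every request is an emulated Write with Immediate."""
    return 2 * max_requests


def send_queue_depth_incremental(max_requests: int,
                                 max_outstanding_write_imm: int) -> int:
    """Sizing by pipeline depth: one extra slot per outstanding Write with Immediate."""
    if not 0 <= max_outstanding_write_imm <= max_requests:
        raise ValueError("outstanding Writes with Immediate cannot exceed total requests")
    return max_requests + max_outstanding_write_imm
```

For a queue of 64 requests with at most 8 Writes with Immediate in
flight, the first function asks for 128 slots and the second for 72.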

> It just dawned on me that the immediate data must be in 
> registered memory to be sent in a message.  This means the 
> API must be amended to pass an LMR or, even worse, the 
> provider would have to register memory in the speed path or 
> create and manipulate its own queue of "immediate"
> data buffers/LMRs.  Of course, LMRs are not needed and an 
> overhead for transports that provide true immediate data.

No registration on the speed path. It is the Consumer's responsibility
to provide a Recv buffer of the right size.
Yes, for an IB-only ULP this can be avoided.
A ULP can be written to the proposed API to take full
advantage of IB performance, but that code will not be transport
independent.

But this API allows writing transport-independent code,
albeit with a certain price attached.

> 
> Oh, and another thing.  InfiniBand indicates the size of the 
> RDMA write in the receive completion.  That is something that 
> will have to be addressed in a "transport independent" way or 
> dropped as part of the service.

Good point. I will augment the Spec accordingly.

> 
> The bottom line here is that it is NOT transport independent. 

The implementation is not transport independent.
But the API allows writing a Transport-specific ULP with full performance,
as well as a Transport-independent ULP with better performance
than without the proposed API and with a "minimal" performance
penalty for Transports that provide it.

> 
> Now, the atomicity argument between write and send has some 
> credibility.
> If an application chooses to "adapt" to an explicit 
> write/send semantic for write completion notification in 
> environments that can't provide it natively, this could be 
> addressed by a generalized combined request API that can 
> guarantee thread-based atomicity to the send queue.  This 
> seems much more straightforward to me since, in essence, to 
> adapt to non-native immediate data services, they would have 
> to allocate resources and behave in virtually the same way as 
> if they did write/send explicitly. 
> 
> It is obvious that the proposed service is not one of 
> immediate data in the sense defined by InfiniBand.  Since 
> true immediate data is a transport specific speed path 
> service, it needs to be implemented as a transport specific 
> extension.  To allow an application to initiate multiple 
> request sequences that must be queued sequentially to 
> explicitly create a write completion notification or any 
> other order-based sequence, a generalized combined request 
> API should be defined.


No disagreement here. We have been debating a generic way to combine
multiple DTOs into a single call for some time.
But how to define a generic way to do it and to have a single completion
on both ends of the connection in the successful case was always a problem.
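
The atomicity half of a combined request can be sketched as below. This
is a toy model only: SendQueue, post_combined and write_then_send are
hypothetical names, not part of any DAT API. The sketch shows only that
a group of requests posted under one lock cannot be interleaved with
posts from other threads; it does not address the single-completion
question raised above.

```python
import threading
from typing import List, Tuple


class SendQueue:
    """Toy send queue (not a DAT endpoint): an ordered list guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.entries: List[Tuple[str, str]] = []

    def post(self, op: str, tag: str) -> None:
        """Post a single request."""
        with self._lock:
            self.entries.append((op, tag))

    def post_combined(self, ops: List[str], tag: str) -> None:
        """Post several requests back to back; no other post can land in between."""
        with self._lock:
            for op in ops:
                self.entries.append((op, tag))


def write_then_send(queue: SendQueue, tag: str) -> None:
    """Emulated Write with Immediate: an RDMA Write followed directly by a Send."""
    queue.post_combined(["rdma_write", "send"], tag)
```

If several threads call write_then_send on the same queue, each
rdma_write entry is followed immediately by the send carrying the same
tag, which is the ordering a Consumer posting the two requests
separately could not guarantee without its own locking.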

> 
> >
> >Arkady Kanevsky                       email: arkady at netapp.com
> >Network Appliance Inc.               phone: 781-768-5395
> >1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
> >Waltham, MA 02451                   central phone: 781-768-5300
> >
> >
> 


