[dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

Larsen, Roy K roy.k.larsen at intel.com
Mon Feb 6 15:49:48 PST 2006



>From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
>Sent: Monday, February 06, 2006 2:27 PM
>
>Roy,
>comments inline.
>

Mine too....

>>
>> >From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
>> >Roy,
>> >Can you explain, please?
>> >
>> >For IB the operation will be layered properly on the Transport
>> >primitive. And on the Recv side the completion event will indicate
>> >that the DTO matches an RDMA Write with Immediate and that the
>> >Immediate Data is in the event.
>> >
>> >For iWARP I expect that, initially, it will be layered on an RDMA
>> >Write followed by a Send. The Provider can do the post more
>> >efficiently than the Consumer and guarantee atomicity.
>> >On the Recv side the Consumer will get a Recv DTO completion in the
>> >event, with the Immediate Data inline as specified by a Provider
>> >Attribute.
>> >
>> >From the performance point of view, Consumers who program to IB
>> >only will have no performance degradation at all. But this API also
>> >allows Consumers to write a ULP that is transport independent, with
>> >minimal penalty: one binary comparison and an extra 4 bytes in the
>> >recv buffer.
>>
>> If the application could be written transport independently,
>> I would have no objection at all.  Instead, it must be
>> written in a transport-adaptive way, and to be able to adapt
>> to all possible implementations, the application could not
>> send arbitrary "immediate"-sized data as messages because
>> there is no way to distinguish between them on the receiving
>> side.  That is HUGE!  It is my experience that send/receive
>> is generally used for small messages, and to take away
>> particular message sizes, or to depend on the application
>> "adapting" to whatever the immediate size is for a particular
>> transport, if even needed, is a very weak facility to offer.
>
>But the remote side does post a Recv. Since it anticipates that
>this Recv will be matched against the RDMA Write with Immediate,
>it posts a recv buffer which fits. Yes, there is an issue for a
>Transport-independent ULP in that it does need a buffer.
>For IB it is possible to post a 0-size buffer. But if this is the
>case, the Recv-end Consumer DOES know that it will be matched
>against an RDMA Write, so the ULP DOES know what it will be
>matched against. So in the worst case the Consumer does have to
>pay the price of creating an LMR to handle a 4-byte buffer to
>match the RDMA Write Immediate data.
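
To make that price concrete: the worst-case receive path is roughly the
following, assuming the DAT 2.0 headers (post_imm_recv and its
arguments are illustrative, with the LMR created once at startup):

#include <stdint.h>
#include <dat2/udat.h>

/* Re-arm a 4-byte Recv to match the next emulated Write with
 * Immediate.  imm_lmr is an LMR the Consumer created once over
 * imm_buf -- the registration price mentioned above. */
static DAT_RETURN post_imm_recv(DAT_EP_HANDLE ep,
                                DAT_LMR_CONTEXT imm_lmr,
                                uint32_t *imm_buf)
{
    DAT_LMR_TRIPLET iov;
    DAT_DTO_COOKIE  cookie;

    iov.lmr_context     = imm_lmr;
    iov.virtual_address = (DAT_VADDR) (uintptr_t) imm_buf;
    iov.segment_length  = sizeof(*imm_buf);
    cookie.as_64        = 0;

    return dat_ep_post_recv(ep, 1, &iov, cookie,
                            DAT_COMPLETION_DEFAULT_FLAG);
}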

I think you missed my larger point.  The point was that the application
must be written in such a way that it can be inferred when immediate
data arrives, for a variety of immediate data sizes, and that places a
constraint on the application with respect to the data it may want to
send/receive normally.  Whereas, if the application embraced the fact
that it is responsible for sending a message to indicate a write
completion, it would be free to send whatever amount of data best meets
its needs.

Transports that support true immediate data do not require the ULP to
perform buffer matching.  They can post a series of receive buffers,
any of which may or may not end up indicating immediate data.  The ULP
does not have to know ahead of time when immediate data will arrive
*as opposed to other data receives*.  The fact that an IB-oriented
application never needs to back a receive request with a buffer, if
receives are only used to indicate immediate data, is orthogonal.
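
For what it's worth, on such transports the distinction is a single
flag test at completion time.  A minimal sketch using the gen2 verbs
(the two on_* handlers are placeholders, not part of any API):

#include <arpa/inet.h>
#include <stdint.h>
#include <infiniband/verbs.h>

extern void on_write_notification(uint32_t imm, uint32_t write_len);
extern void on_message(uint64_t wr_id, uint32_t msg_len);

/* The same posted receive can complete as an ordinary Send or as an
 * RDMA Write with Immediate; the work completion says which. */
void classify_recv_completion(const struct ibv_wc *wc)
{
    if (wc->status != IBV_WC_SUCCESS)
        return;

    if (wc->wc_flags & IBV_WC_WITH_IMM)
        /* Write with Immediate: byte_len reports the size of the
         * remote write, imm_data carries the 32 immediate bits. */
        on_write_notification(ntohl(wc->imm_data), wc->byte_len);
    else
        /* An ordinary inbound message of arbitrary size. */
        on_message(wc->wr_id, wc->byte_len);
}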

>
>>
>> It also affects interface resource allocation.  Send queues
>> will have to adapt to possibly twice their size.
>>
>
>That is correct. We argued about it at the meeting.
>One alternative is to have EP and EVD attrs. But this will not
>be efficient, since it will double the queue size where a
>smaller increment is possible, based on the depth of the
>outstanding RDMA Write pipeline.
>
>> It just dawned on me that the immediate data must be in
>> registered memory to be sent in a message.  This means the
>> API must be amended to pass an LMR or, even worse, the
>> provider would have to register memory in the speed path or
>> create and manipulate its own queue of "immediate"
>> data buffers/LMRs.  Of course, LMRs are not needed, and are
>> pure overhead, for transports that provide true immediate data.
>
>No registration on the speed path. It is the Consumer's
>responsibility to provide a Recv Buffer of the right size.
>Yes, for an IB-only ULP this can be avoided.
>A ULP can be written to the proposed API to take full
>advantage of IB performance, but that code will not be
>transport independent.

I was referring to the sending side.  The source data of a message send
must come from registered memory.  For transports that will emulate this
service with a write/send sequence, user-specified immediate data will
need to be copied into a provider-managed pool of "immediate" data
buffers/LMRs, or the interface changed to specify an LMR.
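
To illustrate what that emulation path looks like (sketched with the
gen2 verbs for concreteness; struct imm_pool and imm_pool_get are
assumed provider internals that copy the 4 bytes into a slot of a
pre-registered ring):

#include <stdint.h>
#include <infiniband/verbs.h>

struct imm_pool;
struct ibv_sge imm_pool_get(struct imm_pool *pool, uint32_t imm);

/* Emulate Write-with-Immediate as one atomic post of two chained
 * work requests: the data Write, then a 4-byte Send.  Assumes the
 * QP was created with sq_sig_all = 0 so only the Send completes. */
static int emulated_write_imm(struct ibv_qp *qp, struct ibv_sge *payload,
                              uint64_t raddr, uint32_t rkey,
                              uint32_t imm, struct imm_pool *pool)
{
    struct ibv_send_wr write_wr = {0}, send_wr = {0}, *bad_wr;
    struct ibv_sge imm_sge = imm_pool_get(pool, imm);  /* the copy */

    write_wr.opcode              = IBV_WR_RDMA_WRITE;
    write_wr.sg_list             = payload;
    write_wr.num_sge             = 1;
    write_wr.wr.rdma.remote_addr = raddr;
    write_wr.wr.rdma.rkey        = rkey;
    write_wr.next                = &send_wr;   /* chained: no other
                                                  thread can interleave */

    send_wr.opcode     = IBV_WR_SEND;
    send_wr.sg_list    = &imm_sge;
    send_wr.num_sge    = 1;
    send_wr.send_flags = IBV_SEND_SIGNALED;    /* one completion for
                                                  the pair */

    return ibv_post_send(qp, &write_wr, &bad_wr);
}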

>
>But this API allows one to write transport-independent code,
>albeit with a certain price attached.
>
>>
>> Oh, and another thing.  InfiniBand indicates the size of the
>> RDMA write in the receive completion.  That is something that
>> will have to be addressed in a "transport independent" way or
>> dropped as part of the service.
>
>Good point. I will augment the Spec accordingly.
>
>>
>> The bottom line here is that it is NOT transport independent.
>
>The implementation is not transport independent.
>But the API allows one to write a Transport-specific ULP with full
>performance, as well as a Transport-independent ULP with better
>performance than without the proposed API, and with a "minimal"
>performance penalty on Transports that provide it natively.

Of course, you can make the application as transport-service-adaptive as
you want, but that is a weak argument and a slippery slope.  My point is
that the operational semantics of non-native immediate data transports
are identical to write/send in all respects.  So, embrace this and just
give the ULP a simple interface that has broader applicability across
all transports.  Provide a thread-atomic combined-request capability
which can be used for write completion notification (if not natively
supported) or any other purpose an application may fancy.

>
>>
>> Now, the atomicity argument between write and send has some
>> credibility.
>> If an application chooses to "adapt" to an explicit
>> write/send semantic for write completion notification in
>> environments that can't provide it natively, this could be
>> addressed by a generalized combined request API that can
>> guarantee thread-based atomicity to the send queue.  This
>> seems much more straightforward to me since, in essence, to
>> adapt to non-native immediate data services, applications would
>> have to allocate resources and behave in virtually the same way
>> as if they did write/send explicitly.
>>
>> It is obvious that the proposed service is not one of
>> immediate data in the sense defined by InfiniBand.  Since
>> true immediate data is a transport-specific speed-path
>> service, it needs to be implemented as a transport-specific
>> extension.  To allow an application to initiate multiple
>> request sequences that must be queued sequentially to
>> explicitly create a write completion notification or any
>> other order-based sequence, a generalized combined request
>> API should be defined.
>
>
>No disagreement here. We have been debating a generic way to combine
>multiple DTOs into a single call for some time.
>But how to define a generic way to do it, and to have a single
>completion on both ends of the connection in the successful case,
>was always a problem.

I would think an array of pointers to standard work requests, plus a
count, would do it.  And of course, each work request can control
whether it solicits a completion, so a write/send sequence can generate
a single completion event on both ends.  Use the EVD lock to guard
against other threads injecting requests on the queue during a combined
request operation, and the ULP has everything it needs.
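
In rough, purely hypothetical terms -- none of these names exist in
the spec, though the member types are standard DAT:

typedef struct dat_combined_wr {
    DAT_DTOS             dto_type;         /* DAT_DTO_RDMA_WRITE,
                                              DAT_DTO_SEND, ...       */
    DAT_COUNT            num_segments;
    DAT_LMR_TRIPLET      *local_iov;
    DAT_RMR_TRIPLET      remote_iov;       /* ignored for Sends       */
    DAT_DTO_COOKIE       user_cookie;
    DAT_COMPLETION_FLAGS completion_flags; /* per-WR solicitation     */
} DAT_COMBINED_WR;

/* Queue num_requests work requests back-to-back while holding the
 * request EVD lock, so no other thread's posts can interleave. */
DAT_RETURN dat_ep_post_combined(
    DAT_EP_HANDLE   ep_handle,
    DAT_COUNT       num_requests,
    DAT_COMBINED_WR **requests);

A write completion notification is then a two-element array: the Write
posted with DAT_COMPLETION_SUPPRESS_FLAG and the Send left to complete
normally, yielding exactly one event on each end.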

Roy

>
>>
>> >
>> >Arkady Kanevsky                       email: arkady at netapp.com
>> >Network Appliance Inc.               phone: 781-768-5395
>> >1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
>> >Waltham, MA 02451                   central phone: 781-768-5300
>> >
>> >
>>


