<html>

<body>

<font size=3><br>

As one of the authors of IB and iWARP, I can say that both Roland's and

Todd's responses are correct and reflect the intent of the specifications. 

The number of outstanding RDMA Reads is bounded, and that bound is communicated

during session establishment.  The ULP can choose to be aware of

this requirement (certainly when we wrote iSER and DA we were well aware

of the requirement, and we documented it as such in the ULP specs) and track

it from above so that it does not see a stall, or it can stay ignorant and

deal with the resulting stall.  This is a ULP choice, and it was

intentionally done that way so that the hardware can be kept as simple and

as low cost as possible while meeting the breadth of ULP needs

that were used to develop these technologies.   <br><br>

Tom, you raised this issue during iWARP's definition and the debate was

conducted at least several times.  The outcome of these debates is

reflected in iWARP and remains aligned with IB.  So, unless you

really want to have the IETF and IBTA go and modify their specs, I

believe you'll have to deal with the issue just as other ULPs do

today: be aware of the constraint and write the software

accordingly.  The open source community isn't really the right forum

to change iWARP and IB specifications at the end of the day.  Build

a case in the IETF and IBTA and let those bodies determine whether it is

appropriate to modify their specs or not.  And yes, it is a

modification of the specs and therefore of the hardware implementations as

well, and it would have to address any interoperability requirements that

result (the proposed change could fragment the hardware offerings, as there

are many thousands of devices in the market that would not necessarily support

this change).<br><br>

Mike<br><br>

<br><br>

<br>

At 12:07 PM 6/6/2006, Talpey, Thomas wrote:<br>

<blockquote type=cite class=cite cite="">Todd, thanks for the set-up. I'm

really glad we're having this discussion!<br><br>

Let me give an NFS/RDMA example to illustrate why this upper layer,<br>

at least, doesn't want the HCA doing its flow control, or resource<br>

management.<br><br>

NFS/RDMA is a credit-based protocol which allows many operations in<br>

progress at the server. Let's say the client is currently running with<br>

an RPC slot table of 100 requests (a typical value).<br>

<br>

Of these requests, some workload-specific percentage will be reads,<br>

writes, or metadata. All NFS operations consist of one send from<br>

client to server, some number of RDMA writes (for NFS reads) or<br>

RDMA reads (for NFS writes), then terminated with one send from<br>

server to client.<br><br>

The number of RDMA read or write operations per NFS op depends<br>

on the amount of data being read or written, and also the memory<br>

registration strategy in use on the client. The highest-performing<br>

such strategy is an all-physical one, which results in one RDMA-able<br>

segment per physical page. NFS r/w requests are, by default, 32KB,<br>

or typically 8 pages. So, typically 8 RDMA requests (read or write) are<br>

the result.<br><br>

To illustrate, let's say the client is processing a multi-threaded<br>

workload, with (say) 50% reads, 20% writes, and 30% metadata<br>

such as lookup and getattr. A kernel build, for example. Therefore,<br>

of our 100 active operations, 50 are reads for 32KB each, 20 are<br>

writes of 32KB, and 30 are metadata (non-RDMA). <br><br>

To the server, this results in 100 requests, 100 replies, 400 RDMA<br>

writes, and 160 RDMA Reads. Of course, these overlap heavily due<br>

to the widely differing latency of each op and the highly distributed<br>

arrival times. But, for the example this is a snapshot of current load.<br><br>
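As a rough check of the snapshot above, the counts can be reproduced in a few lines of Python. This is only a sketch: the 4KB page size is an assumption implied by "32KB, or 8 pages", and the function and variable names are made up for illustration.<br><br>

```python
# Hypothetical helper: reproduces the snapshot counts from the example.
PAGE_SIZE = 4 * 1024        # assumed page size (a 32KB request spans 8 pages)
REQUEST_SIZE = 32 * 1024    # default NFS r/w size from the example


def snapshot(nfs_reads, nfs_writes, metadata_ops):
    """Count the operations the server sees for one slot table of requests."""
    segments = REQUEST_SIZE // PAGE_SIZE   # one RDMA op per physical page
    total_ops = nfs_reads + nfs_writes + metadata_ops
    return {
        "requests": total_ops,                # one send per op, client to server
        "replies": total_ops,                 # one send per op, server to client
        "rdma_writes": nfs_reads * segments,  # NFS reads are served by RDMA Writes
        "rdma_reads": nfs_writes * segments,  # NFS writes are fetched by RDMA Reads
    }


counts = snapshot(nfs_reads=50, nfs_writes=20, metadata_ops=30)
print(counts)
```

With the 50/20/30 split this gives 100 requests, 100 replies, 400 RDMA Writes, and 160 RDMA Reads, matching the figures in the text.<br><br>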

The latency of the metadata operations is quite low, because lookup<br>

and getattr are acting on what is effectively cached data. The reads<br>

and writes however, are much longer, because they reference the<br>

filesystem. When disk queues are deep, they can take many ms.<br><br>

Imagine what happens if the client's IRD is 4 and the server ignores<br>

its local ORD. As soon as a write begins execution, the server posts<br>

8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads<br>

are sent, the fifth stalls, and stalls the send queue! Even when three<br>

RDMA Reads complete, the queue remains stalled; it doesn't unblock<br>

until the fourth is done and all the RDMA Reads have been initiated.<br><br>

But, what just happened to all the other server send traffic? All those<br>

metadata replies, and other reads which completed? They're stuck,<br>

waiting for that one write request. In my example, these number 99 NFS<br>

ops, i.e. 654 WRs! All for one NFS write! The client operation stream<br>

effectively became single threaded. What good is the "rapid initiation<br>

of RDMA Reads" you describe in the face of this?<br><br>
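The head-of-line blocking described above can be modelled in a few lines. This is a deliberately simplified sketch, assuming a strictly in-order send queue in which an RDMA Read beyond the limit blocks everything queued behind it; `issue_until_stall` and the WR labels are hypothetical, and completions are not modelled.<br><br>

```python
def issue_until_stall(send_queue, max_reads_in_flight):
    """Walk the send queue in order until a read exceeds the limit.

    Returns (issued, blocked): the WRs that went out on the wire and the
    WRs left waiting behind the stalled read.
    """
    reads_in_flight = 0
    issued = []
    for index, wr in enumerate(send_queue):
        if wr == "read":
            if reads_in_flight == max_reads_in_flight:
                # The read at the head stalls, and so does everything after it.
                return issued, send_queue[index:]
            reads_in_flight += 1
        issued.append(wr)
    return issued, []


# One NFS write (8 reads, then its reply) followed by three metadata replies.
queue = ["read"] * 8 + ["send"] + ["send"] * 3
issued, blocked = issue_until_stall(queue, max_reads_in_flight=4)
print(len(issued), len(blocked))
```

In this model four reads are issued, and the remaining four reads plus all four sends sit behind the stalled fifth read, even though the sends need no read resources at all.<br><br>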

Yes, there are many arcane and resource-intensive ways around it.<br>

But the simplest by far is to count the RDMA Reads outstanding, and<br>

for the *upper layer* to honor ORD, not the HCA. Then, the send queue<br>

never blocks, and the operation stream never loses parallelism. This<br>

is what our NFS server does.<br><br>
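A minimal sketch of that kind of upper-layer counting is shown here. It is not the NFS server's actual code: the `ReadThrottle` class, its method names, and the `post` callback are all hypothetical stand-ins for whatever the ULP uses to hand a WR to the send queue.<br><br>

```python
from collections import deque


class ReadThrottle:
    """Upper-layer bookkeeping that never posts more than ord_limit RDMA Reads."""

    def __init__(self, ord_limit, post):
        self.ord_limit = ord_limit
        self.post = post          # hypothetical callback: hands a WR to the send queue
        self.in_flight = 0
        self.deferred = deque()   # reads held back by the upper layer

    def post_read(self, wr):
        if self.in_flight >= self.ord_limit:
            self.deferred.append(wr)   # hold it here, not in the send queue
            return
        self.in_flight += 1
        self.post(wr)

    def post_other(self, wr):
        self.post(wr)                  # sends and RDMA Writes are never held

    def on_read_complete(self):
        self.in_flight -= 1
        if self.deferred:
            self.post_read(self.deferred.popleft())


posted = []
throttle = ReadThrottle(ord_limit=4, post=posted.append)
for i in range(8):
    throttle.post_read(f"read{i}")
throttle.post_other("reply-for-another-op")
throttle.on_read_complete()
print(posted)
```

Because the fifth through eighth reads wait in the upper layer's own list, the reply for an unrelated operation reaches the send queue immediately, and one deferred read is released each time a read completes.<br><br>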

As to the depth of IRD, this is a different calculation: it's the delay x bandwidth<br>

product of the RDMA Read stream. 4 is good for local, low latency connections.<br>

But over a complicated switch infrastructure, or heaven forbid a dark fiber<br>

long link, I guarantee it will cause a bottleneck. This isn't an issue except<br>

for operations that care, but it is certainly detectable. I would like to see<br>

if a pure RDMA Read stream can fully utilize a typical IB fabric, and how<br>

much headroom an IRD of 4 provides. Not much, I predict.<br><br>
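The delay x bandwidth calculation can be sketched as follows. The link speed, round-trip times, and 4KB read size are illustrative assumptions, not measurements from any fabric, and `reads_to_fill_pipe` is a hypothetical helper.<br><br>

```python
def reads_to_fill_pipe(bandwidth_bytes_per_s, round_trip_us, read_size_bytes):
    """Smallest number of outstanding reads that covers delay x bandwidth."""
    # Work in integers (bytes x microseconds) to avoid floating-point rounding.
    bytes_in_flight_scaled = bandwidth_bytes_per_s * round_trip_us
    read_size_scaled = read_size_bytes * 1_000_000
    return -(-bytes_in_flight_scaled // read_size_scaled)   # integer ceiling


# Assumed figures: a 1 GB/s link and 4KB reads (one page per segment).
local = reads_to_fill_pipe(1_000_000_000, 10, 4096)        # 10 us round trip
long_link = reads_to_fill_pipe(1_000_000_000, 1000, 4096)  # 1 ms round trip
print(local, long_link)
```

Under these assumptions 3 outstanding reads keep a 10 us path busy, while a 1 ms path needs 245, which is the sense in which a fixed IRD of 4 becomes a bottleneck as latency grows.<br><br>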

Closing the connection if IRD is "insufficient to meet goals" isn't a good<br>

answer, IMO. How does that benefit interoperability? <br><br>

Thanks for the opportunity to spout off again. Comments welcome!<br><br>

Tom.<br><br>

At 12:43 PM 6/6/2006, Rimmer, Todd wrote:<br>

><br>

><br>

>> Talpey, Thomas<br>

>> Sent: Tuesday, June 06, 2006 10:49 AM<br>

>> <br>

>> At 10:40 AM 6/6/2006, Roland Dreier wrote:<br>

>> >    Thomas> This is the difference between "may" and "must". The value<br>

>> >    Thomas> is provided, but I don't see anything in the spec that<br>

>> >    Thomas> makes a requirement on its enforcement. Table 107 says the<br>

>> >    Thomas> consumer can query it, that's about as close as it<br>

>> >    Thomas> comes. There's some discussion about CM exchange too.<br>

>> ><br>

>> >This seems like a very strained interpretation of the

spec.  For<br>

>> <br>

>> I don't see how strained has anything to do with it. It's not saying anything<br>

>> either way. So, a legal implementation can make either choice. We're<br>

>> talking about the spec!<br>

>> <br>

>> But, it really doesn't matter. The point is, an upper layer should be paying<br>

>> attention to the number of RDMA Reads it posts, or else suffer either the<br>

>> queue-stalling or connection-failing consequences. Bad stuff either way.<br>

>> <br>

>> Tom.<br>

><br>

>Somewhere beneath this discussion is a bug in the application or IB<br>

>stack.  I'm not sure which "may" in the spec you are referring to, but<br>

>the "may"s I have found all are for cases where the responder might<br>

>support only 1 outstanding request.  In all cases the negotiation<br>

>protocol must be followed and the requestor is not allowed to exceed the<br>

>negotiated limit.<br>

><br>

>The mechanism should be:<br>

>client queries its local HCA and determines responder resources (i.e.<br>

>number of concurrent outstanding RDMA reads on the wire from the remote<br>

>end where this end will respond with the read data) and initiator depth<br>

>(i.e. number of concurrent outstanding RDMA reads which this end can<br>

>initiate as the requestor).<br>

><br>

>client puts the above information in the CM REQ.<br>

><br>

>server similarly gets its information from its local CA and negotiates<br>

>down the values to the MIN of each side (REP.InitiatorDepth =<br>

>MIN(REQ.ResponderResources, server's local CA's initiator depth);<br>

>REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's<br>

>responder resources)).  If server does not support RDMA Reads, it can<br>

>REJ.<br>
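The MIN rule above can be written out as a small function. This is a sketch of the arithmetic only, not of any CM API; `negotiate_rep` and its parameter names are hypothetical, and the sample values are taken from the Mellanox defaults quoted later in this message (initiator depth 128, responder resources 4 to 8).<br><br>

```python
def negotiate_rep(req_initiator_depth, req_responder_resources,
                  local_initiator_depth, local_responder_resources):
    """Compute the REP values the server returns for a given REQ.

    The server may initiate no more reads than the client can respond to,
    and may respond to no more reads than the client will initiate.
    """
    rep_initiator_depth = min(req_responder_resources, local_initiator_depth)
    rep_responder_resources = min(req_initiator_depth, local_responder_resources)
    return rep_initiator_depth, rep_responder_resources


# Client advertises depth 128 / resources 4; server's CA has depth 128 / resources 8.
rep_initiator_depth, rep_responder_resources = negotiate_rep(
    req_initiator_depth=128, req_responder_resources=4,
    local_initiator_depth=128, local_responder_resources=8)
print(rep_initiator_depth, rep_responder_resources)
```

Here the server ends up allowed to initiate 4 reads and to respond to 8, so neither side is ever asked to hold more reads than it advertised.<br><br>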

><br>

>If client decides the negotiated values are insufficient to meet its<br>

>goals, it can disconnect.<br>

><br>

>Each side sets its QP parameters via modify QP appropriately.  Note they<br>

>too will be mirror images of each other:<br>

>client:<br>

>QP.Max RDMA Reads as Initiator = REP.ResponderResources<br>

>QP.Max RDMA reads as responder = REP.InitiatorDepth<br>

><br>

>server:<br>

>QP.Max RDMA Reads as responder = REP.ResponderResources<br>

>QP.Max RDMA reads as initiator = REP.InitiatorDepth<br>
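The mirror-image assignment listed above can be checked with a short sketch. The function name and dictionary keys are hypothetical labels for the two QP limits, not fields of any real verbs structure.<br><br>

```python
def qp_read_limits(role, rep_initiator_depth, rep_responder_resources):
    """Map the negotiated REP values onto one side's QP read limits."""
    if role == "client":
        return {"as_initiator": rep_responder_resources,
                "as_responder": rep_initiator_depth}
    if role == "server":
        return {"as_initiator": rep_initiator_depth,
                "as_responder": rep_responder_resources}
    raise ValueError("role must be 'client' or 'server'")


client = qp_read_limits("client", rep_initiator_depth=4, rep_responder_resources=8)
server = qp_read_limits("server", rep_initiator_depth=4, rep_responder_resources=8)
print(client, server)
```

Each side's initiator limit equals the other side's responder limit, which is the property that keeps a requestor from exceeding what its peer can service.<br><br>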

><br>

>We have done a lot of high stress RDMA Read traffic with Mellanox HCAs<br>

>and provided the above negotiation is followed, we have seen no issues.<br>

>Note however that by default a Mellanox HCA typically reports a large<br>

>InitiatorDepth (128) and a modest ResponderResources (4-8).  Hence when<br>

>I hear that Responder Resources must be grown to 128 for some<br>

>application to reliably work, it implies the negotiation I outlined<br>

>above is not being followed.<br>

><br>

>Note that the ordering rules in table 76 of IBTA 1.2 show how reads and<br>

>writes on a send queue are ordered.  There are many cases where an op can<br>

>pass an outstanding RDMA read, hence it is not always bad to queue extra<br>

>RDMA reads.  If needed, the Fence can be sent to force order.<br>

><br>

>For many apps, it's going to be better to get the items onto the queue and<br>

>let the QP handle the outstanding reads cases rather than have the app<br>

>add a level of queuing for this purpose.  Letting the HCA do the queuing<br>

>will allow for a more rapid initiation of subsequent reads.<br>

><br>

>Todd Rimmer<br><br>

<br>

_______________________________________________<br>

openib-general mailing list<br>

openib-general@openib.org<br>

<a href="http://openib.org/mailman/listinfo/openib-general" eudora="autourl">

http://openib.org/mailman/listinfo/openib-general</a><br><br>


</font></blockquote></body>

</html>