[openib-general] Re: Mellanox HCAs: outstanding RDMAs
Rimmer, Todd
trimmer at silverstorm.com
Tue Jun 6 09:43:23 PDT 2006
> Talpey, Thomas
> Sent: Tuesday, June 06, 2006 10:49 AM
>
> At 10:40 AM 6/6/2006, Roland Dreier wrote:
> > Thomas> This is the difference between "may" and "must". The value
> > Thomas> is provided, but I don't see anything in the spec that
> > Thomas> makes a requirement on its enforcement. Table 107 says the
> > Thomas> consumer can query it, that's about as close as it
> > Thomas> comes. There's some discussion about CM exchange too.
> >
> >This seems like a very strained interpretation of the spec. For
>
> I don't see how strained has anything to do with it. It's not saying
> anything either way. So, a legal implementation can make either choice.
> We're talking about the spec!
>
> But, it really doesn't matter. The point is, an upper layer should be
> paying attention to the number of RDMA Reads it posts, or else suffer
> either the queue-stalling or connection-failing consequences. Bad stuff
> either way.
>
> Tom.
Somewhere beneath this discussion is a bug in the application or IB
stack. I'm not sure which "may" in the spec you are referring to, but
the "may"s I have found all are for cases where the responder might
support only 1 outstanding request. In all cases the negotiation
protocol must be followed and the requestor is not allowed to exceed the
negotiated limit.
The mechanism should be:
1. The client queries its local HCA and determines its responder resources
(i.e. the number of concurrent outstanding RDMA reads on the wire from the
remote end, to which this end will respond with the read data) and its
initiator depth (i.e. the number of concurrent outstanding RDMA reads which
this end can initiate as the requestor).
2. The client puts the above information in the CM REQ.
3. The server similarly gets its information from its local CA and
negotiates the values down to the MIN of each side:
    REP.InitiatorDepth = MIN(REQ.ResponderResources, server's local CA's initiator depth)
    REP.ResponderResources = MIN(REQ.InitiatorDepth, server's local CA's responder resources)
If the server does not support RDMA Reads, it can REJ.
4. If the client decides the negotiated values are insufficient to meet its
goals, it can disconnect.
5. Each side sets its QP parameters via modify QP appropriately. Note they
too will be mirror images of each other:
client:
    QP.Max RDMA Reads as initiator = REP.ResponderResources
    QP.Max RDMA Reads as responder = REP.InitiatorDepth
server:
    QP.Max RDMA Reads as responder = REP.ResponderResources
    QP.Max RDMA Reads as initiator = REP.InitiatorDepth
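The negotiation above can be sketched in a few lines of Python. This is only a model of the arithmetic, not real CM or verbs code: the RdmaReadLimits class and the function names are made up for illustration, the CA values in the example are arbitrary, and the REJ and disconnect paths are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RdmaReadLimits:
    """RDMA Read limits as seen from one end (hypothetical helper type)."""
    responder_resources: int  # inbound reads this end can service at once
    initiator_depth: int      # outbound reads this end can have in flight


def build_req(client_ca: RdmaReadLimits) -> RdmaReadLimits:
    """Client copies the limits of its local CA into the CM REQ."""
    return RdmaReadLimits(client_ca.responder_resources,
                          client_ca.initiator_depth)


def build_rep(req: RdmaReadLimits, server_ca: RdmaReadLimits) -> RdmaReadLimits:
    """Server negotiates each value down to the MIN of the two sides."""
    return RdmaReadLimits(
        # Reads the server will service: bounded by what the client can issue.
        responder_resources=min(req.initiator_depth,
                                server_ca.responder_resources),
        # Reads the server will issue: bounded by what the client can service.
        initiator_depth=min(req.responder_resources,
                            server_ca.initiator_depth),
    )


def client_qp_limits(rep: RdmaReadLimits) -> tuple:
    """Return (max reads as initiator, max reads as responder) for the client."""
    return rep.responder_resources, rep.initiator_depth


def server_qp_limits(rep: RdmaReadLimits) -> tuple:
    """Return (max reads as initiator, max reads as responder) for the server."""
    return rep.initiator_depth, rep.responder_resources


# Arbitrary example values: client CA services 4 reads, server CA services 8.
req = build_req(RdmaReadLimits(responder_resources=4, initiator_depth=128))
rep = build_rep(req, RdmaReadLimits(responder_resources=8, initiator_depth=128))
print(client_qp_limits(rep))  # (8, 4)
print(server_qp_limits(rep))  # (4, 8)
```

The two tuples are mirror images: whatever one side may initiate is exactly what the other side has agreed to service.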
We have done a lot of high stress RDMA Read traffic with Mellanox HCAs
and provided the above negotiation is followed, we have seen no issues.
Note however that by default a Mellanox HCA typically reports a large
InitiatorDepth (128) and a modest ResponderResources (4-8). Hence when
I hear that Responder Resources must be grown to 128 for some
application to reliably work, it implies the negotiation I outlined
above is not being followed.
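As a rough worked example of that mismatch, assume both ends report an InitiatorDepth of 128 and ResponderResources of 4 (the low end of the range mentioned above). The helper name below is hypothetical and the code only does the arithmetic:

```python
def negotiated_read_limit(local_initiator_depth: int,
                          peer_responder_resources: int) -> int:
    """Reads the requestor may keep outstanding after proper negotiation."""
    return min(local_initiator_depth, peer_responder_resources)


initiator_depth = 128      # assumed value reported by the requestor's HCA
responder_resources = 4    # assumed value reported by the responder's HCA

limit = negotiated_read_limit(initiator_depth, responder_resources)
excess = initiator_depth - limit

print(limit)   # 4
print(excess)  # 124
```

A requestor that skips the negotiation and programs its QP with the raw local value would be prepared to issue 124 more concurrent reads than the responder agreed to service, which is the situation that growing Responder Resources to 128 papers over.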
Note that the ordering rules in table 76 of IBTA 1.2 show how reads and
writes on a send queue are ordered. There are many cases where an op can
pass an outstanding RDMA read, hence it is not always bad to queue extra
RDMA reads. If needed, a Fence can be used to force ordering.
For many apps, it's going to be better to get the items onto the queue and
let the QP handle the outstanding-read limit rather than have the app
add a level of queuing for this purpose. Letting the HCA do the queuing
will allow for a more rapid initiation of subsequent reads.
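To picture what "letting the QP handle it" means, here is a toy model of a queue that accepts any number of posted reads but only keeps the negotiated number in flight. It is purely illustrative: ToyReadQueue is an invented class, it models RDMA reads only, and it says nothing about how a real HCA schedules other work requests around them.

```python
from collections import deque


class ToyReadQueue:
    """Toy model: posted reads wait until an outstanding-read slot frees up."""

    def __init__(self, max_outstanding: int):
        self.max_outstanding = max_outstanding  # negotiated limit
        self.pending = deque()                  # posted, not yet on the wire
        self.in_flight = set()                  # on the wire, awaiting data

    def post_read(self, read_id: int) -> None:
        """Application posts a read without tracking the limit itself."""
        self.pending.append(read_id)
        self._launch()

    def complete(self, read_id: int) -> None:
        """A read response arrived; the freed slot is reused immediately."""
        self.in_flight.remove(read_id)
        self._launch()

    def _launch(self) -> None:
        # Start queued reads, oldest first, while slots are available.
        while self.pending and len(self.in_flight) < self.max_outstanding:
            self.in_flight.add(self.pending.popleft())


q = ToyReadQueue(max_outstanding=2)
for read_id in range(5):
    q.post_read(read_id)
print(sorted(q.in_flight), list(q.pending))  # [0, 1] [2, 3, 4]
q.complete(0)
print(sorted(q.in_flight), list(q.pending))  # [1, 2] [3, 4]
```

In the model the next read starts in the same step that a completion frees a slot. An application-level queue would first have to see the completion and then post the next read, which is the extra delay referred to above.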
Todd Rimmer