[openib-general] Re: [PATCH 2/6] [RFC] mthca kernel changes for resizeCQ

Mon Jan 30 10:02:18 PST 2006

openib-general-bounces at openib.org wrote:

> 
> I think I see a problem with this approach: if a ULP performs
> CQ poll while mthca_RESIZE_CQ is in progress, it might get a
> false indication that the CQ is empty since CQEs are being
> written to the new buffer already.
> 
> As a result e.g. a ULP that does:
> 
> arm
> while (poll cq) {
>    handle cqe
> }
> 
> will not empty the CQ and will deadlock.

Allowing resizes of an active CQ is very tricky, and has many
caveats. It is also, in my opinion, something that applications
should not be encouraged to do unless CQ size is truly limited
by on-chip resources.

When the CQ size is not constrained by on-chip resources, but
only by the amount of user-space memory that can be allocated
to the CQ, then the application should be encouraged to size
the CQ to the maximum desired from the very beginning.

I'm not saying that resize should not be supported, just that
the application developer should also have a device attribute
that tells it when resizing is best avoided AND not truly
needed in the first place.
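
To make the attribute idea concrete, here is a minimal sketch of how an
application might use it. Everything in it is hypothetical: the struct
sim_device_attr, its cq_on_chip_limited and cq_resize_supported fields,
and the initial_cq_size() helper are names invented for illustration and
are not part of any existing verbs interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device attributes; invented for this sketch only. */
struct sim_device_attr {
    size_t max_cqe;             /* largest CQ the device can create */
    bool   cq_on_chip_limited;  /* CQ entries consume scarce on-chip resources */
    bool   cq_resize_supported; /* device/driver implements CQ resize */
};

/*
 * Choose how many entries to request when the CQ is created.
 * needed_now:  entries the application needs at startup
 * max_desired: the most entries it expects to ever want
 */
static size_t initial_cq_size(const struct sim_device_attr *attr,
                              size_t needed_now, size_t max_desired)
{
    size_t want;

    if (attr->cq_on_chip_limited && attr->cq_resize_supported)
        want = needed_now;   /* scarce resource: start small, grow later */
    else
        want = max_desired;  /* only host memory at stake: size for the maximum */

    /* Never ask for more than the device can provide. */
    return want < attr->max_cqe ? want : attr->max_cqe;
}
```

With attributes like these, an application on a device whose CQs live in
ordinary memory would never call resize at all, and the driver for such
a device would be free to leave it unimplemented.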

Atomically transitioning a CQ to an alternate buffer with
no potential for a false empty is simply not possible without
putting extra checks in cq_poll() that have to be paid
for on every cq_poll().
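
A rough illustration of that cost, as a single-threaded model rather
than actual mthca code. The struct sim_cq layout, the valid flag, and
both poll functions are assumptions made up for this sketch. It also
assumes that the device has stopped writing to the old buffer once a
resize is pending and that it fills the new buffer starting at index 0.
A real driver would additionally need locking against a concurrent
resize, which the sketch leaves out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified completion entry; not the mthca CQE format. */
struct cqe {
    bool valid;  /* set by the "device" when the entry is written */
    int  wr_id;  /* identifies the completed work request */
};

struct sim_cq {
    struct cqe *buf;         /* buffer currently being polled */
    size_t      size;        /* number of entries in buf */
    size_t      cons;        /* consumer index into buf */
    struct cqe *resize_buf;  /* non-NULL only while a resize is pending */
    size_t      resize_size; /* number of entries in resize_buf */
};

/* Fixed-size poll: a single validity test, no knowledge of resizing. */
static bool poll_fixed(struct sim_cq *cq, int *wr_id)
{
    struct cqe *e = &cq->buf[cq->cons % cq->size];

    if (!e->valid)
        return false;        /* looks empty, even if resize_buf has entries */
    *wr_id = e->wr_id;
    e->valid = false;        /* hand the slot back to the "device" */
    cq->cons++;
    return true;
}

/* Resize-aware poll: an apparently empty CQ needs a second look. */
static bool poll_resizable(struct sim_cq *cq, int *wr_id)
{
    if (poll_fixed(cq, wr_id))
        return true;
    if (cq->resize_buf == NULL)   /* the extra check a fixed CQ never pays */
        return false;

    /* Old buffer is drained: switch to the new one and poll again. */
    cq->buf = cq->resize_buf;
    cq->size = cq->resize_size;
    cq->cons = 0;
    cq->resize_buf = NULL;
    return poll_fixed(cq, wr_id);
}
```

In this model poll_fixed() reports a false empty whenever the only
pending entries sit in resize_buf, which is the deadlock scenario
quoted above; poll_resizable() avoids it at the price of the extra
test and the indirection through the buffer pointer.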

What exactly is the device level code *required* or expected
to do? Since different devices have different methods of
interacting between device, verbs and driver it is very
important that the *requirements* and expectations be
stated. Should the verbs be coded to enable CQ resize
even if doing so adds a layer of indirection to the
data structures, needs an extra check, and/or requires
an extra lock that would not have been needed on a
fixed size CQ? As I stated above, when the CQ size is
not constrained by on-chip resources, I believe the
correct answer is that cq_poll should be made as
efficient as possible even if that means that cq_resize
will not be supported. Is that what middleware should
assume the devices are doing? Or do we need middleware
to be aware that there are devices where cq resizing
is needed and optimized as well as those where resizing
is irrelevant and not supported?

What we want to discourage is devices where resize is
needed but not supported, and devices where resize is
not needed but cq_poll is slowed down in order to
support resizing.