[openib-general] [PATCH] IB/ipoib: NAPI

Mon Sep 25 09:27:23 PDT 2006

> From: Michael S. Tsirkin
> Sent: Monday, September 25, 2006 10:54 AM
> To: Roland Dreier
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] [PATCH] IB/ipoib: NAPI
> 
> Quoting r. Roland Dreier <rdreier at cisco.com>:
> > Subject: Re: [PATCH] IB/ipoib: NAPI
> >
> >     Michael> Actually, the reason it is hard to come up with the name
> >     Michael> is that what this enables is the natural poll/request
> >     Michael> notification order.
> >
> > Over the weekend I thought about this and came up with an idea I
> > kind of like, inspired by Todd Rimmer's comments about poll-and-notify.
> >
> > We could change ib_req_notify_cq() to have an extra parameter:
> >
> > static inline int ib_req_notify_cq(struct ib_cq *cq,
> > 				   enum ib_cq_notify cq_notify,
> > 				   int *lost_event_possible)
> >
> > and if non-NULL is passed in for lost_event_possible, then
> > req_notify_cq should do the equivalent of a CQ peek after arming the
> > CQ event.
> 
> I thought about this too.
> 
> But this has a disadvantage over the device-wide flag: when the flag is
> device-wide, we can just have 2 polling routines - with and without
> peek - and select the correct one at device open depending on the
> hardware capabilities.
> Thus we can avoid a conditional branch on the fast path,
> which I think is nice.
> 
> So I think if we want to enable an mthca-specific optimization,
> the right way is with device flags.
> 
> On a separate note - ib_req_notify_cq is also testing the
> lost_event_possible flag - so now we have 2 conditional branches on
> the fast path, and this hurts all ULPs. Ugh.
> 
> If we extend the interface, I would rather make a new call
> 	ib_req_notify_and_peek_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
> that returns 0 on empty CQ, 1 on non-empty and negative on error.
> 
> --
> MST
> 

It's inefficient to peek the CQ if the next operation is likely to be a
poll.  Performing the poll_and_notify in one call is more efficient.

Then if you use poll_and_notify instead of poll_cq in the polling loops,
you can also be equally efficient for all HCA models without needing a
hardware capability flag and 2 polling algorithms in each ULP.  Instead
the HCA driver naturally provides the most efficient approach and all
callers use the same algorithm.
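
To make the caller side concrete, here is a minimal sketch of what such a
ULP polling loop could look like.  Everything in it is hypothetical:
mock_cq, mock_poll_and_notify and the PN_* return codes are stand-ins
invented for illustration, not existing verbs, and the "driver" is a toy
that only counts calls.

```c
/* Hypothetical mock CQ; not the real struct ib_cq. */
struct mock_cq {
	int pending;       /* completions waiting in the CQ */
	int armed;         /* set once the CQ has been rearmed */
	int driver_calls;  /* calls made into the mock driver */
};

/* Hypothetical return codes for the combined call. */
enum { PN_DONE = 0, PN_AGAIN = 1 };

/*
 * Hypothetical combined verb: hands back one CQE if available and asks
 * the caller to call again; otherwise rearms the CQ and reports done.
 */
static int mock_poll_and_notify(struct mock_cq *cq, int *wc)
{
	cq->driver_calls++;
	if (cq->pending > 0) {
		*wc = cq->pending--;   /* stand-in for a work completion */
		return PN_AGAIN;
	}
	cq->armed = 1;
	return PN_DONE;
}

/* ULP loop: the same code regardless of HCA model. */
static int ulp_drain(struct mock_cq *cq)
{
	int wc, handled = 0;

	while (mock_poll_and_notify(cq, &wc) == PN_AGAIN)
		handled++;         /* process wc here */
	return handled;
}
```

The point of the sketch is that the ULP has no capability test and no
second polling routine; any HCA-specific work stays behind the one call.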

In the examples below, let's assume 2 CQEs are returned, then the CQ is
rearmed and is still empty afterward.

For example on Mellanox HCAs the actual sequence would be:
	poll_and_notify
		returns a CQE, tells caller to call it again
	poll_and_notify
		returns a CQE, tells caller to call it again
	poll_and_notify
		finds CQ empty, rearms CQ, tells caller it's done
		[note no peek needed]
3 driver calls, 3 CQE accesses, 1 rearm

For other HCAs the actual sequence would be:
	poll_and_notify
		returns a CQE, tells caller to call it again
	poll_and_notify
		returns a CQE, tells caller to call it again
	poll_and_notify
		finds CQ empty, rearms CQ, peeks CQ
		if CQ empty, tells caller it's done [for this example, it's true]
		if CQ not empty, tells caller to loop on poll_cq
3 driver calls, 4 CQE accesses, 1 rearm
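
The two sequences above can be reproduced with a toy driver-side model.
This is only a sketch under stated assumptions: mock_hca_cq, the
needs_peek field and the counters are hypothetical, and a real driver
would read hardware CQEs and ring a doorbell rather than bump integers.

```c
/* Hypothetical mock of a per-HCA CQ; not a real driver structure. */
struct mock_hca_cq {
	int pending;       /* completions waiting in the CQ */
	int needs_peek;    /* 0: Mellanox-like, 1: must peek after rearm */
	int cqe_accesses;  /* times the next CQE slot was examined */
	int rearms;        /* times the CQ was rearmed */
};

enum { PN_DONE = 0, PN_AGAIN = 1 };

/* Look at the next CQE slot; counts as one CQE access. */
static int cq_has_entry(struct mock_hca_cq *cq)
{
	cq->cqe_accesses++;
	return cq->pending > 0;
}

/* Hypothetical driver entry point; *got is 1 if a CQE was consumed. */
static int mock_poll_and_notify(struct mock_hca_cq *cq, int *got)
{
	*got = 0;
	if (cq_has_entry(cq)) {
		cq->pending--;
		*got = 1;
		return PN_AGAIN;
	}
	cq->rearms++;                  /* CQ empty: rearm it */
	if (!cq->needs_peek)
		return PN_DONE;        /* no lost-event window to check */
	/* Other HCAs: peek once to catch a CQE that raced the rearm. */
	return cq_has_entry(cq) ? PN_AGAIN : PN_DONE;
}

/* Drain the CQ and return the number of driver calls made. */
static int drain(struct mock_hca_cq *cq)
{
	int calls = 0, got;

	do {
		calls++;
	} while (mock_poll_and_notify(cq, &got) == PN_AGAIN);
	return calls;
}
```

Starting with 2 pending CQEs, the model takes 3 calls either way, and
the only difference is the extra CQE access for the peek.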

In comparison the present code (or with a device capability flag) is:
	poll_cq
		returns a CQE
	poll_cq
		returns a CQE
	poll_cq
		finds CQ empty
	notify_cq
		rearms CQ
	if non-Mellanox HCA
		poll_cq - finds CQ empty
4-5 driver calls, 3-4 CQE accesses, 1 rearm
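
For comparison, the same toy model applied to the present
poll/notify/re-poll pattern.  Again a sketch only: mock_cq,
mock_poll_cq and mock_req_notify_cq are hypothetical stand-ins for
ib_poll_cq and ib_req_notify_cq, and needs_repoll plays the role of the
device capability flag.

```c
/* Hypothetical mock CQ; not the real struct ib_cq. */
struct mock_cq {
	int pending;       /* completions waiting in the CQ */
	int needs_repoll;  /* 1 for HCAs that can lose an event on rearm */
	int driver_calls;  /* calls made into the mock driver */
	int cqe_accesses;  /* times the next CQE slot was examined */
	int rearms;        /* times the CQ was rearmed */
};

/* Stand-in for ib_poll_cq with one entry: 1 if a CQE was consumed. */
static int mock_poll_cq(struct mock_cq *cq)
{
	cq->driver_calls++;
	cq->cqe_accesses++;
	if (cq->pending > 0) {
		cq->pending--;
		return 1;
	}
	return 0;
}

/* Stand-in for ib_req_notify_cq: rearms the CQ. */
static void mock_req_notify_cq(struct mock_cq *cq)
{
	cq->driver_calls++;
	cq->rearms++;
}

/* ULP loop as in the present code; returns the CQEs handled. */
static int ulp_drain_present(struct mock_cq *cq)
{
	int handled = 0;

	for (;;) {
		while (mock_poll_cq(cq))
			handled++;
		mock_req_notify_cq(cq);
		if (!cq->needs_repoll)
			break;             /* Mellanox-like: done */
		if (!mock_poll_cq(cq))
			break;             /* re-poll found CQ empty */
		handled++;                 /* raced CQE: keep polling */
	}
	return handled;
}
```

With 2 pending CQEs this gives 4 driver calls on the Mellanox-like
model and 5 on the other, matching the counts above.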

With notify with an internal peek (lost event flag approach) it's:
	poll_cq
		returns a CQE
	poll_cq
		returns a CQE
	poll_cq
		finds CQ empty
	notify_cq
		rearms CQ; for non-Mellanox HCA, peeks CQ - finds CQ empty
	if lost events indicated [for this example it's false]
		poll_cq until empty
4-5 driver calls, 3-4 CQE accesses, 1 rearm

Hence, for all HCA models, the poll_and_notify approach makes fewer
driver calls (3 in the above example, compared to at least 4 for the
other approaches).

In general, driver calls are going to be the expensive factor in this
comparison.  The main difference among the above examples will be how
many times the spin_lock for the CQ is taken.

Depending on HCA design, the poll_cq and/or notify_cq and/or peek_cq
operations may also incur an expensive PCI bus read or write.  However,
with the exception of the notify w/peek approach, those costs are the
same for all the above examples.

In the case (not shown above) where there was 1 additional CQE found
after the rearm [applicable only to non-Mellanox HCAs], the
poll_and_notify approach will also save 1 CQE access as compared to the
notify w/internal peek approach.

Todd Rimmer