[openib-general] IPoIB connected mode review comments

Sun Feb 4 04:58:20 PST 2007

> Quoting Steve Wise <swise at opengridcomputing.com>:
> Subject: IPoIB connected mode review comments
> 
> On Thu, 2007-02-01 at 20:48 -0800, Roland Dreier wrote:
> >  > Have you had a chance to review this?
> > 
> > Still on my list.
> > 
> > Can we trade?  Can you look at the IPoIB connected mode stuff in the
> > ipoib-cm branch in
> > 
> >     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git
> > 
> > and let me know if you see anything you don't like?
> > 
> >  - R.
> 
> Here are my comments.  I'm not an ib cm expert though.  These are mostly
> questions:

Steve, thanks for looking at the code!
I hope the following answers your questions.

> 
> Since IPoIB is using IP addresses already, wouldn't it be simpler to use
> the rdma cm to setup connections?  

IPoIB is not using IP addresses. It uses hardware addresses as any network
device would. So using rdma cm does not make sense.

> Could you optimize this design and only signal some of the tx wrs?

This optimization would apply to UD mode too.
No one so far came up with a way to do this cleanly.

> In ipoib_cm_send() you call ipoib_cm_skb_too_long() if the packet is too
> large for the interface mtu.  And you print a warning.  But
> ipoib_cm_skb_too_long() actually queues the packet for the cm case.  For
> ud it just drops the packet.  The skb task for cm then will send a
> ICMP_DEST_UNREACH for these packets.  Why the difference?

For UD I just kept the current behaviour - I think
this can actually only happen in case of a race when packet was queued
before MTU was changed, so the originator was already notified of
the MTU change by the stack above us.

For CM the local MTU may exceed the size of a buffer that was posted on
the remote QP. So we need to send ICMP_DEST_UNREACH to reduce the
originator's dest MTU to whatever this QP actually can support.
Since this needs the original skb, and must be done from task or bh context,
so we queue the skb and handle it in task context.

> Also if this
> packet came from the local stack via a local application, you don't want
> to send  DEST_UNREACH, right?  (I'm probably just confused about the
> purpose of this).

Yes, sending DEST_UNREACH does not seem to affect local interface.  That's why
I call update_pmtu too.  It is also good to update the MTU ASAP to reduce the
number lot of packets that are dropped - and update_pmtu can be called from
atomic context. I do not know how to tell the packet is from local
stack and it does not seem to do any harm to handle all packets in a uniform
manner.

net/ipv4/ip_gre.c and net/ipv4/ipip.c are examples of code that do something
similiar.

> In ipoib_cm_tx_completion() you rearm, then drain the cq.  I thought
> there was some reason that it was better to do drain/rearm/drain?
> Something about if you rearm and there's a cq entry mthca does another
> immediate interrupt?  

Again, this comment applies to UD mode as well.
AFAIK so far this worked best.

> In ipoib_cm_handle_tx_wc():
> 
> When can a tx completion happen with a wr_id that isn't within the
> ipoib_sendq_size range?  This looks like it is really a bug condition
> that should never happen.

Because of this:
	post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1))
so wr_id is always within range.
Again, this is exactly the same logic as in UD case.

> I see the same code in the rx completion path too.  

It's even simpler there:
+       for (i = 0; i < ipoib_recvq_size; ++i) {

...

+               if (ipoib_cm_post_receive(dev, i)) {

...

+               }
+       }

So i is always within RX size range.

> Also, what's up with the /* FIXME */ comment?

Since I have QPs which I never post send WRs on, I should be able to set
.cap.max_send_wr to 0 and .cap.max_send_sge should not matter.

However, low level drivers do not seem to support this at the moment, so
I set these to 1 for now - this is also correct but has a small memory cost. 

> You lock the priv->lock inside of the priv->tx_lock.  Is this ordering
> correct and consistent across all the code?

Yes, that's the nesting rule.

> ipoib_cm_handle_rx_wc() - what's up with the XXX comment?

We have the same comment in UD code - that's where this comes from.
Basically we don't have an easy way to know the correct packet type,
and always setting it to PACKET_HOST seems to work.

> What's the algorithm to keep enough buffers posted in the SRQ?

Same as with UD really - if I can't allocate a new skb I repost
the old one and increment the dropped packet counter.

-- 
MST