[ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V3] patch for review

Wed May 2 17:30:06 PDT 2007

Firstly thanks for the review Michael. My responses/questions below, and 
yes will fix some of the style issues that
you have pointed out. The new functions (and labels) had the srq/nosrq 
suffxes for mainatainability purposes
and also to keep a structure similar to the current IPOIB.

Pradeep
pradeep at us.ibm.com

"Michael S. Tsirkin" <mst at dev.mellanox.co.il> wrote on 05/01/2007 11:46:48 
PM:

> OK, we are making progress (line-wrapping issues aside :). And there 
seems to
> be some whitespace damage, too. Pls take care of this.
> 
> I think the handle_rx_wc split is going in the right direction,
> but let's take this through all the datapath.
> 
> I went over the patch in a bit more depth, and I have some questions:
> 
> > +   for (i = 0; i < ipoib_recvq_size; ++i) {
> > +      if (!ipoib_cm_alloc_rx_skb(dev, i << 32 | index,
> 
> ...
> 
> > +      if (ipoib_cm_post_receive(dev, i << 32 | index)) {
> 
> 1. It seems there are multiple QPs mapped to a single CQ -
>    and each gets ipoib_recvq_size recv WRs above.
>    Is that right? How do you prevent CQ overrun then?

Good point! Looking at the IB spec it appears that upon CQ overflow
it results in a Local Work Queue catastrophic error and puts the QP
(receiver side) in an error state. Hence, I am speculating that the 
sending side will see an error. This will result in the sending side 
destroying the QP and sending a DREQ message which, will remove the 
receive side QP.

A new set of QPs will be created on the send side (this is RC) and
the connection setup starts over again. It will continue, but at a
degraded rate. Is this correct? What other alternative do you suggest
-create a CQ per QP? Is the max number of CQs an issue to consider, if 
we adopt this approach?

> 
> > +   /* Find an empty rx_index_ring[] entry */
> > +   for (index = 0; index < NOSRQ_INDEX_RING_SIZE; index++)
> > +      if (priv->cm.rx_index_ring[index] == NULL)
> > +         break; 
> > +
> > +   if ( index == NOSRQ_INDEX_RING_SIZE) {
> > +      spin_unlock_irq(&priv->lock);
> > +      printk(KERN_WARNING "NOSRQ supports a max of %d RC "
> > +             "QPs. That limit has now been reached\n",
> > +             NOSRQ_INDEX_RING_SIZE);
> > +      return -EINVAL;
> > +   }
> 
> 2. So, when QP limit has been reached, remote side will get
>    a reject with custom reject reason?
>    Is so, it seems that since the remote does not know
>    what the reason for reject is, it'll just retry
>    on the next packet, and again and again. Basically,
>    connectivity is denied where it previously worked fine
>    by falling back on datagram mode?
> 

Good point again! Yes, this would be an apt description. However,
I have few questions (see below)

>    One way to fix this, could be to try and use a reject reason
>    that will tell the remote "I'm busy, switch to datagram mode
>    for a loooooong time". Using path mtu discovery here might be useful
>    to actually have it come back and retry after several minutes.
> 

How does one send a reject reason -through CM? I could unset the bit 
IPOIB_FLAG_ADMIN_CM in flag, but will that not transition all the QPs
to datagram mode. What we need is a mechanism that will let the current
set of QPs be in connected mode, and transition only the new ones to 
datagram mode if connected mode cannot be supported as in this case.
How to do that?

>    *In theory*, we could get this even with SRQ -
>    if the *HCA* starts running out of RC QPs - it is just
>    never happening in practice as current HCAs support #QPs larger
>    than a maximum IB subnet size.
>    So I might post a patch to implement this, stay tuned.

This will be interesting.

> 
> > +   spin_lock_irqsave(&priv->lock, flags);
> > +   rx_ptr = priv->cm.rx_index_ring[index];
> > +   spin_unlock_irqrestore(&priv->lock, flags);
> 
> 3. You never actually test the rx_ptr that you got.
>    So why does locking help?
>    A better way to destroy QPs might be to move it to error state first.

In ipoib_cm_stale_task(): priv->cm.rx_index_ring[p->index] = NULL;
this assignment does happen under lock. All I need to do (in the code 
snippet 
above you point out) is check if rx_pt == NULL, if so drop the packet.
I did think about this one, but never implemented it.

> 
>    We actually need something like this for CM too - stay tuned for a 
patch.
> 
> I also commented on some style issues below.
> 
> > Note 1: I have retained the code to avoid IB_WC_RETRY_EXC_ERR 
> while performing
> > interoperability tests As discussed in this mailing list that may 
> be a CM bug or
> > have the various HCA address it. Hence I would like to seperate 
> out that issue
> > from this patch.
> > At a future point when the issue gets resolved I can provide
> > another patch to change the retry_count values back to 0 if need be.
> 
> The correct way to separate it, in my opinion, is to set retry_count = 
0,
> and (for now) apply a work-around patch at your site before testing.
> We really don't want to paper over this bug, in my opinion.

Ok, will reset this back to 0, but that is not (my) preferred way. If some
one were to pick up the code and try it with retry_count=0, the HCAs will 
not inter-operate as is. Hence the hesitation.

> > 
> >  struct ipoib_cm_tx {
> > @@ -177,6 +185,7 @@ struct ipoib_cm_dev_priv {
> >     struct ib_wc            ibwc[IPOIB_NUM_WC];
> >     struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
> >     struct ib_recv_wr       rx_wr;
> > +   struct ipoib_cm_rx   **rx_index_ring;
> >  };
> > 
> >  /*
> 
> Isn't "ring" a bit of a misnomer?

Yes, this is a misnomer. This is a vestige of an an earlier thing that
I thought of. Will change it to something else more appropriate.

> > +   unsigned long flags;
> > +
> > +   index = id  & NOSRQ_INDEX_MASK ;
> > +   wr_id = id >> 32;
> 
> So wr_id has always, ever, 32 lower bits set - why make it u64 then?

Because I later use it as wr_id << 32 | index | IPOIB_CM_OP_NOSRQ.
I could have used index | IPOIB_CM_OP_NOSRQ instead.

> 
> > +   /* There is a slender chance of a race between the stale_task
> > +    * running after a period of inactivity and the receipt of
> > +    * a packet being processed at about the same instant. 
> > +    * Hence the lock */
> 
> I think you can get rid of this, by changing the stale task code:
> move QP to error, and wait for WRs posted to complete.
> Then there won't be any more completions for this QP.
> 
> As it is, I'm not convinced you can't get a completion after
> QP has been removed out of the array - so it seems the race hasn't
> been solved here?

We have discussed this above.

> 
> We actually need something like this for CM too -
> stay tuned for a patch.
> 
> > +   spin_lock_irqsave(&priv->lock, flags);
> > +   rx_ptr = priv->cm.rx_index_ring[index];
> > +   spin_unlock_irqrestore(&priv->lock, flags);
> > +
> > +   priv->cm.rx_wr.wr_id = wr_id << 32 | index | IPOIB_CM_OP_NOSRQ;
> 
> Isn't this just id, again?

This is id | IPOIB_CM_OP_NOSRQ.

> > +static int ipoib_cm_post_receive(struct net_device *dev, u64 id)
> > +{
> > +   struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +   int ret;
> > +
> > +   if (priv->cm.srq) 
> > +      ret = post_receive_srq(dev, id);
> > +   else 
> > +      ret = post_receive_nosrq(dev, id);
> > +
> > +   return ret;
> > +}
> 
> I think you can split this one now that srq/nonsrq completions are
> handled separately.

I don't understand this commennt.