[ofa-general] Re: IPoIB CM (NOSRQ) [PATCH 1] review

Sean Hefty mshefty at ichips.intel.com
Tue Sep 18 12:34:29 PDT 2007


> A linear search accommodates intermittent connection and disconnection of
> QPs, if ever there is a need. This search does not happen in the packet
> receive path and hence will not pose performance issues.

I'm not sure how frequently connections are made, or if they're
long-lived.  Avoiding the linear search costs an extra integer per QP,
but I do think this optimization can come later.
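For what it's worth, a rough sketch of what that extra integer could buy -
a free list threaded through the index table so connection setup pops a
slot in O(1) instead of scanning (this is my guess at the approach;
rx_free_head and rx_next_free are made-up names, not fields in the patch):

	/* Sketch only: thread a free list through the unused slots of
	 * priv->cm.rx_index_table.  rx_next_free[] is the extra integer
	 * per QP; rx_free_head is the first unused slot, or max_rc_qp
	 * when the table is full. */
	static int alloc_rx_slot(struct ipoib_dev_priv *priv,
				 struct ipoib_cm_rx *p)
	{
		int index = priv->cm.rx_free_head;

		if (index >= max_rc_qp)
			return -ENOSPC;
		priv->cm.rx_free_head = priv->cm.rx_next_free[index];
		priv->cm.rx_index_table[index] = p;
		return index;
	}

Freeing a slot would just push its index back on the same list.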

>> Would multiple values be better here?  Something like: max_conn_qp,
>> qp_type, and use_srq.
> 
> 
> One of the goals was to keep the number of module parameters to a minimum.
> Currently, UD is the default qp_type, and when one switches to connected
> mode, RC QPs are used. At init we determine whether the HCA supports SRQ,
> so this too is not required as a module param.

Personally, I don't believe that automatic use of SRQ is desirable.  And
the code differences needed to support RC versus UC (without SRQ) seem
trivial.  So, rather than having one parameter mean "the maximum number
of RC QPs, with SRQ if it exists, but without SRQ if it doesn't exist",
let's separate these values.  Even if UC support isn't added right away,
it would be better to expose the right values initially than to have to
change them later.

> We compute the mask NOSRQ_INDEX_MASK based on max_rc_qp. This is used to 
> compute the wr_id through a bitwise AND. Hence we need that to be a power 
> of 2.

I'm saying that we don't need to restrict the number of QPs to a power
of 2.  We only need to restrict it to at most 2^(number of bits that we
want to dedicate from the wr_id to find the QP).  E.g., it's okay to
have a 4-bit or a 30-bit mask, but only support 12 QPs.
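To make that concrete, a quick sketch (the 12-bit width is arbitrary, and
the range check at load time is my addition, not something the patch does):

	/* Sketch only: the mask width just bounds the index space; the
	 * actual QP limit can be any value that fits under it. */
	#define NOSRQ_INDEX_BITS	12
	#define NOSRQ_INDEX_MASK	((1 << NOSRQ_INDEX_BITS) - 1)

	/* at module load: max_rc_qp = 12 is fine, it need not be 16 */
	if (max_rc_qp > (1 << NOSRQ_INDEX_BITS))
		return -EINVAL;

	/* in the completion path: recover the index, bound it by max_rc_qp */
	index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK;
	if (index >= max_rc_qp)
		return;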

> It is illustrative to see how people view the same thing differently  :) 
> That had never occurred to me. Any suggestions?

Yes - multiple parameters :)

qp_type
use_srq
max_qp
qp_size (covered by send_queue_size & recv_queue_size?)
message_size (covered by mtu?)

Yes - it's a lot of parameters, but I think they're needed to support 
connected mode.  If an admin can control the QP sizes and MTU, then they 
only need to limit the number of QPs.
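To sketch what I mean (the parameter names and defaults below are only
suggestions, not what the patch defines):

	/* Sketch only: possible connected-mode module parameters. */
	static char *cm_qp_type = "rc";		/* "rc" or "uc" */
	static int cm_use_srq = 1;		/* use an SRQ if the HCA has one */
	static int cm_max_qp = 128;		/* cap on connected-mode QPs */

	module_param(cm_qp_type, charp, 0444);
	MODULE_PARM_DESC(cm_qp_type, "Connected mode QP type (rc or uc)");
	module_param(cm_use_srq, int, 0444);
	MODULE_PARM_DESC(cm_use_srq, "Use an SRQ when the HCA supports one");
	module_param(cm_max_qp, int, 0444);
	MODULE_PARM_DESC(cm_max_qp, "Maximum number of connected mode QPs");

qp_size and message_size would presumably stay covered by the existing
send/recv queue size parameters and the MTU.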

It'd be nice to get some others' input.

>>> +
>>> +    /* In the SRQ case there is a common rx buffer called the srq_ring.
>>> +     * However, for the NOSRQ case we create an rx_ring for every
>>> +     * struct ipoib_cm_rx.
>>> +     */
>>> +    p->rx_ring = kzalloc(ipoib_recvq_size * sizeof *p->rx_ring,
>>> +                         GFP_KERNEL);
>>> +    if (!p->rx_ring) {
>>> +        printk(KERN_WARNING "Failed to allocate rx_ring for 0x%x\n",
>>> +               qp_num);
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    spin_lock_irq(&priv->lock);
>>> +    list_add(&p->list, &priv->cm.passive_ids);
>>> +    spin_unlock_irq(&priv->lock);
>>> +
>>> +    init_context_and_add_list(cm_id, p, priv);
> 
>> stale_task thread could be executing on 'p' at this point.  Is that
>> acceptable?  (I'm pretty sure I pointed this out before, but I don't
>> remember what the response was.)
> 
> 
> In the previous review of version v6, you had caught a bug (with which I
> concurred). That has been fixed now.
> 
> 
>> We just added 'p' to the passive_ids list here, but
>> init_context_and_add_list() also adds it to the list, though only in the
>> SRQ case.  It would be cleaner to always add it to the list inside
>> init_context_and_add_list() or to always do it outside of that function.
> 
> 
> I am not sure I understand this. init_context_and_add_list() adds to the
> list conditionally.

The end result of the code is that p->list is always added to the
priv->cm.passive_ids list:

no SRQ case - allocate_and_post_rbuf_nosrq():

	spin_lock_irq(&priv->lock);
	list_add(&p->list, &priv->cm.passive_ids);
	spin_unlock_irq(&priv->lock);

	init_context_and_add_list(cm_id, p, priv);

SRQ case - ipoib_cm_req_handler():

	if (priv->cm.srq) {
		p->state = IPOIB_CM_RX_LIVE;
		init_context_and_add_list(cm_id, p, priv);
	}

     init_context_and_add_list(...)
     {
	...
	if (priv->cm.srq) {
		if (p->state == IPOIB_CM_RX_LIVE)
			list_move(&p->list, &priv->cm.passive_ids);

Why can't this always just be done as:

static void init_context_and_add_list(...)
{
	...
	list_add(&p->list, &priv->cm.passive_ids);
	...
}


> Would it be better to not drop the lock at all, but hold it until all 3
> are done? This is not in the packet receive path, and hence not critical.

It's not the performance that's bothering me so much as the code being 
structured in such a way that the same lock is acquired/released 3 times 
back to back to back.
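That is, roughly (a sketch only - I'm guessing that the context setup is
the main thing init_context_and_add_list() does under the lock):

	/* Sketch only: take priv->lock once for the related updates
	 * instead of three back-to-back lock/unlock pairs. */
	spin_lock_irq(&priv->lock);
	cm_id->context = p;
	list_add(&p->list, &priv->cm.passive_ids);
	spin_unlock_irq(&priv->lock);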

> As mentioned previously, the state applies to SRQ only. I did not combine
> the routines since they are quite small and this does get used in the
> packet receive path.

The state applying to SRQ only was a design decision.  It looks like it
could be used in the no-SRQ case, but wasn't.

>>>      if (unlikely(wr_id >= ipoib_recvq_size)) {
> 
>> Why would this ever occur?
> 
> 
> If you look at the previous IPoIB code (even UD), this check has always
> been there - probably to detect WQE corruption?

IMO, I would toss these checks then.  They only protect some of the wr_id
bits against a fairly limited type of corruption.

>>> +        ipoib_warn(priv, "cm recv completion event with wrid %lld (> %d)\n",
>>> +                   (unsigned long long)wr_id, ipoib_recvq_size);
>>> +        return;
>>> +    }
>>> +
>>> +    index = (wc->wr_id & ~IPOIB_CM_OP_RECV) & NOSRQ_INDEX_MASK;
>>> +
>>> +    /* This is the only place where rx_ptr could be a NULL - could
>>> +     * have just received a packet from a connection that has become
>>> +     * stale and so is going away. We will simply drop the packet and
>>> +     * let the hardware (it is IB_QPT_RC) handle the dropped packet.
> 
>> I don't understand this comment.  How can the hardware handle a packet
>> dropped by software?
> 
> 
> Under the conditions described we drop the packet, and since it is an RC
> connection, the remote side will detect a timeout and its hardware will
> automatically initiate retransmission until a RETRY_EXCEEDED error occurs.

This still doesn't make sense to me.  An ACK was already generated by 
the local hardware.  Tossing the receive doesn't cause the remote 
hardware to resend the packet.

>> If the completion can be for a connection that has gone away, what's to
>> prevent a new connection from grabbing the same slot in the
>> rx_index_table.  If this occurs, then the completion will reference the
>> wrong connection.
> 
> 
> It does not matter if, after a connection has gone away, a new connection
> grabs the same slot (that is likely to happen with the linear search). If
> the old connection comes back, it will get a new slot in the
> rx_index_table.

Yes - but a receive for the old connection will reference the
rx_index_table entry now owned by the new connection.  See below:

>>> +     * In the timer_check() function below, p->jiffies is updated and
>>> +     * hence the connection will not be stale after that.
>>> +     */
>>> +    rx_ptr = priv->cm.rx_index_table[index];
>>> +    if (unlikely(!rx_ptr)) {
>>> +        ipoib_warn(priv, "Received packet from a connection "
>>> +               "that is going away. Hardware will handle it.\n");
>>> +        return;
>>> +    }

If this check can ever succeed, then it's also possible for rx_ptr to
reference the wrong connection.  rx_index_table[index] should not be freed
until all receives associated with that QP have been processed.
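In other words, something along these lines on the teardown side (a sketch
only; whether p->qp is the right handle and how the drain is awaited are
assumptions on my part):

	/* Sketch only: move the QP to the error state so every posted
	 * receive comes back flushed, and only then release the slot. */
	struct ib_qp_attr qp_attr = { .qp_state = IB_QPS_ERR };

	ib_modify_qp(p->qp, &qp_attr, IB_QP_STATE);

	/* ... wait until all outstanding receives for this QP have been
	 * reaped from the CQ ... */

	spin_lock_irq(&priv->lock);
	priv->cm.rx_index_table[index] = NULL;	/* slot may now be reused */
	spin_unlock_irq(&priv->lock);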

> There have been a lot of discussions about this very issue. It was strongly
> suggested that I keep the if (srq) checks to a bare minimum, especially
> since this is in the packet receive path.

I agree that we should limit checks in the receive path, but duplicating 
a fair amount of code to avoid 1 extra check doesn't seem worth it. 
We'll end up taking the same branch every time anyway.  I'd rather make 
the code easy to maintain first, with the burden placed on showing the 
performance gained by removing the branch.
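Roughly, I'd picture the receive handler looking like this (a sketch;
aside from priv->cm.srq and NOSRQ_INDEX_MASK, the field names are my
guesses based on the quoted patch):

	/* Sketch only: one completion handler, with the SRQ/no-SRQ split
	 * limited to where the rx ring actually differs. */
	static void ipoib_cm_handle_rx_wc(struct net_device *dev,
					  struct ib_wc *wc)
	{
		struct ipoib_dev_priv *priv = netdev_priv(dev);
		u64 wr_id = wc->wr_id & ~IPOIB_CM_OP_RECV;
		struct ipoib_cm_rx_buf *rx_ring;

		if (priv->cm.srq)	/* predicted correctly every time */
			rx_ring = priv->cm.srq_ring;
		else
			rx_ring = priv->cm.rx_index_table[wr_id &
					NOSRQ_INDEX_MASK]->rx_ring;

		/* ... common unmap / repost / hand the skb up ... */
	}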

> That may be possible. However, in the no-SRQ case we need to do that upon
> receipt of every REQ, whereas for SRQ we need to do it only once. That is
> why it is convenient to do it here.

It ends up using a lot of memory.  For the SRQ case, having something
adaptable may be better.  Rather than posting the maximum up front, post
only recv_queue_size receives with each connection (up to some total
maximum).  As connections go away, let the number of posted receives drop
back down as buffers are consumed.  This change is outside the scope of
this patch, though.
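Very roughly (the srq_posted counter and max_srq_receives cap here are
hypothetical, not something the patch has):

	/* Sketch only: on each new connection, post at most recv_queue_size
	 * buffers to the SRQ, subject to a global cap. */
	spin_lock_irq(&priv->lock);
	to_post = min_t(int, ipoib_recvq_size,
			max_srq_receives - priv->cm.srq_posted);
	priv->cm.srq_posted += to_post;
	spin_unlock_irq(&priv->lock);

	/* ... post 'to_post' buffers using free SRQ ring slots; when a
	 * connection goes away, simply don't repost its share as those
	 * buffers are consumed, letting srq_posted drift back down ... */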

- Sean


