[ofa-general] Re: NOSRQ misc patch [PATCH V1]

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Sun Jul 22 17:06:10 PDT 2007


Michael S. Tsirkin wrote:
>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>
>> Michael S. Tsirkin wrote:
>>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>>>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>>>
>>>> Michael S. Tsirkin wrote:
>>>>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>>>>>> Subject: Re: NOSRQ misc patch [PATCH V1]
>>>>>>
>>>>>> Michael S. Tsirkin wrote:
>>>>>>>> @@ -1168,9 +1170,9 @@ static struct ib_qp *ipoib_cm_create_tx_
>>>>>>>>  	attr.recv_cq = priv->cq;
>>>>>>>>  	attr.srq = priv->cm.srq;
>>>>>>>>  	attr.cap.max_send_wr = ipoib_sendq_size;
>>>>>>>> -	attr.cap.max_recv_wr = 1;
>>>>>>>> +	attr.cap.max_recv_wr = 0;
>>>>>>>>  	attr.cap.max_send_sge = 1;
>>>>>>>> -	attr.cap.max_recv_sge = 1;
>>>>>>>> +	attr.cap.max_recv_sge = 0;
>>>>>>>>  	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
>>>>>>>>  	attr.qp_type = IB_QPT_RC;
>>>>>>>>  	attr.send_cq = cq;
>>>>>>> I don't see how does this fix things.
>>>>>>> This line 
>>>>>>>>  	attr.srq = priv->cm.srq;
>>>>>>> connected the TX QP to SRQ, making it possible to get packets on this QP.
>>>>>>> But if cm.srq is NULL, and a remote sends a packet on this connection,
>>>>>>> the connection will get closed. Which is a quality of implementation issue.
>>>>>>>
>>>>>> When the QP numbers are exchanged correctly, then it should not receive
>>>>>> a packet on this QP in the first place.
>>>>> Re-read the RFC. It is perfectly legal to reuse a passive QP for transmitting
>>>>> packets. We don't do this currently but we might in the future.
>>>> I presume you mean passive side for receiving.
>>> A passive side is the one that gets a REQ (look in IB spec section 12.9.6).
>>> Under IPoIB passive side can perform post send on the QP created.
>>> To make this work, I connect the QP to the SRQ on the active side:
>>>>         attr.srq = priv->cm.srq;
>>> However, with your patch, priv->cm.srq might be NULL, which
>>> means that the QP won't be attached to SRQ. This is
>>> a quality of implementation issue that your patch is introducing.
>>>
>> I do not understand -for one you mention transmitting packets and, on the
>> other hand you mention SRQ.
>> I have already tested the series of NOSRQ patches for interoperability between 
>> IBM and Mellanox adapters and it works. I do not see the quality of implementation
>> issues that you keep referring to.
> 
> Do you understand why is this line there?
>          attr.srq = priv->cm.srq;
> 
> I'll try to explain.
> 
> The snippet above creates a QP (that will then be connected to remote side).
> According to IPoIB CM RFC, once the connection is set up, and if the remote
> wants to send us some packets, it could send them over this
> existing connection.
> 
> And with current code, this would work correctly, because
> all RC QP we create are connected to the common SRQ and a common CQ,
> thus such packets get receive WCs and get sent up the stack.
> 
> But with your patch, priv->cm.srq == NULL so you create RC QPs that
> are not connected to SRQ, and you never post any receive WRs,
> so if the remote sends even a single packet on this QP, the QP
> will transfer to error state.
> 
I do not post any WRs because I do not expect any packets to be received.
If it does receive any packets an RNR will be returned (as expected).
The Queues in the Queue Pairs are not being used symmetrically that is all.
Also the priv->cm.srq is to NULL only in the non-SRQ case. The SRQ case 
is as before.

> This is a regression: QPs are supposed to have receive WRs
> preposed. If you consider e.g. TCP, it's easy to imagine that
> the packet remote was sending was an ACK, so it won't
> retry - until we destroy the connection, create a new one,
> resend the packet - and an ACK will kill the QP again.
> 

There is nothing about asymmetric usage of the Queues. And hence
I see no problems. If in TCP one sends to the wrong port, the 
packet gets dropped. This is similar to that.

> This just happens to never occur for you because you have recent linux kernel
> on both sides of the link, and because linux is currently not smart enough
> to reuse an existing connection - so it creates a new one and
> your bug is hidden from view.
> 

This code has been there since day one. I do not understand the reasoning for raising 
issues on the eve of the acceptance of this patch. Why bring it up now?

Pradeep




More information about the general mailing list