[ofa-general] Re: IPOIB CM (NOSRQ) extension

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Tue Jun 12 10:11:52 PDT 2007


Michael S. Tsirkin wrote:
>> Quoting Michael S. Tsirkin <mst at dev.mellanox.co.il>:
>> Subject: Re: IPOIB CM (NOSRQ) extension
>>
>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>>> Subject: Re: IPOIB CM (NOSRQ) extension
>>>
>>> Michael S. Tsirkin wrote:
>>>>> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
>>>>> Subject: IPOIB CM (NOSRQ) extension
>>>>>
>>>>> This patch handles the corner case of running out of RC QPs. In that
>>>>> case it switches to UD mode. This patch can be used both by NOSRQ and
>>>>> SRQ code.
>>>>>
>>>>> Signed-off-by: Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>
>>>> You don't provide any way to retry going back to connected mode,
>>>> after a failure, which is really intermittent by nature. That's pretty bad.
>>> This node switched to datagram mode, because the passive side was
>>> under a resource crunch (no RC QPs). And, the user is indeed alerted
>>> about this condition. So, yes we do not attempt to go back to connected
>>> mode.
>> Need to retry switching to datagram mode after a while.
> 
> Sorry, that should have been "switching to connected mode".

So, you are suggesting that we ping-pong between datagram mode and
connected mode. In the first place I was opposed to just switching to
datagram mode when there are no RC QPs. This suggestion goes even
further.

We seem to have polar opposite viewpoints on this issue. Rather than
simply persisting with them, we need to back them up with more concrete
reasoning.

I disagree with this approach for the following reasons:

1) This switch to datagram mode happens when we are in a resource
crunch. The resource crunch should be flagged and corrective action
taken. Switching to datagram mode simply prolongs the agony.

2) Ping-ponging between connected mode and datagram mode makes the
situation even worse. In HPC environments, cluster nodes do not simply
appear and disappear; they stay on in the cluster. So, trying to switch
back to connected mode does not achieve any purpose.

Can you tell me why "switching to connected mode" is a must?

Pradeep