[ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Fri May 25 13:01:29 PDT 2007


Michael S. Tsirkin wrote:
>> Here are my thoughts about limiting the memory footprint for IPOIB CM
>> (NOSRQ) patch:
>>
>> By default, cap the NOSRQ memory usage to 1GB.
> 
> ppc systems I have start crashing if you map as much as 300MB for DMA.

If PPC systems start crashing when you map as much as 300MB for DMA, then
yes, that would be a gating factor when deploying this patch in certain
configurations. But then MPI applications (on UD) that map more than
300MB should be crashing those systems even without this patch - right?

That is a separate problem and it clouds this discussion. Please post the
problem on the ppc/ppc64 mailing list.
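
For context, the arithmetic behind the 1GB cap and the 128-endpoint figure
quoted below is straightforward (a rough sketch only; the macro names are
illustrative, not identifiers from the patch):

  #define NOSRQ_MEM_CAP   (1UL << 30)      /* proposed default cap: 1GB  */
  #define RECVQ_SIZE      128              /* default recvq_size         */
  #define RX_BUF_SIZE     (64 * 1024)      /* 64KB ring buffer elements  */

  /* per-QP receive ring: 128 * 64KB = 8MB */
  #define PER_QP_RX_MEM   (RECVQ_SIZE * RX_BUF_SIZE)

  /* 1GB / 8MB = 128 endpoints before the cap is reached */
  #define MAX_NOSRQ_QPS   (NOSRQ_MEM_CAP / PER_QP_RX_MEM)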

> 
>> The default recvq_size
>> is set to 128. Therefore for 64KB packets this would imply a maximum of
>> 128 endpoints.
>>
>> -Make the maximum number of endpoints a module parameter with a default
>> value of 128.
>>
>> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is
>> the default limit and could be changed as needed (by the administrator)
>> depending on the system configuration, application needs and so on.
> 
> All this need for manual tuning is really going in the wrong direction:
> we should be looking for ways to get rid of existing module
> parameters, like using a low-watermark event to dynamically tune the RQ
> depth.
> 
>> The
>> server would return a "REJ" message upon receiving a "REQ", whenever one
>> of these limits (i.e. max number of endpoints or the max NOSRQ memory
>> usage) is reached. Currently, we only check for the maximum number of
>> endpoints - hard-coded to 1024.
> 
> So with the limit sufficiently low, we will hopefully avoid crashing the server.
> That's progress, but what happens to the client when it gets this reject?
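
For reference, the kind of passive-side check being described is roughly
the following (a sketch only; identifiers such as max_rc_qp, current_rc_qp,
rx_buf_size and nosrq_mem_cap are illustrative, not the actual patch code):

  /* On receiving a REQ, refuse the connection if either limit would be
   * exceeded, so that already-connected endpoints keep working.
   */
  static int nosrq_limits_exceeded(void)
  {
          unsigned long per_qp_mem = recvq_size * rx_buf_size;

          if (current_rc_qp >= max_rc_qp)                       /* endpoint limit */
                  return 1;
          if ((current_rc_qp + 1) * per_qp_mem > nosrq_mem_cap) /* memory cap */
                  return 1;
          return 0;
  }

  /* ...and in the REQ handler, something along the lines of: */
  if (nosrq_limits_exceeded())
          return ib_send_cm_rej(cm_id, IB_CM_REJ_NO_RESOURCES,
                                NULL, 0, NULL, 0);

What the active side should do when that REJ arrives is, as you say, the
real open question - more on that below.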


> 
>> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
>> support SRQ, like the Topspin HCA, and such HCAs should not be
>> impacted at all.
> 
> I don't think it's that clean yet.
> 
> Here's an idea: implement "fake SRQ" for ehca in software: have post_recv
> on the SRQ queue the WRs, and spread them evenly between QPs as they
> connect.  Once the number of QPs goes above some limit, the create QP
> command will fail.  This would contain the mess nicely inside ehca (I
> think you'll want to add a flag that lets software figure out that the
> SRQ is fake).
> 
> We will still be left with the basic problem of what to do
> at the active side upon the reject, though.

As you indicate this will not solve the problem, so it is not an option.

> 
>> -Currently we allocate a default of 64KB for the ring buffer elements,
>> and this buffer size is not linked to the mtu. In the future, we could
>> allocate buffers based on the mtu and link that into the computation of
>> the memory cap. This way customers who might want to use a smaller mtu
>> could use a larger number of endpoints, or a larger recvq_size without
>> exceeding the memory cap.
> 
> I think that conceptually, global MTU config is intended for outgoing packets,
> not for the RX buffers. For example, how would we handle MTU changes?
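
To put rough numbers on what I had in mind above (illustrative only,
assuming 2KB buffers for the usual 2044-byte IPoIB MTU):

  /* per-QP ring with MTU-sized buffers: 128 * 2KB   = 256KB
   * endpoints under the 1GB cap:        1GB / 256KB  = 4096
   * (versus 128 endpoints with the current 64KB buffers)
   */
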
> 
>> Would this approach address the issues of scalability and enable IPOIB
>> CM to be turned on as the default?
> 
> For IPoIB CM to be the default, it needs to work as well as datagram mode
> for most usage scenarios. Unfortunately, your proposal above seems to fail
> to satisfy this requirement: it will improve speed in some scenarios, but
> in others it will either drastically increase the need for manual
> configuration, cause denial of service, or use up a huge amount of memory.

My viewpoint is that this is akin to a QoS issue. If the request cannot
be satisfied, return an error to the user-level application and let
the application decide what to do. Don't do anything under the covers.

I have tried to explain that the corner case you cite will be encountered
only by PPC systems using the IBM HCA; the rest will be unaffected. PPC
systems deployed as cluster nodes are unlikely to encounter this problem.

However, we seem to have ideas at opposite ends of the spectrum, and any
amount of debate will not resolve this issue. One way to move this
discussion forward is to implement both options for the corner case and
let the user/sysadmin choose:
a) return an error to the user-level application, and leave it to the
   application to decide, when there are no more RC QPs
b) switch the active side to using datagram mode when there are no more
   RC QPs

If we choose to go this route, it will mean yet another module parameter
to let the user decide, or worse, compile-time options - neither of which
is palatable. Any other suggestions?
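
If we did go that route, the knob itself would be trivial - something
along these lines (purely hypothetical, not part of the posted patch):

  /* 0 = return an error when no more RC QPs are available,
   * 1 = fall back to datagram (UD) mode for that destination.
   */
  static int nosrq_fallback_to_ud = 0;
  module_param(nosrq_fallback_to_ud, int, 0444);
  MODULE_PARM_DESC(nosrq_fallback_to_ud,
                   "On REJ due to NOSRQ limits: 0 = report error, 1 = fall back to UD");

The objection stands, of course: it is still one more knob for the
administrator to understand.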

If we can agree upon this approach, I will start another thread to
discuss just this corner case, and with this patch (or a minor variant)
permit IPOIB CM to be used as the default. I do not want the corner
case to be the gating factor for this patch.

Pradeep