[ofa-general] Re: IPOIB CM (NOSRQ) patch -memory footprint

Michael S. Tsirkin mst at dev.mellanox.co.il
Wed May 23 22:38:19 PDT 2007


> Here are my thoughts about limiting the memory footprint for IPOIB CM
> (NOSRQ) patch:
> 
> By default, cap the NOSRQ memory usage to 1GB.

The ppc systems I have start crashing if you map as much as 300MB for DMA.

> The default recvq_size
> is set to 128. Therefore for 64KB packets this would imply a maximum of
> 128 endpoints.
> 
> -Make the maximum number of endpoints a module parameter with a default
> value of 128.
> 
> -The NOSRQ limit of 1GB is also made a module parameter. However, 1GB is
> the default limit and could be changed as needed (by the administrator)
> depending on the system configuration, application needs and so on.
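
For concreteness, the two knobs above boil down to something like this
(parameter names are mine, not from the patch; untested):

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Sketch only -- names made up for illustration, not the real patch. */
static unsigned int max_nosrq_mem_mb = 1024;	/* 1GB cap on RX buffers */
module_param(max_nosrq_mem_mb, uint, 0444);
MODULE_PARM_DESC(max_nosrq_mem_mb,
		 "Max memory (MB) used for non-SRQ receive buffers");

static unsigned int max_nosrq_conn = 128;	/* max connected endpoints */
module_param(max_nosrq_conn, uint, 0444);
MODULE_PARM_DESC(max_nosrq_conn,
		 "Max number of connected (non-SRQ) endpoints");

Note that the two defaults are consistent with each other:
128 endpoints * 128 receive buffers * 64KB per buffer = 1GB.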

All this need for manual tuning is really going in the wrong direction:
we should be looking for ways to get rid of the existing module
parameters, for example by using a low-watermark event to dynamically
tune the RQ depth.
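
(By "low watermark event" I mean something like the SRQ limit mechanism;
very roughly, and untested -- the no-SRQ code would need a per-QP
equivalent of this:)

#include <rdma/ib_verbs.h>

/* Arm the SRQ limit: the HCA fires an async event once fewer than
 * 'watermark' receive WRs are left on the SRQ. */
static int arm_srq_watermark(struct ib_srq *srq, u32 watermark)
{
	struct ib_srq_attr attr = { .srq_limit = watermark };

	return ib_modify_srq(srq, &attr, IB_SRQ_LIMIT);
}

/* SRQ async event handler: repost receive buffers (or grow the queue)
 * and re-arm.  The repost itself is left out of this sketch. */
static void srq_event_handler(struct ib_event *event, void *ctx)
{
	if (event->event == IB_EVENT_SRQ_LIMIT_REACHED)
		arm_srq_watermark(event->element.srq, 16 /* arbitrary */);
}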

> The
> server would return a "REJ" message upon receiving a "REQ", whenever one
> of these limits (i.e. the maximum number of endpoints or the maximum
> NOSRQ memory usage) is reached. Currently, we only check for the maximum
> number of endpoints, which is hard-coded to 1024.

So with the limit sufficiently low, we will hopefully avoid crashing the server.
That's progress, but what happens to the client when it gets this reject?
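
Sending the REJ itself is the easy half -- in the REQ handler it is
roughly (untested, the counter and limit are made up for illustration):

#include <linux/atomic.h>
#include <rdma/ib_cm.h>

static atomic_t nosrq_conns = ATOMIC_INIT(0);	/* connected endpoints */
static unsigned int max_nosrq_conn = 128;	/* the proposed module param */

/* Sketch of the passive-side check when a REQ arrives. */
static int nosrq_handle_req(struct ib_cm_id *cm_id)
{
	if (atomic_inc_return(&nosrq_conns) > max_nosrq_conn) {
		atomic_dec(&nosrq_conns);
		/* Out of endpoints: refuse the connection. */
		return ib_send_cm_rej(cm_id, IB_CM_REJ_NO_QP,
				      NULL, 0, NULL, 0);
	}
	/* ... otherwise allocate the RX ring, create the QP, send the REP ... */
	return 0;
}

The open question is entirely on the active side.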

> -The IPOIB CM (NOSRQ) patch is practically transparent to HCAs that
> support SRQ, such as the Topspin HCA, and such HCAs should not be
> impacted at all.

I don't think it's that clean yet.

Here's an idea: implement a "fake SRQ" for ehca in software: have the post_recv
verb on the SRQ queue the WRs and spread them evenly between the QPs as they
connect.  Once the number of QPs goes above some limit, the create QP command
will fail.  This would contain the mess nicely inside ehca (I think you'll want
to add a flag that lets software figure out that the SRQ is fake).
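
Very roughly, the software side could look like this (all names made up,
nothing here is the real ehca driver; untested):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <rdma/ib_verbs.h>

struct fake_srq_qp {
	struct list_head list;
	struct ib_qp	*qp;
};

struct fake_srq {
	spinlock_t	 lock;
	struct list_head qp_list;	/* QPs attached to this "SRQ"	*/
	unsigned int	 num_qps;
	unsigned int	 max_qps;	/* create_qp fails above this	*/
};

/* Software post_srq_recv: hand the WR to the next QP, round-robin. */
static int fake_srq_post_recv(struct fake_srq *srq, struct ib_recv_wr *wr,
			      struct ib_recv_wr **bad_wr)
{
	struct fake_srq_qp *fqp;
	unsigned long flags;
	int ret;

	spin_lock_irqsave(&srq->lock, flags);
	if (!srq->num_qps) {
		/* No QPs yet; a real version would queue the WR here
		 * and replay it as connections come up. */
		*bad_wr = wr;
		ret = -ENOMEM;
	} else {
		fqp = list_entry(srq->qp_list.next, struct fake_srq_qp, list);
		ret = ib_post_recv(fqp->qp, wr, bad_wr);
		/* Rotate so the next post goes to a different QP. */
		list_move_tail(&fqp->list, &srq->qp_list);
	}
	spin_unlock_irqrestore(&srq->lock, flags);
	return ret;
}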

We will still be left with the basic problem of what to do
at the active side upon the reject, though.

> -Currently we allocate a default of 64KB for the ring buffer elements,
> and this buffer size is not linked to the mtu. In the future, we could
> allocate buffers based on the mtu and link that into the computation of
> the memory cap. This way customers who might want to use a smaller mtu
> could use a larger number of endpoints, or a larger recvq_size without
> exceeding the memory cap.

I think that, conceptually, the global MTU configuration is meant for outgoing
packets, not for the RX buffers. For example, how would we handle MTU changes?
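
To put numbers on it: with the current 64KB buffers the 1GB cap works out
to 1GB / (128 * 64KB) = 128 endpoints, while 2KB buffers would allow about
4096 -- a real win, but one that has to be revisited every time the MTU
(and hence the buffer size) changes.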

> Would this approach address the issues of scalability and enable IPOIB
> CM to be turned on as the default?

For IPoIB CM to be the default, it needs to work as well as datagram mode in
most usage scenarios. Unfortunately, your proposal above seems to fail to
satisfy this requirement: it will improve speed in some scenarios, but in
others it will either drastically increase the need for manual configuration,
cause denial of service, or use up huge amounts of memory.

I think that to be able to use connected mode on ehca, what you need is

1. Find a way to make IPoIB fall back on datagram mode when you run out of
   resources.  This might need to be addressed at the protocol level.
2. Separate the noSRQ hacks more cleanly. I suggested some ways to do this
   earlier. Maybe the "fake SRQ" above will be a good way to solve it.

-- 
MST


