[ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review

Michael S. Tsirkin mst at dev.mellanox.co.il
Mon May 14 23:26:46 PDT 2007


> Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> 
> Michael S. Tsirkin wrote:
> >>Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> >>Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> >>
> >>Michael S. Tsirkin wrote:
> >>>>Quoting Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com>:
> >>>>Subject: Re: IPOIB CM (NOSRQ)[PATCH V4] patch for review
> >>>>
> >>>>If there are no other issues than the small restructure suggestion that
> >>>>Michael had, can this patch be merged into the for-2.6.22 tree?
> >>>I'm not sure.
> >>>
> >>>I haven't the time, at the moment, to go over the patch again in depth.
> >>>Have the issues from this message been addressed?
> >>>
> >>>http://www.mail-archive.com/general@lists.openfabrics.org/msg02056.html
> >>>
> >>>Just a quick review, it seems that the two most important issues have
> >>>apparently not been addressed yet:
> >>>
> >>>1. Testing device SRQ capability twice on each RX packet is just too 
> >>>ugly,
> >>>  and it *should* be possible to structure the code
> >>>  by separating common functionality in separate
> >>>  functions instead of scattering if (srq) tests around.
> >>I have restructured the code as suggested. In the latest code, there are
> >>only two places where SRQ capability is tested upon receipt of a packet:
> >>a) ipoib_cm_handle_rx_wc()
> >>b) ipoib_cm_post_receive()
> >>
> >>Instead of the suggested change to ipoib_cm_handle_rx_packet() it is
> >>possible to change ipoib_cm_post_receive() and call the srq and nosrq
> >>versions directly, without mangling the code. However, I do not believe
> >>that this should stop the code from being merged. This can be handled
> >>as a separate patch.
> >
> >I actually suggested implementing separate poll routines
> >for srq and non-srq code. This way we won't have *any* if(srq)
> >tests on datapath.
> 
> Right, I remember you suggested that. From a maintainability perspective
> I tried to use as much common code as possible.

Sprinkling if (srq) all over the code is not necessarily the best way to reuse code.
Moving common code into separate functions is a better way, IMO.
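
A minimal sketch of the shape I have in mind - the helper names
(ipoib_cm_get_rx_skb_srq/_nosrq) are only illustrative stand-ins for
whatever per-variant work each path does, not the actual patch code:

	/* sketch only: assumes the usual ipoib includes
	 * (linux/netdevice.h, rdma/ib_verbs.h) */

	/* work that is identical for the SRQ and non-SRQ receive paths */
	static void ipoib_cm_rx_common(struct net_device *dev,
				       struct sk_buff *skb)
	{
		skb->dev = dev;
		netif_receive_skb(skb);
	}

	/* each variant gets its own completion handler, selected once at
	 * init time, so the datapath never needs an if (srq) test */
	void ipoib_cm_handle_rx_wc_srq(struct net_device *dev,
				       struct ib_wc *wc)
	{
		struct sk_buff *skb = ipoib_cm_get_rx_skb_srq(dev, wc);

		if (skb)
			ipoib_cm_rx_common(dev, skb);
	}

	void ipoib_cm_handle_rx_wc_nosrq(struct net_device *dev,
					 struct ib_wc *wc)
	{
		struct sk_buff *skb = ipoib_cm_get_rx_skb_nosrq(dev, wc);

		if (skb)
			ipoib_cm_rx_common(dev, skb);
	}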

> Therefore I did not implement
> separate polling routines as suggested. So, it boils down to one if(srq)
> in the data path.

Which patch are we discussing? Patch V4 has 3 of these on the data path.
The one in alloc_rx_skb also seems to be open-coded - bad for cache usage.

> I really do not think that should be a point of contention.

True, it's not a *major* point; scalability is still the larger issue.
But IMO fixing this would make the patch less ugly.

> >>>2. Once the number of created connections exceeds
> >>>  the constant that you allow, all attempts to communicate
> >>>  with this node over IP over IB will fail.
> >>>  A way needs to be designed to switch to the datagram mode,
> >>>  and to retry going back to connected after some time.
> >>>  [We actually have this theoretical issue in SRQ
> >>>   as well - it is just much more severe in the nonSRQ case].
> >>Firstly, this has now been changed to send a REJ message to the remote
> >>side indicating that there are no more free QPs.
> >
> >Since the HCA actually has free QPs - what you are running out of is buffers
> >that you are ready to prepost - one might argue about whether this is spec
> >compliant behaviour.  This is something that might be better checked with
> >the IBTA.
> >
> >>It is up to the application to handle the situation.
> >
> >The application here being kernel IP over IB, it currently handles the
> >reject by dropping outstanding packets and retrying the connection on the
> >next packet to this dst.  So the specific node might be denied connectivity
> >potentially forever.
> 
> When I stated application, I did not mean IPoIB. I meant the user level
> app. Yes, the app will keep on retrying to establish a connection to the
> specified node using Connected Mode and then subsequently time out.

So, how would an application handle the situation?

> See more comments below.
> 
> >
> >>Previously, this was flagged as an error that
> >>appeared in /var/log/messages.
> >>
> >>However, here are a few other things we need to consider. Let us
> >>compute the amount of memory consumed when we run into this situation:
> >>
> >>In CM mode we use 64K packets. Assuming the rx_ring has 256 entries and
> >>the current limitation of 1024 QPs, NOSRQ alone will consume 16GB of
> >>memory. All else remaining the same, if we change the rx_ring size to
> >>1024, NOSRQ will consume 64GB of memory.
> >>
> >>This is huge and my guess is that on most systems, the application will 
> >>run out of memory before it runs out of RC QPs (with NOSRQ).
> >>
> >>Aside from this I would like to understand how we would switch just the
> >>"current" QP to datagram mode; we would not want to switch all the
> >>existing QPs to datagram mode - that would be unacceptable. Also, we
> >>should not prevent subsequent connections from using RC QPs. Is there
> >>anything in the IB spec about this?
> >
> >Yes, this might need a solution at the protocol level, as you indicate 
> >above.
> 
> I thought through this some more and I do not believe that this is such
> a good idea (i.e. switching to datagram mode). The app (user level) is
> expecting to use RC and we silently (or even with warnings) switch to
> UD mode - I do not think that is appropriate.

Which app is this?

> The app should time out or be returned an error and maybe the app can
> switch to using another node that has the requested resources. The onus
> is on the user level app to take appropriate action.

Most applications can't do this, however, so your patch will break them.

> The equivalent situation in a non-IB environment would be when the
> recipient node has no more memory to respond to an ARP request. The
> app receives a "node unreachable" message.

I think you mean that on a TCP socket, connect will return ENETUNREACH, rather
than a message?  But since ARP normally uses multicast, if the remote won't
accept connections, this is *not* what will happen here, is it?

> Therefore I am inclined to say we should leave this as is.

The main difference is that on a LAN, arp timeouts don't really occur
too often in practice - they are sufficiently rare that lots of
applications regard TCP errors as a "node is down" indication.

> >
> >>I think solving this is a fairly big issue and not just specific to
> >>NOSRQ. NOSRQ is just exacerbating the situation. This can be dealt with
> >>all at once with SRQ and NOSRQ, if need be.
> >
> >IMO, the memory scalability issue is specific to your code.
> >With the current code using a shared RQ, each connection needs
> >on the order of 1KByte of memory. So we need just 10MByte
> >for a typical 10000 node cluster.
> >
> 
> Right, I have always maintained that NOSRQ is indeed a memory hog. I
> think we must revisit this memory computation for the srq case too -
> I would say the receive buffers consumed would be 64K (packet size) *
> 1000 (srq_ring_size), i.e. 64MBytes, irrespective of the number of
> nodes in the cluster. However, the question that is still
> unanswered (at least in my mind) is, will 1000 buffers be sufficient
> to support a 10,000 or even a 1000 node cluster? On just a 2 node
> cluster (using UD) we had seen previously that a receiveq_size of 256
> was inadequate.

You should distinguish between occasional packet drops due to RQ overrun,
which happen all the time on the internet (so protocols are built to handle
them), and dropping *all* packets to a specific destination, which is
a quality of implementation issue.

> I would guess even in the SRQ case that would be true.

Less likely, since each buffer is 32 times larger now.  Further, with SRQ we can
auto-tune the buffer size by using watermark events. Stay tuned.
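
Roughly what I mean by watermark events - a sketch only, not tested, and
the actual refill/resize policy in the handler is left out:

	/* sketch only: assumes the usual rdma/ib_verbs.h declarations */

	static void ipoib_srq_event_handler(struct ib_event *event, void *ctx)
	{
		/* IB_EVENT_SRQ_LIMIT_REACHED fires (once per arming) when
		 * the number of posted receives drops below the limit */
		if (event->event == IB_EVENT_SRQ_LIMIT_REACHED) {
			/* e.g. schedule work to repost or resize buffers,
			 * then re-arm the limit */
		}
	}

	static int ipoib_arm_srq_limit(struct ib_srq *srq, u32 watermark)
	{
		struct ib_srq_attr attr = {
			.srq_limit = watermark,
		};

		return ib_modify_srq(srq, &attr, IB_SRQ_LIMIT);
	}

	/* the handler above goes into ib_srq_init_attr.event_handler when
	 * the SRQ is created with ib_create_srq() */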

> To support large clusters one will run into memory issues even in the
> SRQ case,

I don't really think IPoIB with SRQ will run into memory issues even with
large clusters.
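
A rough comparison, using only the numbers already quoted in this thread
(64K packets; 256-entry rings and 1024 QPs for NOSRQ; a 1000-entry shared
ring plus ~1KByte of per-connection state for SRQ):

	NOSRQ: 64KB * 256 * 1024 QPs      = 16GB of receive buffers
	SRQ:   64KB * 1000 (shared ring)  = 64MB of receive buffers
	       ~1KB * 10000 connections   = ~10MB of per-connection state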

> but it will occur much sooner in the NOSRQ case.
>
> >>Hence, I do not see these as impediments to the merge.
> >
> >In my humble opinion, we need a handle on the scalability issue
> >(other than crashing or denying service) before merging this,
> >otherwise IBM will be the first to object to making connected mode the 
> >default.
> 
> I will seek the opinion of folks who use applications on large
> clusters within IBM. I have always stated that NOSRQ should be used
> only when there are a handful or at most a few dozen nodes. I will
> try to make this well known so that this does not come as a surprise.

One of my targets is to make connected mode the default, eventually.
My concern is that if enabling connected mode breaks applications,
as your patch does, people will be afraid to turn it on.

-- 
MST


