[ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects

Sun Jul 1 12:05:16 PDT 2007

On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
> > SSQ is needed for scalability, no need to explain this (by 
> > the way, RD is needed for the same reason too. What is Mellanox's 
> > plan to support it?
> 
> RD is not supported in hardware today. Implementing RD is extremely 
> complicated. To solve the scalability issues of MPI-like applications
> we believe that SRC and SSQ are the right solutions. They are much
> simpler to implement in both software and hardware. By MPI-like I refer
> to applications that have some level of trust between two processes of
> the same application. RD also has some performance issues as it only
> supports one message in the air. Those performance issues are solved
> by design in SRC/SSQ.
> 
I didn't know about this RD limitation. Is it a shortcoming of the IB spec
or a general limitation of reliable datagrams? RD looks much nicer to me
than SRC/SSQ.

> > It is part of the spec after all, so why invent new shiny 
> > stuff when it is still possible to achieve better scalability 
> > without it).
> 
> It's truly about complexity. And as I mentioned in the OFA meeting at
> Sonoma, Mellanox is willing to contribute SRC/SSQ to the IB spec as
> well.
> 
> > We are discussing your implementation proposal and in my 
> > opinion it doesn't fit application needs. I may be wrong 
> > here, so if there is somebody who thinks that sending random 
> > completions to random processes is the best idea ever and that the 
> > absence of this "feature" is the only thing that stops him 
> > from adopting IB, he may chime in here and voice his opinion.
> 
> Your input about how to demultiplex send completions on SSQ is 
> valuable. Unfortunately it is not supported in the current generation.
> What I can suggest here, though not new on this thread, is:
> 1) all pollers see the same CQ; only the poller that sees a completion
>       that belongs to it takes it out of the CQ
The progress of one process then depends on all the other processes on the
same node. Not good at all.
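
For illustration, the dependency in option 1 can be modelled with a toy
queue. This is a minimal sketch in Python, not libibverbs code: SharedCQ,
push and poll are hypothetical names, and a real CQ would carry full work
completions rather than (owner, wr_id) pairs.

```python
from collections import deque


class SharedCQ:
    """Toy model of one completion queue visible to every process (option 1)."""

    def __init__(self):
        self.entries = deque()  # FIFO of (owner_pid, wr_id) pairs

    def push(self, owner, wr_id):
        """Simulate the HW reporting a completion for `owner`."""
        self.entries.append((owner, wr_id))

    def poll(self, me):
        """Return wr_id if the head completion belongs to `me`, else None."""
        if self.entries and self.entries[0][0] == me:
            return self.entries.popleft()[1]
        return None  # empty, or stuck behind another process's completion


cq = SharedCQ()
cq.push(owner=1, wr_id=100)  # process 1's completion arrives first
cq.push(owner=2, wr_id=200)

blocked = cq.poll(me=2)  # nothing yet: process 1 has not polled
first = cq.poll(me=1)    # process 1 takes its own completion
second = cq.poll(me=2)   # only now can process 2 make progress
```

Process 2's completion is already in the queue, but it cannot be reaped
until process 1 gets around to polling.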

> 2) only one process polls the CQ; if a completion doesn't belong to the
>       poller, the poller will put it in a SW queue for the right process.
>       The other processes just poll on the SW queue
Not good, for the same reason.

As a variant, each process can poll both the HW CQ and its own SW CQ; if a
completion from the HW CQ belongs to another process, it puts it on the
appropriate SW CQ. I don't think a reasonable API would require such effort
from applications (and I am not even talking about all the locking overhead
and cache bouncing that will result from such an implementation, but latency
will be bad, that's for sure).
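
The variant above can be sketched as follows. This is a single-threaded toy
model in Python under stated assumptions: DemuxCQ, push_hw and poll are
hypothetical names, and the sketch omits the cross-process locking and
shared memory that a real implementation would need, which is exactly where
the overhead mentioned above would come from.

```python
from collections import deque


class DemuxCQ:
    """Toy model: one HW CQ plus one SW queue per process."""

    def __init__(self, pids):
        self.hw = deque()                         # FIFO of (owner_pid, wr_id)
        self.sw = {pid: deque() for pid in pids}  # per-process SW queues
        self.forwarded = 0                        # completions moved for others

    def push_hw(self, owner, wr_id):
        """Simulate the HW reporting a completion for `owner`."""
        self.hw.append((owner, wr_id))

    def poll(self, me):
        """Return the next wr_id for `me`, forwarding foreign completions."""
        # Completions forwarded earlier by other pollers are older, so they
        # are drained first to keep per-process ordering.
        if self.sw[me]:
            return self.sw[me].popleft()
        while self.hw:
            owner, wr_id = self.hw.popleft()
            if owner == me:
                return wr_id
            self.sw[owner].append(wr_id)  # a real version needs a lock here
            self.forwarded += 1
        return None


d = DemuxCQ([1, 2])
d.push_hw(owner=1, wr_id=100)
d.push_hw(owner=2, wr_id=200)
d.push_hw(owner=1, wr_id=101)

got2 = d.poll(me=2)   # forwards wr_id 100 to process 1, then returns its own
got1a = d.poll(me=1)  # served from process 1's SW queue
got1b = d.poll(me=1)  # served from the HW CQ
```

Every poller ends up touching both the HW CQ and other processes' SW queues,
so each poll is a potential point of contention.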

> 3) the SQ will have a "completed WQE index" reported. Everybody can
>      look at it and determine how many WQEs completed. This one has
>      some cons because the CQ is not shared here... need to bake this 
>      one more.
And where will the application get the WC? Or should it maintain its own
queue of WQEs?
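
One possible reading of option 3 is that each process does keep its own
queue of posted WQEs and compares it against the reported index. A minimal
sketch in Python, assuming WQEs on the shared SQ complete in order and that
the index is a simple count; SharedSQ, Process, post_send and reap are
hypothetical names, and the sketch carries no status or error information,
which is the missing-WC problem raised above.

```python
from collections import deque


class SharedSQ:
    """Toy shared send queue exposing only a completed-WQE count."""

    def __init__(self):
        self.next_index = 0  # index given to the next posted WQE
        self.completed = 0   # assumed "completed WQE index": WQEs done so far

    def post(self):
        idx = self.next_index
        self.next_index += 1
        return idx

    def hw_complete(self, n=1):
        """Simulate the HW completing the next n WQEs, in order."""
        self.completed = min(self.completed + n, self.next_index)


class Process:
    """Each process remembers which SQ indices it posted."""

    def __init__(self, sq):
        self.sq = sq
        self.pending = deque()  # FIFO of (sq_index, wr_id) for own WQEs

    def post_send(self, wr_id):
        self.pending.append((self.sq.post(), wr_id))

    def reap(self):
        """Return wr_ids of own WQEs whose index is below the completed count."""
        done = []
        while self.pending and self.pending[0][0] < self.sq.completed:
            done.append(self.pending.popleft()[1])
        return done


sq = SharedSQ()
a, b = Process(sq), Process(sq)
a.post_send(10)  # SQ index 0
b.post_send(20)  # SQ index 1
a.post_send(11)  # SQ index 2

sq.hw_complete(2)    # indices 0 and 1 are done
a_done = a.reap()    # process A sees only its first WQE complete
b_done = b.reap()
sq.hw_complete(1)
a_done_later = a.reap()
```

No process blocks another here, but the bookkeeping moves into every
application.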

> If we wrap one of these into the right API, once there is HW available
> that can do the SSQ CQ demultiplexing, it can work without any API
> change.
> 
That is something I don't see in the proposed API.

> > 
> > Looking at Dror's slides, on slide 6 "Scalable Reliable 
> > Connection" I see that the wire protocol is extended to send the DST 
> > SRQ as part of a header.
> > The receiver side then puts the completion on the appropriate CQ 
> > according to this field. Does your proposal address this? How? 
> 
> SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
> unfortunately.
Is it possible to add this with only a FW upgrade?

> But I think that with the right API we can abstract this, and later on
> have better performance for it.
> 
> > Who will put this additional data on the wire (HW, libibverbs 
> > or maybe the app)? Also, I don't see this in Dror's slides, but 
> > the completion of a local operation should be demultiplexed to the 
> > appropriate CQ too. The WQE may contain an additional field, for 
> > instance, that tells where to put the completion. Once 
> > again, who will do the demux in your proposal (HW, libibverbs 
> > or the app)? The right answer is most certainly HW in both cases, 
> > so will Hermon support this?
> > Or maybe you want to demultiplex everything inside 
> > libibverbs? In this case I want to see the design of this 
> > (preferably with performance analysis).
> 
> One thing to mention. The way I see it is according to the order of the
> slides. First get SRC going, improve the scalability. Then SSQ can be
> added to further improve scalability. In other words I am suggesting
> that maybe we can worry about the SSQ deficiencies a bit later :)
> 
That is my point! Let's do it once, let's do it right, and let's do it when
the HW is ready :)

--
			Gleb.