[ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects

Gleb Natapov glebn at voltaire.com
Tue Jun 26 06:33:17 PDT 2007


On Tue, Jun 26, 2007 at 03:58:02PM +0300, Michael S. Tsirkin wrote:
> > > No, sharing a send queue must be done in software.  I don't really see the reason
> > > for sarcasm: do you see value in sharing resources between multiple threads?
> > > Why not multiple processes? Some people just don't want to program
> > > in multithreaded environment.
> >
> > Yes I see the value in sharing resources between threads and processes
> > if done right. This proposition is far from being right.
> 
> Ahem, *what* are you talking about? Sharing resources between threads was supported in
> libibverbs 1.0, *right from the start*. This is still the case with 1.1, and this API
> matches verbs quite closely which means that it can work pretty much on any
> hardware.
Why you think that I have a problem with multithreaded applications is
beyond my understanding. I have a problem with you thinking that picking a
completion by a random process in FCFS order is a good idea. It has limited
use for specially designed applications. MPI is not one of them.
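To make the objection concrete, here is a small simulation (plain Python, not verbs code; the names `poll_loop` and `misrouted` are made up for this sketch) of several processes polling one shared FCFS completion queue: whichever process polls first gets the next completion regardless of who posted the work, so completions routinely land in the wrong rank and would have to be forwarded to their owner.

```python
# Illustrative simulation only -- not libibverbs code.
# Several "ranks" share one FCFS completion queue; whichever process
# polls first receives the next CQE, no matter who posted the work.
import multiprocessing as mp

ctx = mp.get_context("fork")          # Unix fork, so no __main__ guard needed

def poll_loop(rank, cq, misrouted):
    while True:
        owner = cq.get()              # FCFS: the first poller wins
        if owner is None:             # stop marker: quit polling
            break
        if owner != rank:             # this completion belongs to someone else
            with misrouted.get_lock():
                misrouted.value += 1  # would need forwarding to its owner

nranks = 4
cq = ctx.Queue()
misrouted = ctx.Value("i", 0)
for i in range(8 * nranks):           # every rank posted 8 sends
    cq.put(i % nranks)                # each CQE is tagged with the posting rank
for _ in range(nranks):
    cq.put(None)                      # one stop marker per rank
procs = [ctx.Process(target=poll_loop, args=(r, cq, misrouted))
         for r in range(nranks)]
for p in procs: p.start()
for p in procs: p.join()
print("completions picked up by the wrong rank:", misrouted.value)
```

How many completions are misrouted depends on scheduling, which is exactly the point: the application cannot control who receives what.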

> 
> You want to propose some enhancements, go ahead (and open a new thread for this).
> All *I* want to do is support sharing resources in singlethreaded environment.
> 
You asked for an RFC? Don't ask next time if you don't want to hear any comments.

> > There is not sarcasm in my sentence either. You can't claim that what you
> > propose is as seamless as it should be.
> 
> I think it's as seamless as it *can* be.
If it can't be done better, it is not worth implementing. That is my opinion. I can't stop
you from doing it :)

> 
> > I have no problem with sharing send queue. What I want to be able to do
> > is to attach CQ from each process to a shared QP. When send posted by
> > process A completes the completion is posted into A's CQ. HW should be
> > able to multiplex this IMO. 
> 
> Well, since there is no hardware that does this, why bother discussing this?
Because Mellanox is a hardware company, so make the improvement in the right
place instead of adding cruft to the library just to claim that you are super
scalable. If it can't be implemented in HW, then can you please explain why?

> 
> > > > > > If multiple processes what to post to the same QP how will you
> > > > > > ensure that right process will receive right completion event?
> > > > > 
> > > > > Same as with threads - memory for CQEs and locks will be allocated
> > > > > in shared memory to make it possible for multiple processes to poll
> > > > > CQ simultaneously, and they get completions in FCFS order.
> > > > > What to do with them is up to the user.
> > > >
> > > > Are you going to use this API? How? There is no point in discussing user
> > > > API without specifying HOW user will be using it. You have to ask what
> > > > user want and design your API accordingly and not other way around.
> > > > So suppose I want to use proposed API to implement super scalable MPI.
> > > 
> > > We'd come up with MPI_Send implementation inside libibverbs:). Think layered - I'd
> > > like to make a minimal possible API change to make scalability improvements
> > > possible.
> > 
> > They are not really possible with proposed API (beyond academic papers that is).
> 
> I'm talking to MPI guys here, too, so I don't think there's real danger
> that the final API will be useless for them.
So let them talk and specify here how they are going to use it, and then we will
have a good use case for your design.

> 
> > You are
> > welcome to implement MPI_Send inside libibverbs. After all this is what Myricom did.
> 
> I think keeping a general verbs layer is a better approach for now.
Then don't propose something you are not going to implement.

> 
> > > 
> > > > I setup shared QP/CQ/... and each rank start to post into the QP and
> > > > receive completion from CQ and suppose rank A picked completion that
> > > > belongs to rank B so I will need to setup out of band channel to pass
> > > > this completion from A to B. This is not looks good at all to me.
> > > 
> > > This is not different from multiple threads sharing a CQ, really - and we do
> > This is very different from  multiple threads sharing a CQ. In
> > multi threaded  scenario I can design my program in a way that each
> > thread will be able to handle completion. We'll have to pass 
> > completion between processes in the scenario you propose.
> > 
> > > support this already.  In the part of the message that you have cut out, I
> > > showed some use cases that avoid this "side channel"
> >
> > What? RDMA?
> 
> RDMA and SRC.
> 
> > What about a completion of RDMA operation? You'll have to
> > pass it around.
> 
> Since all it does is free up the buffers, it's quite possible
> that processing of send completions can be done by any process.
No it can't, in the case of MPI. MPI also progresses user requests on the event.
Yes, you can design a program where that is possible, but not MPI.

> This really depends on how the application wants to do this:
> again, you seem to ignore the fact that the issue is the same for
> multithreaded programs, and they seem to cope fine.
No, you seem to ignore the fact that a multithreaded program is something
_completely_ different. In a multithreaded program _all_ state is shared
between threads. In a multiprocess scenario only the state you place into
shared memory is shared. This difference is very important.
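The distinction is easy to demonstrate (illustrative Python, not verbs code): a sibling thread's update to ordinary memory is visible everywhere, while a forked child's update is visible only if the memory was explicitly shared.

```python
# Illustrative only: threads share all state implicitly; processes
# share only what is explicitly placed in shared memory.
import threading, multiprocessing as mp

ctx = mp.get_context("fork")     # Unix: the child inherits a copy of our memory
plain = {"n": 0}                 # ordinary, per-process memory
shared = ctx.Value("i", 0)       # explicitly shared memory

def bump():
    plain["n"] += 1
    with shared.get_lock():
        shared.value += 1

# A sibling thread updates both; the main thread sees both updates.
t = threading.Thread(target=bump)
t.start(); t.join()
assert plain["n"] == 1 and shared.value == 1

# A child process updates both; the parent sees only the shared one.
p = ctx.Process(target=bump)
p.start(); p.join()
assert plain["n"] == 1           # the child's write to plain memory is invisible here
assert shared.value == 2         # the write to shared memory is visible
print("ok")
```

Anything a shared-resource design wants both processes to see (CQE ring, locks, bookkeeping) must live on the `shared` side of this divide.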

> 
> > I agree that RDMA situation is much better then
> > send/receive one, but there is no RDMAs without send/recv after it.
> 
> Not really - polling on data has been used in MPI for ages now.
You are greatly misinformed. Polling on data is used only for a limited
number of peers, for sending small messages; it works only on Mellanox HCAs
on _some_ archs, and it scales very badly in both memory consumption and
polling time. Go ask your MPI team.
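For reference, the "polling on data" technique in question works roughly like this (a sketch with made-up names, not real MPI or verbs code): the sender RDMA-writes a message whose last byte is a flag, and the receiver keeps one dedicated buffer per peer and spins on each flag byte, which is why memory use and time per progress call both grow linearly with the number of peers.

```python
# Sketch of "polling on data": one dedicated receive buffer per peer;
# the receiver spins on a flag byte at the end of each buffer.  Memory
# and time per progress call both grow with the number of peers.
import array

NPEERS, BUFSZ = 4, 64
bufs = [array.array("b", [0] * BUFSZ) for _ in range(NPEERS)]  # one per peer

def rdma_write(peer, payload):
    # Stand-in for an RDMA write: payload first, flag byte last.
    bufs[peer][:len(payload)] = array.array("b", payload)
    bufs[peer][-1] = 1                      # flag must be written last

def poll_all():
    # The receiver must scan every peer's flag on every progress call.
    return [p for p in range(NPEERS) if bufs[p][-1] == 1]

rdma_write(2, b"hi")
print(poll_all())
```

With thousands of peers this means thousands of pinned buffers and a scan over all of them per poll, which is the scalability complaint above.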

> With SRC you can have separate completions on the receive side.
> 
> > > (which could be just shared memory btw).
> >
> > And you introduce another scalability problem here. On a big SMP node
> > will have to create channel between each pair of processes to pass
> > completions and will have to poll each one of them besides polling CQ.
> > Here goes you latency. And I am not saying this is not possible, I am
> > saying it is so bad that it is not worth doing.
> 
> No, you got that wrong: there need not be any real "channels" with shared
> memory: just a single data structure shared by all processes would do.
> But again, you are getting into MPI design, which is the wrong layer to discuss here.
> 
I am talking about the only application this is meant to be used by (in
the short term, anyway). So if the design is bad for MPI, it is bad. As for
the "channels": you either create one between each pair of ranks or you use
locking. Both solutions kill latency.
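To put a number on the pairwise-channel alternative (a back-of-the-envelope sketch, not measured data; `pair_channels` is a name made up here): with n ranks on a node you need n(n-1)/2 bidirectional channels, and each rank must poll its n-1 incoming ones in addition to the CQ itself.

```python
# Back-of-the-envelope: pairwise completion-forwarding channels per node.
def pair_channels(n_ranks):
    """Bidirectional channels needed so any rank can hand a
    misrouted completion directly to its owner."""
    return n_ranks * (n_ranks - 1) // 2

for n in (8, 16, 64):
    # Each rank also has n-1 inboxes to poll besides the CQ itself.
    print(n, "ranks:", pair_channels(n), "channels,",
          n - 1, "extra polls per rank per progress loop")
```

At 64 ranks on a big SMP node that is over two thousand channels and 63 extra polls on every progress call, which is where the latency goes.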

--
			Gleb.


