[ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects

Michael S. Tsirkin mst at dev.mellanox.co.il
Tue Jun 26 05:58:02 PDT 2007


> > No, sharing a send queue must be done in software.  I don't really see the reason
> > for sarcasm: do you see value in sharing resources between multiple threads?
> > Why not multiple processes? Some people just don't want to program
> > in multithreaded environment.
>
> Yes I see the value in sharing resources between threads and processes
> if done right. This proposition is far from being right.

Ahem, *what* are you talking about? Sharing resources between threads was supported in
libibverbs 1.0, *right from the start*. This is still the case with 1.1, and this API
matches verbs quite closely which means that it can work pretty much on any
hardware.

If you want to propose some enhancements, go ahead (and open a new thread for this).
All *I* want to do is support sharing resources in a single-threaded environment.

> There is no sarcasm in my sentence either. You can't claim that what you
> propose is as seamless as it should be.

I think it's as seamless as it *can* be.

> I have no problem with sharing send queue. What I want to be able to do
> is to attach CQ from each process to a shared QP. When send posted by
> process A completes the completion is posted into A's CQ. HW should be
> able to multiplex this IMO. 

Well, since there is no hardware that does this, why bother discussing this?

> > > > > If multiple processes want to post to the same QP how will you
> > > > > ensure that right process will receive right completion event?
> > > > 
> > > > Same as with threads - memory for CQEs and locks will be allocated
> > > > in shared memory to make it possible for multiple processes to poll
> > > > CQ simultaneously, and they get completions in FCFS order.
> > > > What to do with them is up to the user.
> > >
> > > Are you going to use this API? How? There is no point in discussing user
> > > API without specifying HOW user will be using it. You have to ask what
> > > user want and design your API accordingly and not other way around.
> > > So suppose I want to use proposed API to implement super scalable MPI.
> > 
> > We'd come up with MPI_Send implementation inside libibverbs:). Think layered - I'd
> > like to make a minimal possible API change to make scalability improvements
> > possible.
> 
> They are not really possible with proposed API (beyond academic papers that is).

I'm talking to MPI guys here, too, so I don't think there's real danger
that the final API will be useless for them.

> You are
> welcome to implement MPI_Send inside libibverbs. After all this is what Myricom did.

I think keeping a general verbs layer is a better approach for now.

> > 
> > > I setup shared QP/CQ/... and each rank start to post into the QP and
> > > receive completion from CQ and suppose rank A picked completion that
> > > belongs to rank B so I will need to setup out of band channel to pass
> > > this completion from A to B. This does not look good at all to me.
> > 
> > This is not different from multiple threads sharing a CQ, really - and we do
> This is very different from  multiple threads sharing a CQ. In
> multi threaded  scenario I can design my program in a way that each
> thread will be able to handle completion. We'll have to pass 
> completion between processes in the scenario you propose.
> 
> > support this already.  In the part of the message that you have cut out, I
> > showed some use cases that avoid this "side channel"
>
> What? RDMA?

RDMA and SRC.

> What about a completion of RDMA operation? You'll have to
> pass it around.

Since all it does is free up the buffers, it's quite possible
that processing of send completions can be done by any process.
This really depends on how the application wants to do this:
again, you seem to ignore the fact that the issue is the same for
multithreaded programs, and they seem to cope fine.
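
A minimal sketch of the scheme described earlier in this thread (CQE memory
and a lock placed in shared memory, so that several processes poll one CQ and
get completions in FCFS order). This is not libibverbs code: the names
shared_cq, shared_cq_push, shared_cq_poll and shared_cq_demo are hypothetical,
shared_cq_push stands in for the hardware writing a CQE, and the spinlock is
an assumption chosen only to keep the example self-contained.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define CQ_DEPTH 64               /* hypothetical ring size */

struct cqe { uint64_t wr_id; };   /* stand-in for a real CQE */

struct shared_cq {
    atomic_flag lock;             /* spinlock living in shared memory */
    unsigned head;                /* next entry to poll */
    unsigned tail;                /* next free slot */
    struct cqe ring[CQ_DEPTH];
};

/* Allocate the CQ in memory that stays shared across fork(). */
static struct shared_cq *shared_cq_create(void)
{
    struct shared_cq *cq = mmap(NULL, sizeof(*cq), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (cq == MAP_FAILED)
        return NULL;
    atomic_flag_clear(&cq->lock);
    cq->head = cq->tail = 0;
    return cq;
}

static void cq_lock(struct shared_cq *cq)
{
    while (atomic_flag_test_and_set_explicit(&cq->lock, memory_order_acquire))
        ;                         /* spin until the flag was clear */
}

static void cq_unlock(struct shared_cq *cq)
{
    atomic_flag_clear_explicit(&cq->lock, memory_order_release);
}

/* Stand-in for the hardware writing a CQE. Returns 0, or -1 if full. */
static int shared_cq_push(struct shared_cq *cq, uint64_t wr_id)
{
    int ret = -1;
    cq_lock(cq);
    if (cq->tail - cq->head < CQ_DEPTH) {
        cq->ring[cq->tail++ % CQ_DEPTH].wr_id = wr_id;
        ret = 0;
    }
    cq_unlock(cq);
    return ret;
}

/* Returns 1 and fills *out with the oldest CQE, or 0 if the CQ is empty. */
static int shared_cq_poll(struct shared_cq *cq, struct cqe *out)
{
    int got = 0;
    cq_lock(cq);
    if (cq->head != cq->tail) {
        *out = cq->ring[cq->head++ % CQ_DEPTH];
        got = 1;
    }
    cq_unlock(cq);
    return got;
}

/* Parent and child drain the same CQ; returns the total polled, or -1. */
int shared_cq_demo(int n)
{
    struct shared_cq *cq = shared_cq_create();
    struct cqe e;
    int mine = 0, status = 0;
    pid_t pid;

    assert(n <= CQ_DEPTH);
    if (!cq)
        return -1;
    for (int i = 0; i < n; i++)
        if (shared_cq_push(cq, (uint64_t)i))
            return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {               /* child: poll until empty, report count */
        while (shared_cq_poll(cq, &e))
            mine++;
        _exit(mine);
    }
    while (shared_cq_poll(cq, &e))
        mine++;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    munmap(cq, sizeof(*cq));
    return mine + WEXITSTATUS(status);
}
```

Each CQE is taken by exactly one process, whichever polls first; how the two
processes split the entries varies from run to run. A process that dies while
holding the spinlock would block the others, which a real implementation
would have to handle.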

> I agree that RDMA situation is much better than
> send/receive one, but there is no RDMAs without send/recv after it.

Not really - polling on data has been used in MPI for ages now.
With SRC you can have separate completions on the receive side.
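
A minimal sketch of "polling on data", assuming the sender writes a flag byte
after the payload and the receiver spins on that flag instead of waiting for
a receive completion. The remote RDMA write is simulated here by a child
process writing into shared memory; rdma_buf, wait_for_data and
poll_on_data_demo are hypothetical names, and whether a given adapter
guarantees that the flag lands last is hardware-dependent and not shown here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define PAYLOAD_LEN 32            /* hypothetical message size */

struct rdma_buf {
    char payload[PAYLOAD_LEN];
    atomic_uchar ready;           /* written last by the sender */
};

/* Receiver side: spin on the flag; once set, the payload is valid. */
static void wait_for_data(struct rdma_buf *buf)
{
    while (!atomic_load_explicit(&buf->ready, memory_order_acquire))
        ;
}

/* Returns 1 if the receiver saw the full payload, 0 if not, -1 on error. */
int poll_on_data_demo(void)
{
    struct rdma_buf *buf = mmap(NULL, sizeof(*buf), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pid_t pid;
    int ok;

    if (buf == MAP_FAILED)
        return -1;
    atomic_init(&buf->ready, 0);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Stand-in for the remote RDMA write: payload first, flag last. */
        memcpy(buf->payload, "hello", 6);
        atomic_store_explicit(&buf->ready, 1, memory_order_release);
        _exit(0);
    }

    wait_for_data(buf);           /* no CQ is polled on this side */
    ok = strcmp(buf->payload, "hello") == 0;
    waitpid(pid, NULL, 0);
    munmap(buf, sizeof(*buf));
    return ok;
}
```

The receiver never touches a CQ in this sketch, so no completion has to be
handed from one process to another on the receive side.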

> > (which could be just shared memory btw).
>
> And you introduce another scalability problem here. On a big SMP node
> you will have to create a channel between each pair of processes to pass
> completions and will have to poll each one of them besides polling the CQ.
> There goes your latency. And I am not saying this is not possible, I am
> saying it is so bad that it is not worth doing.

No, you got that wrong: there need not be any real "channels" with shared
memory: just a single data structure shared by all processes would do.
But again, you are getting into MPI design, which is the wrong layer to discuss here.
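
One possible shape for such a single shared structure, as a sketch only: a
table with one counter per rank, placed in shared memory. Whoever polls a
completion that belongs to another rank bumps that rank's counter, and each
rank checks only its own slot, so there is no per-pair channel to poll. The
layout and the names completion_board, board_post, board_take and board_demo
are assumptions made for illustration, not something proposed in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_RANKS 16              /* hypothetical number of local ranks */

struct completion_board {
    atomic_uint pending[MAX_RANKS];   /* completions waiting per rank */
};

/* Allocate the board in memory that stays shared across fork(). */
static struct completion_board *board_create(void)
{
    struct completion_board *b = mmap(NULL, sizeof(*b),
                                      PROT_READ | PROT_WRITE,
                                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (b == MAP_FAILED)
        return NULL;
    for (int i = 0; i < MAX_RANKS; i++)
        atomic_init(&b->pending[i], 0);
    return b;
}

/* Called by whichever process polled a completion owned by `owner`. */
static void board_post(struct completion_board *b, int owner)
{
    atomic_fetch_add_explicit(&b->pending[owner], 1, memory_order_release);
}

/* Called by `owner`: returns and clears its count of pending completions. */
static unsigned board_take(struct completion_board *b, int owner)
{
    return atomic_exchange_explicit(&b->pending[owner], 0,
                                    memory_order_acquire);
}

/* The child forwards three completions to rank 1; the parent collects them. */
int board_demo(void)
{
    struct completion_board *b = board_create();
    pid_t pid;
    int got;

    if (!b)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (int i = 0; i < 3; i++)
            board_post(b, 1);
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) != pid)
        return -1;
    got = (int)board_take(b, 1);
    assert(board_take(b, 1) == 0);    /* the slot is empty after a take */
    munmap(b, sizeof(*b));
    return got;
}
```

A rank reads a single slot no matter how many processes share the node. A
real design would carry more than a count (for example the work request IDs),
which is left out here.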

-- 
MST


