[ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects

Dror Goldenberg gdror at dev.mellanox.co.il
Mon Jul 2 04:00:56 PDT 2007


Gleb Natapov wrote:
> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
>   
>>> SSQ is needed for scalability, no need to explain this (by 
>>> the way, RD is needed for the same reason too. What's 
>>> Mellanox's plan to support it?)
>>>       
>> RD is not supported in hardware today. Implementing RD is extremely 
>> complicated. To solve the scalability issues of MPI-like applications
>> we believe that SRC and SSQ are the right solutions. They are much
>> simpler to implement in both software and hardware. By MPI-like I
>> refer to applications that have some level of trust between two
>> processes of the same application. RD also has some performance
>> issues, as it only supports one message in flight. Those performance
>> issues are solved by design in SRC/SSQ.
>>
>>     
> Didn't know about the RD limitation. Is this a shortcoming of the IB spec
> or a general limitation of reliable datagram? RD looks much nicer to me
> than SRC/SSQ.
>   

The RD limitation is part of the IB spec.

>   
>>> It is a part of the spec after all, so why invent shiny 
>>> new stuff when it is still possible to achieve better 
>>> scalability without it).
>>>       
>> It's truly about complexity. And as I mentioned in OFA meeting at
>> Sonoma, 
>> Mellanox is willing to contribute SRC/SSQ to the IB spec as well.
>>
>>     
>>> We are discussing your implementation proposal, and in my 
>>> opinion it doesn't fit application needs. I may be wrong 
>>> here, so if there is somebody who thinks that sending random 
>>> completions to random processes is the best idea ever and 
>>> that the absence of this "feature" is the only thing that stops 
>>> him from IB adoption, he may chime in here and voice his opinion.
>>>       
>> Your input about how to demultiplex send completions on an SSQ is 
>> valuable. Unfortunately it is not supported in the current generation.
>> What I can suggest here is not new on this thread, but:
>> 1) all pollers see the same CQ; only the poller that a completion
>>       belongs to takes it out of the CQ
>>     
> Progress of one process depends on all other processes on the same node.
> Not good at all.
>   
In MPI, it often happens that all processes depend on each other 
to make forward progress, one way or another. I am not saying that 
this is the ideal solution, but there is some price involved in sharing 
resources. You can always upgrade resources for a process that utilizes 
them; e.g. if the communication pattern is that each process talks to 4 
neighbors, then let it have dedicated, unshared QPs.
>   
>> 2) only one process polls the CQ; if a completion doesn't belong to
>>       the poller, the poller puts it in a SW queue for the right 
>>       process. The other processes just poll their SW queues
>>     
> Not good, for the same reason.
>
> As a variant, each process can poll the HW CQ and its SW CQ; if a
> completion from the HW CQ belongs to another process, put it on the
> appropriate SW CQ. I don't think a reasonable API should require such
> effort from applications (and I am not even talking about all the
> locking overhead and cache bouncing that would result from such an
> implementation, but latency will be bad, that's for sure).
>   
I don't think that polling for SQ completions is in the latency path. 
You usually need it in order to free networking buffers. In any case I 
understand your point.
>   
>> 3) the SQ will have a "completed WQE index" reported. Everybody can
>>      look at it and determine how many WQEs have completed. This one
>>      has some cons because the CQ is not shared here... need to bake
>>      this one some more.
>>     
> And where will the application get the WC? Or should it maintain its own
> queue of WQEs?
>   
In this method, each app should have its own queue.
>   
>> If we wrap one of these into the right API, then once there is HW 
>> available that can do the SSQ CQ demultiplexing, it can work without 
>> any API change. 
>>
>>     
> That is something I don't see in proposed API.
>
>   
>>> Looking at Dror's slides, on slide 6 "Scalable Reliable 
>>> Connection" I see that the wire protocol is extended to send the 
>>> DST SRQ as part of the header.
>>> The receiver side then puts the completion on the appropriate CQ 
>>> according to this field. Does your proposal address this? How? 
>>>       
>> SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
>> unfortunately.
>>     
> Is it possible to add this with only a FW upgrade?
>   
Unfortunately no.
>   
>> But I think that with the right API we can abstract this, and later on
>> have better performance for it.
>>
>>     
>>> Who will put this additional data on the wire (HW, libibverbs, 
>>> or maybe the app)? Also, I don't see this in Dror's slides, but 
>>> completions of local operations should be demultiplexed to the 
>>> appropriate CQ too. A WQE may contain an additional field, for 
>>> instance, that tells where to put the completion. Once 
>>> again, who will do the demux in your proposal (HW, libibverbs, 
>>> or the app)? The right answer is most certainly HW in both cases, 
>>> so will Hermon support this?
>>> Or maybe you want to demultiplex everything inside 
>>> libibverbs? In this case I want to see a design for this 
>>> (preferably with a performance analysis).
>>>       
>> One thing to mention: the way I see it is according to the order of the
>> slides. First get SRC going to improve scalability. Then SSQ can be
>> added to further improve scalability. In other words, I am suggesting
>> that maybe we can worry about the SSQ deficiencies a bit later :)
>>
>>     
> That is my point! Let's do it once, let's do it right, and let's do it
> when the HW is ready :)
>   
SRC is ready in HW; it can be implemented in SW now and will 
significantly help scalability.
We can resume the discussion of SSQ or other alternatives later on...
> --
> 			Gleb.



