[ofa-general] Re: Re: [PATCH RFC] sharing userspace IB objects
Dror Goldenberg
gdror at dev.mellanox.co.il
Mon Jul 2 04:00:56 PDT 2007
Gleb Natapov wrote:
> On Sun, Jul 01, 2007 at 07:27:24PM +0300, Dror Goldenberg wrote:
>
>>> SSQ is needed for scalability, no need to explain this (by
>>> the way RD is needed for the same reason too. What's Mellanox
>>> plan to support it?
>>>
>> RD is not supported in hardware today. Implementing RD is extremely
>> complicated. To solve the scalability issues of MPI-like applications,
>> we believe that SRC and SSQ are the right solutions. They are much
>> simpler to implement in both software and hardware. By MPI-like I mean
>> applications that have some level of trust between two processes of
>> the same application. RD also has some performance issues, as it only
>> supports one message in the air at a time. Those performance issues are
>> solved by design in SRC/SSQ.
>>
>>
> Didn't know about the RD limitation. Is this a shortcoming of the IB spec or a
> general limitation of reliable datagram? RD looks much nicer to me than SRC/SSQ.
>
The RD limitation is part of the IB spec.
>
>>> It is a part of the spec after all, so why invent new shiny
>>> stuff when it is still possible to achieve better scalability
>>> without it).
>>>
>> It's truly about complexity. And as I mentioned at the OFA meeting in
>> Sonoma, Mellanox is willing to contribute SRC/SSQ to the IB spec as well.
>>
>>
>>> We are discussing your implementation proposal and in my
>>> opinion it doesn't fit application needs. I may be wrong
>>> here, so if there is somebody who thinks that sending random
>>> completions to random processes is the best idea ever and
>>> the absence of this "feature" is the only thing that stops him
>>> from adopting IB, he may chime in here and voice his opinion.
>>>
>> Your input about how to demultiplex send completions on SSQ is
>> valuable. Unfortunately it is not supported in the current generation.
>> What I can suggest here, though it is not new on this thread, is:
>> 1) all pollers see the same CQ; only the poller that sees a completion
>> that belongs to it takes it out of the CQ
>>
> Progress of one process depends on all other processes on the same node. Not
> good at all.
>
In MPI, it often happens that all processes depend on each other
to make forward progress, one way or another. I am not saying that
this is the ideal solution, but there is some price involved in sharing
resources. You can always upgrade resources for a process that utilizes
them, e.g. if the communication pattern is that each process talks with 4
neighbors, then let it have dedicated, unshared QPs.
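
A minimal sketch of option 1 may make the dependency concrete. It is a toy
model only: SharedCQ, push and poll are hypothetical names, not libibverbs
calls, and it assumes each completion carries an owner tag and that a poller
may only remove the entry at the head of the CQ.

```python
from collections import deque


class SharedCQ:
    """Toy stand-in for a CQ visible to every process (hypothetical, not a verbs API)."""

    def __init__(self):
        self._entries = deque()  # (owner, wr_id) in completion order

    def push(self, owner, wr_id):
        """Simulate the HW reporting a completion for `owner`."""
        self._entries.append((owner, wr_id))

    def poll(self, me):
        """Take the head completion only if it belongs to `me`; otherwise leave it."""
        if self._entries and self._entries[0][0] == me:
            return self._entries.popleft()[1]
        return None


cq = SharedCQ()
cq.push("A", 1)
cq.push("B", 2)

first_b = cq.poll("B")   # head belongs to A, so B gets nothing yet
got_a = cq.poll("A")     # A removes its own completion
second_b = cq.poll("B")  # now B's completion is at the head
```

In this model B cannot reap its completion until A has polled, which is the
price of sharing referred to above.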
>
>> 2) only one process polls the CQ; if a completion doesn't belong to the
>> poller, the poller will put it in a SW queue for the right process. The
>> other processes just poll on their SW queues
>>
> Not good for the same reason.
>
> As a variant, each process can poll both the HW CQ and its SW CQ; if a
> completion from the HW CQ belongs to another process, it puts it on the
> appropriate SW CQ. I don't think that a reasonable API will require such
> effort from applications (and I am not even talking about all the locking
> overhead and cache bouncing that will result from such an implementation,
> but latency will be bad, that's for sure).
>
I don't think that polling for SQ completions is in the latency path.
You usually need it in order to free networking buffers. In any case I
understand your point.
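
For reference, option 2 could look roughly like the sketch below. It is a
single-threaded toy model under stated assumptions: dispatch, hw_cq and
sw_queues are hypothetical names, each completion is assumed to carry an
owner tag, and the locking and shared memory that a real multi-process
version would need are left out.

```python
from collections import defaultdict, deque


def dispatch(hw_cq, sw_queues):
    """Drain the (simulated) HW CQ and route each completion to its owner's SW queue."""
    while hw_cq:
        owner, wr_id = hw_cq.popleft()
        sw_queues[owner].append(wr_id)


# Simulated HW CQ holding completions for two processes, A and B.
hw_cq = deque([("A", 1), ("B", 2), ("A", 3)])
sw_queues = defaultdict(deque)  # one SW queue per process, created on demand

dispatch(hw_cq, sw_queues)      # only the designated poller runs this
a_done = list(sw_queues["A"])   # what process A sees on its SW queue
b_done = list(sw_queues["B"])   # what process B sees on its SW queue
```

Every process other than the designated poller only ever touches its own SW
queue, so its progress depends on how often the poller runs.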
>
>> 3) the SQ will have a "completed WQE index" reported. Everybody can
>> look at it and determine how many WQEs completed. This one has
>> some cons because the CQ is not shared here... need to bake this
>> one more.
>>
> And where will the application get the WC? Or should it maintain its own queue
> of WQEs?
>
In this method, each app should have its own queue.
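
A rough sketch of what such a per-app queue could look like follows. The
names (LocalSendQueue, post, reap) are hypothetical, and it assumes that the
shared SQ completes WQEs in order, that the reported index means "every WQE
up to and including this one has completed", and that index wraparound can
be ignored.

```python
from collections import deque


class LocalSendQueue:
    """Per-process record of the WQEs this process posted on the shared SQ (hypothetical)."""

    def __init__(self):
        self._pending = deque()  # (wqe_index, wr_id) in posting order

    def post(self, wqe_index, wr_id):
        """Remember which slot of the shared SQ this work request occupies."""
        self._pending.append((wqe_index, wr_id))

    def reap(self, completed_index):
        """Return the wr_ids of our WQEs at or below the reported completed index."""
        done = []
        while self._pending and self._pending[0][0] <= completed_index:
            done.append(self._pending.popleft()[1])
        return done


a, b = LocalSendQueue(), LocalSendQueue()
a.post(0, "a0")
b.post(1, "b0")
a.post(2, "a1")

done_a = a.reap(1)   # shared index says WQEs 0..1 completed
done_b = b.reap(1)
later_a = a.reap(2)  # index advanced to 2
```

Each process builds its own completion information from the shared index, so
no WC has to be taken from a shared CQ.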
>
>> If we wrap one of these into the right API, once there is HW available
>> that can do the SSQ CQ demultiplexing, it can work without any API change.
>>
>>
> That is something I don't see in the proposed API.
>
>
>>> Looking at Dror's slides, on slide 6 "Scalable Reliable
>>> Connection" I see that the wire protocol is extended to send the DST
>>> SRQ as part of a header.
>>> The receiver side then puts the completion on the appropriate CQ
>>> according to this field. Does your proposal address this? How?
>>>
>> SRC indeed includes demultiplexing of the CQ. SSQ does not currently,
>> unfortunately.
>>
> Is it possible to add this with only a FW upgrade?
>
Unfortunately no.
>
>> But I think that with the right API we can abstract this, and later on
>> have better performance for it.
>>
>>
>>> Who will put this additional data on the wire (HW, libibverbs, or
>>> maybe the app)? Also I don't see this in Dror's slides, but
>>> completion of a local operation should be demultiplexed to the
>>> appropriate CQ too. A WQE may contain an additional field, for
>>> instance, that will tell where to put a completion. Once
>>> again, who will do the demux in your proposal (HW, libibverbs
>>> or app)? The right answer is most certainly HW in both cases,
>>> so will Hermon support this?
>>> Or maybe you want to demultiplex everything inside
>>> libibverbs? In this case I want to see the design of this
>>> (preferably with performance analysis).
>>>
>> One thing to mention. The way I see it is according to the order of the
>> slides. First get SRC going and improve the scalability. Then SSQ can be
>> added to further improve scalability. In other words I am suggesting
>> that maybe we can worry about the SSQ deficiencies a bit later :)
>>
>>
> That is my point! Let's do it once, let's do it right, and let's do it when HW
> is ready :)
>
SRC is ready in HW; it can be implemented in SW now and will
significantly help scalability.
We can resume the discussion of SSQ or other alternatives later on...
> --
> Gleb.