[openib-general] [PATCH] [CM] add private data comparison to match REQs with listens

Fri Dec 2 15:01:47 PST 2005

Rimmer, Todd wrote:
>> -----Original Message-----
>> From: Tillier, Fabian
>> Sent: Friday, December 02, 2005 5:21 PM
>> To: 'Caitlin Bestler'; 'Sean Hefty'
>> Cc: openib-general at openib.org
>> Subject: RE: [openib-general] [PATCH] [CM] add private data
>> comparison to match REQs with listens
>> 
>> 
>>> From: Caitlin Bestler [mailto:caitlinb at broadcom.com]
>>> Sent: Friday, December 02, 2005 12:13 PM
>>> 
>>> Sean Hefty wrote:
>>>> Fab Tillier wrote:
>>>>>> Just listen on the Service ID / Port and let the ULP sort them
>>>>>> out by destination IP address.
>>>>> 
>>>>> That only works if there is a single kernel module providing the
>>>>> extra checks. Multiple user-mode ULPs cannot do the checking in
>>>>> user-mode - the checking must be done in the kernel to figure out
>>>>> which user-mode client to hand the request to.
>>>>> 
>>>>> I think putting in restrictions to the comparisons possible is
>>>>> fine, as the functionality of having the CM facilitate some sort
>>>>> of filtering is useful.
>>>> 
>>>> My concern with pushing this to the ULP is that it requires the
>>>> ULP to track service IDs for reference counting purposes and adds
>>>> additional synchronization to the ULP that could have been handled
>>>> by the CM. 
>>>> 
>>>> I'm looking at what the full effect of implementing this in the ULP
>>>> would be.
>>> 
>>> I'm still missing something.
>>> 
>>> I don't see how filtering in the CM is of benefit in either case.
>>> The work either belongs in the Hypervisor or in the Daemon, not the
>>> CM. 
>> 
>> Your focus is strictly on TCP socket semantics, but we're talking
>> about IB CM functionality - the IB CM does more than just provide
>> TCP socket semantics. 
>> 
>> Imagine a user-mode IB application (not virtualization mind you, but
>> just an app) that wants to listen on a given SID (because the SID
>> defines the application), but wants to discriminate incoming
>> requests based on some content in the private data.  Multiple
>> instances of that application can only work properly if the CM
>> performs the private data comparison to properly dispatch the
>> incoming requests to the right user-mode process. 
>> 
>> If the CM doesn't provide the private data compare functionality,
>> then the app developer needs to create a kernel agent to perform this
>> functionality for the app.  The functionality is simple enough, and
>> has potential value to multiple clients, that it makes sense to have
>> the IB CM provide it. 
>> 
>> - Fab
> 
> I agree. To give you a good practical example, MPI needs to
> listen for incoming connections.
> 
> It is wasteful to have MPI create a separate SID for each rank
> (especially when there can be thousands of ranks across many jobs
> all running in the same cluster, some of them on the same
> node) and then listen on thousands of SIDs in each process.
> 
> Instead it makes sense to use a single SID for the entire job
> (possibly using the global Job ID as part of the SID), and
> have the private data of the REQ indicate the destination
> rank of the request.  Then each rank in the MPI job can
> listen for the combination of the global Job ID's SID and
> private data where the destination rank matches itself (using
> 1 listening CEP per process) and let the CM filter by both
> criteria and deliver the REQs to the appropriate processes.
> 
> The above scheme works very well and minimizes CM resource
> use for large MPI jobs.
> 
> I'm sure other interesting and useful examples can be found as well.
> 
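
For readers following the thread, the scheme described above (one SID per
job, destination rank carried in the REQ private data, one listen per
process) can be modeled roughly as below. This is only an illustrative
sketch in Python, not the CM interface: the names `Listen`, `matches`,
`find_listen` and `rank_listen`, the 4-byte big-endian rank at the start of
the private data, and the example SID value are all assumptions made for
the example.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Listen:
    """Hypothetical listen entry: a service ID plus a masked private-data pattern."""
    service_id: int
    data: bytes = b""   # expected private data bytes (same length as mask)
    mask: bytes = b""   # bits that must match; an empty mask matches any REQ
    owner: str = ""     # stand-in for the process that owns this listen


def matches(listen, service_id, private_data):
    """Return True if a REQ with this SID and private data matches the listen."""
    if listen.service_id != service_id:
        return False
    if len(private_data) < len(listen.mask):
        return False
    # Compare only the bits selected by the mask.
    return all((p & m) == (d & m)
               for p, d, m in zip(private_data, listen.data, listen.mask))


def find_listen(listens, service_id, private_data):
    """Return the first matching listen, or None if the REQ matches nothing."""
    for listen in listens:
        if matches(listen, service_id, private_data):
            return listen
    return None


def rank_listen(job_sid, rank):
    """One listen per MPI rank: same job SID, rank in the first 4 bytes (assumed layout)."""
    return Listen(job_sid, struct.pack(">I", rank), b"\xff" * 4, f"rank-{rank}")


JOB_SID = 0x1000000000000042  # made-up SID standing in for "SID derived from the Job ID"
listens = [rank_listen(JOB_SID, r) for r in range(4)]
req_private_data = struct.pack(">I", 2) + b"rest of private data"
target = find_listen(listens, JOB_SID, req_private_data)  # the listen owned by rank 2
```

A linear scan is used only to keep the sketch short; how the CM would
actually index listens is not something this thread specifies.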

MPI works over plain TCP right now, and yet there is no such
feature in INETD or in current socket listens. And they do not
allocate a TCP Port to listen for each connection. Rather the
same listen just accepts each connection and either creates
the process or passes the handle to a process.
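
As a rough illustration of that pattern, here is a minimal sketch of a
single TCP listen that accepts a connection and passes the handle on. It
assumes a Unix system and Python 3.9 or later (for `socket.send_fds` and
`socket.recv_fds`). The helper names `dispatch_one`, `worker_take` and
`demo` are made up for the example, and both ends run in one process only
to keep it short; in practice the receiving end would be a separate
process.

```python
import socket


def dispatch_one(listener, to_worker):
    """Accept one connection on the shared listen and pass its descriptor on."""
    conn, _addr = listener.accept()
    try:
        # Descriptor passing over a Unix-domain socket; the payload byte is a placeholder.
        socket.send_fds(to_worker, [b"c"], [conn.fileno()])
    finally:
        conn.close()  # the receiver gets its own duplicate of the descriptor


def worker_take(from_dispatcher):
    """Receive a passed descriptor and wrap it as a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(from_dispatcher, 1, 1)
    return socket.socket(fileno=fds[0])


def demo():
    """One listen, one client, one hand-off; returns what the worker side reads."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: let the OS pick
    port = listener.getsockname()[1]
    dispatcher_end, worker_end = socket.socketpair()   # Unix-domain pair by default

    client = socket.create_connection(("127.0.0.1", port))
    client.sendall(b"hello from rank 3")

    dispatch_one(listener, dispatcher_end)
    passed = worker_take(worker_end)
    data = passed.recv(64)

    for s in (client, passed, listener, dispatcher_end, worker_end):
        s.close()
    return data
```

Note that this shows the handle being passed after the accept, which is
what works for TCP; it says nothing about what an RDMA CM can or cannot do.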

There are many reasons why an established RDMA connection
cannot be passed between processes, but I know of no
reason why a Connection Request cannot be passed to a child
or third process where it can be accepted.

Why not emulate the existing solution rather than creating
a new interface that is transport specific?

Or conversely, if you truly think this is of general utility,
why not implement it in INETD as well?