[openib-general] FW: [PATCH] [RFC] librdmacm: expose device list to users

Tue Jul 25 18:40:24 PDT 2006

>I'd prefer a raw IB interface - like you said, this isn't really
>analogous to IP, and I'd like to avoid the other non-multicast issue I
>have with RDMA CM.  Also, when I first started looking at IB multicast I
>was expecting this to be part of the ibverbs interface, not a CM.

IB multicast is more of an SA interface than an IB verbs or IB CM one.  I've
already started looking at bringing the kernel ib_multicast interface up into
userspace.

>> In order to use what's there, is there any way that the processes can
>> create unique addresses to use?  Maybe map the server port numbers into
>> the address?
>
>Not sure I understand what you're asking.  Addresses to use with what?

I was trying to ask if there was any way for the processes to generate unique
addresses.  For example, what TCP port number do the processes listen on when
establishing their out of band connections?  Is there some way that you can map
the addresses that are used for out of band communication to a multicast IP
address, such that the processes get unique addresses?  From reading down into
your mail, it doesn't sound like this would help much.
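
To make the question concrete, here is a rough sketch of the kind of mapping I
had in mind.  It assumes each process has a unique 16-bit TCP listen port and
that the 239.255.0.0/16 administratively scoped range is free to use.  The
names port_to_mcast_addr() and format_addr() are made up for illustration and
are not part of any existing library.  Ports are only unique per host, so two
processes on different hosts could still collide.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: fold a 16-bit port into 239.255.x.y.
 * The result is in host byte order. */
static uint32_t port_to_mcast_addr(uint16_t port)
{
    return (239u << 24) | (255u << 16) | port;
}

/* Format a host-byte-order IPv4 address as a dotted quad. */
static void format_addr(uint32_t addr, char *buf, size_t len)
{
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)(addr >> 24) & 0xff,
             (unsigned)(addr >> 16) & 0xff,
             (unsigned)(addr >> 8) & 0xff,
             (unsigned)addr & 0xff);
}
```

With this scheme a process listening on port 5000 would get 239.255.19.136.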

>Part of why I defined ReturnMulticastAddress the way I did was because I
>thought it would be useful to hold on to multicast groups without having
>any peers joined.  These could be kept in a pool for re-use, and have
>peers join/leave them as needed.  The MVAPICH group wrote a paper on a
>similar idea, where they keep a pool of groups with all peers joined,
>then any peers not interested in communication when a group is pulled
>from the pool can pull out.  But if the time cost is in the join and not
>the initial creation, this doesn't solve anything.

The only standard-defined way to allocate an IB multicast address (i.e. MGID)
is for someone to create the multicast group.  This does an implicit join by the
creator.  In IB, the cost is in the join, since it requires programming the
switches.  The group will continue to exist as long as someone remains in the
group.

>ib_multicast.h looks good: lots of functionality packed into very few
>functions.  I don't see any problems with it... yet :)

I think the same basic API can be exposed in userspace.  It may be possible to
expose a couple of extra helper functions to simplify creating and joining a
group, but I'm not sure if they will be worth it.

>I like the callback on join completion, as opposed to polling somewhere.

This doesn't end up working well for userspace apps.  To get a callback, the
library ends up needing to create a thread to poll for events from the kernel.
It makes more sense to give the application control over the threading, and let
it poll for the events.
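
As a sketch of what application-controlled polling could look like, assume the
library hands the application a file descriptor that becomes readable when an
event (such as a join completion) is pending.  That descriptor, the one-byte
event encoding, and the name wait_for_event() are assumptions for this example
only; the real userspace interface has not been defined yet.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical: wait up to timeout_ms for one event byte on fd.
 * Returns 1 if an event was read, 0 on timeout, -1 on error. */
static int wait_for_event(int fd, int timeout_ms, unsigned char *event)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret <= 0)
        return ret;             /* 0 = timeout, -1 = poll() failed */
    if (!(pfd.revents & POLLIN))
        return -1;              /* hangup or error on the descriptor */
    return read(fd, event, 1) == 1 ? 1 : -1;
}
```

The application decides which thread calls this and when, so the library does
not need to create a thread of its own.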

>The comments don't say anything about passing an MGID of 0 in - I assume
>this functionality will be there.  Would I pass an MLID of 0 as well, or
>do I need to come up with a valid MLID from somewhere?

Well, after looking at the code, an MGID of 0 doesn't currently work.  The
implementation doesn't handle it.  I worked on a design to add support for MGID
0 to the multicast module, and will start on it in the next day or so.

Another thought I had is to allow ib_get_mcmember_rec() to be called with an MGID
of 0.  Doing so would return an MCMemberRecord with reasonable default values
that could be used when creating a group.  (The returned values would either be
hard-coded or copy those from the first join on a given port, if one had
occurred.  In almost all cases, the first join would come from ipoib.)
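
Roughly, the default selection could look like the sketch below.  The structs
are simplified stand-ins rather than the real MCMemberRecord layout, the
hard-coded values are arbitrary placeholders, and get_default_params() is a
hypothetical name, not the actual ib_get_mcmember_rec() signature.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the fields a group creator needs. */
struct mc_params {
    uint32_t qkey;
    uint16_t pkey;
    uint8_t  mtu;
    uint8_t  rate;
};

/* Per-port state: parameters of the first join seen, if any. */
struct port_state {
    bool have_first_join;
    struct mc_params first_join;
};

/* Placeholder fallback values, chosen only for the example. */
static const struct mc_params hard_coded = {
    .qkey = 0x1, .pkey = 0xffff, .mtu = 4, .rate = 3
};

static bool mgid_is_zero(const uint8_t mgid[16])
{
    static const uint8_t zero[16];
    return memcmp(mgid, zero, sizeof zero) == 0;
}

/* For a zero MGID, copy the first join's parameters if one was seen,
 * else the hard-coded ones.  Returns 0 on success, or -1 for a
 * non-zero MGID (a real lookup is outside this sketch). */
static int get_default_params(const struct port_state *port,
                              const uint8_t mgid[16],
                              struct mc_params *out)
{
    if (!mgid_is_zero(mgid))
        return -1;
    *out = port->have_first_join ? port->first_join : hard_coded;
    return 0;
}
```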

>Just to make sure, if I pass in an MGID of 0, an MGID will not only be
>allocated, but joined as well?

Correct.

>Again to be clear, ib_free_multicast() will leave the multicast group in
>question?

Correct - the function is called "free" instead of "leave" because it must be
called even if the join request failed, and may be called if the join operation
has not yet completed.
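
A toy model of why one "free" call has to cover every state is sketched below.
The types and the mc_alloc()/mc_free() names are hypothetical and only mirror
the three cases described above; they are not the ib_multicast implementation.

```c
#include <stdbool.h>
#include <stdlib.h>

enum join_state { JOIN_PENDING, JOIN_DONE, JOIN_FAILED };

/* Hypothetical handle returned when a join is requested. */
struct mc_handle {
    enum join_state state;
};

static struct mc_handle *mc_alloc(void)
{
    struct mc_handle *h = malloc(sizeof(*h));

    if (h)
        h->state = JOIN_PENDING;
    return h;
}

/* Release the handle in any state.  Returns true if the port was
 * actually in the group, i.e. a leave would have to be issued. */
static bool mc_free(struct mc_handle *h)
{
    /* JOIN_PENDING: the outstanding request would be cancelled.
     * JOIN_FAILED:  nothing to undo, only the handle to release.
     * JOIN_DONE:    the port must leave the group. */
    bool must_leave = (h->state == JOIN_DONE);

    free(h);
    return must_leave;
}
```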

>Is ib_get_mcmember_rec the interface you mentioned for determining
>whether a port is already in a multicast group?

Yes - but it requires that you already know which group (MGID) is being joined.

>Just thought of a feature that would be nice.  As-is, I have no idea
>when all peers intending to join a multicast group have done so.  What
>would be nice is some sort of notification mechanism - say the ability
>to provide a callback that is called each time a peer joins a multicast
>group.  I already know which peers I expect to join, so I can keep a
>list of which ones have/haven't joined, and mark the multicast group as
>useable when all the expected peers are joined.
>
>Would this be reasonable?  The alternative for me would be for each peer
>to send messages OOB to every other peer in the multicast group when it
>has successfully joined.

There is no way to do this.  Note that there may be a delay between a node
joining a group and the programming of the switch tables.
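
If you go the out-of-band route, the bookkeeping on each peer can stay small.
The sketch below assumes at most 64 peers identified by rank, with each
"I have joined" message delivered over your existing OOB connections; the
join_tracker type and function names are made up for illustration.  Because of
the switch programming delay, "all peers reported joined" still does not
guarantee that the first multicast sends reach everyone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker: one bit per expected peer (ranks 0..63). */
struct join_tracker {
    uint64_t expected;
    uint64_t joined;
};

static void tracker_init(struct join_tracker *t, unsigned npeers)
{
    t->expected = npeers >= 64 ? ~0ULL : (1ULL << npeers) - 1;
    t->joined = 0;
}

/* Record an OOB "joined" message from rank.  Returns true once every
 * expected peer has reported.  Unexpected ranks are ignored. */
static bool tracker_mark_joined(struct join_tracker *t, unsigned rank)
{
    if (rank < 64)
        t->joined |= (1ULL << rank) & t->expected;
    return t->joined == t->expected;
}
```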

- Sean