[openib-general] Updated IPoIB IETF WG presentation

Hal Rosenstock halr at voltaire.com
Mon Aug 2 06:59:25 PDT 2004


Hi David,

David M. Brean wrote:
> Hello,
>
> Some comments/questions about these slides:
>
> * slide 1 - nit: perhaps the title should be "Some Experience with
> Linux IPoIB Implementations" since the information is coming from
> Linux developers.

Good point.

> * slide 4 - nit: move the first bullet after bullet containing "single
> implementation"

I reordered the bullets as suggested.

> * slide 6 - nit: first bullet should be highlighted as the "problem"
> and the second bullet as the "solution".

Done.

> * slide 7 and 8 - In section 5.0 of the I-D, there is text stating
> that the "broadcast group may be created by the first IPoIB node to be
> initialized or it can be created administratively before the IPoIB
> subnet is setup".  The mechanism used to administratively create the
> group is intentionally beyond the scope of the I-D.  For example, an
> implementation could enable the fabric (or "network" as you say)
> administrator to control membership in a partition and therefore make
> sure that the first node added to that partition creates the broadcast
> group correctly.  In any case, mentioning the administrative option is
> kinda a "helpful" hint.  All the IPoIB nodes are free to create the
> broadcast group, just like they can create any multicast group, as
> long as the IPoIB node has enough information to specify the necessary
> parameters as required by the SA interface.  The I-D suggests how to
> find the necessary parameters for the multicast groups and leaves open
> how IPoIB nodes obtain that information if they need to create that
> group.
>
>    Are these slides suggesting that the I-D be changed to specify the
> IPoIB parameters via defaults for the case where the IPoIB node must
> create the broadcast group?

From the discussion on the group, it was stated that some may have
interpreted the spec as requiring pre-administered groups and as not
supporting end node creation of a group (even the broadcast group, if
not already present). At least two implementations were done that way.
This may no longer be an issue, but I have not seen that stated
explicitly on this email list.

Yes, it might be good (to eliminate the need for explicit configuration)
to select a specific controlled QKey as a default for the end node case.

>    [Note, Q_Key is provided by broadcast group, so it isn't necessary
> to distribute to all IPoIB nodes.]

Are you referring to "It is RECOMMENDED that a controlled Q_Key be used
with the high order bit set." for the broadcast group (and all other
groups using the broadcast group parameters)?

Aren't there many controlled Q_Keys, so this still needs configuration
somewhere: either at the SM/SA or at one or more end nodes? If all the
other end nodes join rather than create the broadcast group, configuring
one end node is enough; otherwise, if they all attempt to create this
group when it is not present, every end node needs the configuration.
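
To make the Q_Key point concrete, here is a minimal Python sketch (not
taken from any implementation). The only fact it relies on is that a
Q_Key is a 32-bit value and is "controlled" when its high order bit is
set. EXAMPLE_DEFAULT_QKEY is a made-up value for illustration only; it
is not a default from the I-D or from any implementation.

```python
# Q_Keys are 32-bit values; the high order bit marks a controlled Q_Key.
CONTROLLED_QKEY_BIT = 0x80000000
QKEY_MASK = 0xFFFFFFFF

# Hypothetical value for illustration only; not a value from the I-D.
EXAMPLE_DEFAULT_QKEY = 0x80000001


def is_controlled_qkey(qkey: int) -> bool:
    """Return True if qkey is a 32-bit Q_Key with the high order bit set."""
    if not 0 <= qkey <= QKEY_MASK:
        raise ValueError("Q_Key must fit in 32 bits")
    return bool(qkey & CONTROLLED_QKEY_BIT)


def count_controlled_qkeys() -> int:
    """Number of distinct controlled Q_Keys (the low 31 bits are free)."""
    return 1 << 31
```

With 2**31 possible controlled values, "use a controlled Q_Key" alone
does not tell a creating node which one to pick, which is why a single
agreed default would remove the need for configuration.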

> * slide 9 and 10 - "Running" may be the description of a state that is
> OS specific and beyond the scope of the I-D (does Windows network
> interface support a "running" state?).  However, the I-D does say that
> an IPoIB link is "formed" only when the broadcast group exists.  The
> I-D doesn't say anything about operation in a "degraded" mode, for
> example, when an IPoIB node can't join a multicast group.  Behavior in
> degraded mode seems like an implementation issue.  It's not clear what
> you would want to change in the I-D, perhaps you can suggest what you
> want changed in the presentation.

I added in a bullet on interface state being OS specific.

What I was wondering about (due to the implementations not currently
dealing with the failure modes) was:

Is the statement "an IPoIB link is "formed" only when the broadcast group
exists" sufficient for an IPoIB node failing to join the broadcast group ?

Perhaps it should state "From the IPoIB node perspective, the node is not
part of the IPoIB link until (at least) the broadcast group is successfully
joined" as well.
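
A rough Python sketch of that node-side rule, covering both the
join-only and the join-or-create behavior discussed above. FakeSA,
IPoIBNode, bring_up and the "bcast" group name are all hypothetical
stand-ins for illustration; none of them correspond to a real SA
interface or to code in any of the implementations.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSA:
    """Stand-in for the SA: remembers which groups exist and who joined."""
    groups: dict = field(default_factory=dict)
    reachable: bool = True

    def join(self, mgid, node):
        # Joining fails if the SA is unreachable or the group does not exist.
        if not self.reachable or mgid not in self.groups:
            return False
        self.groups[mgid].add(node)
        return True

    def create(self, mgid, node):
        # Creating a group also makes the creator a member.
        if not self.reachable:
            return False
        self.groups.setdefault(mgid, set()).add(node)
        return True


@dataclass
class IPoIBNode:
    name: str
    link_up: bool = False

    def bring_up(self, sa, bcast_mgid, may_create=True):
        """The node is part of the IPoIB link only after a successful join."""
        joined = sa.join(bcast_mgid, self.name)
        if not joined and may_create:
            joined = sa.create(bcast_mgid, self.name)
        self.link_up = joined
        return self.link_up
```

A node run with may_create=False models the "pre-administered groups
only" interpretation: it stays off the link until some other party has
created the broadcast group.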

> * slide 12 - I recall that during the email discussion:
> 1) a boot-time scenario where the IPoIB nodes had to access the SA to
> obtain pathrecord information to fill the pathrecord cache and send
> unicast ARP messages

I didn't mention this one explicitly in the presentation, although it is
covered by the bullet which states:
"Only if node has talked with other node (and cached information);
otherwise SA interaction is currently needed"

> 2) a SM failover/restart scenario
>
>    For #1, the speed at which the IPoIB nodes can begin normal
> operation depends on the fabric and SA implementation.  I guess the
> question is whether this is an architecture or implementation problem.
> Is it impossible to implement a working system based on the current
> architecture?  I think the proposed alternative would require changes
> to the encapsulation scheme plus specifying some defaults such as the
> SL so that SA queries are eliminated.  Some of that might require
> input from the IBTA.
>
>    For #2, how long is too long for a subnet to operate without
> successful SA queries?  10 seconds?  20 seconds?

Don't know. Perhaps there are some on this list with opinions on this.

> Or is this change suggesting that the subnet should continue
> operating, perhaps establishing new IP connections (note, this
> proposal doesn't attempt to fix the situation at the IB transport
> level) even in the case where no SA exists?  Please clarify in the
> slides.

The intent is to continue operation for all IPoIB nodes currently
on the subnet (in the absence of any changes) during the window
when no SM/SA exists.
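
As a sketch of that intent (in Python, and not how any current
implementation is structured): a node keeps using path information it
has already cached, and only destinations it has never talked to are
affected while the SA is absent. PathCache, sa_query and the contents of
the path record are hypothetical names chosen for illustration.

```python
class PathCache:
    """Cache of path records so known peers stay reachable without the SA."""

    def __init__(self, sa_query):
        # sa_query: callable taking a destination and returning a path
        # record, or None when the SA cannot be reached (assumed interface).
        self._sa_query = sa_query
        self._cache = {}

    def resolve(self, dest):
        # Known peers are served from the cache, with no SA interaction.
        if dest in self._cache:
            return self._cache[dest]
        # Unknown peers still need the SA; this fails while no SA exists.
        record = self._sa_query(dest)
        if record is not None:
            self._cache[dest] = record
        return record
```

This only holds "in the absence of any changes": a cached record is
assumed to remain valid for as long as no SM/SA is there to say otherwise.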

> * slide 13 - An IB CA should perform as well as a "dumb" ethernet NIC
> with respect to bandwidth and CPU utilization.  If not, someone should
> look at the overheads in the IB access layer and the CA
> implementation, right?  The statement "not equivalent to ethernet" is
> highlighting the lack of offload mechanisms in the CA such as
> checksum, correct?  If so, perhaps that point should be made explicit.

Another lack of clarity. I did mean "dumb" ethernet and not anything more
sophisticated with checksum offload, etc. That's a separate issue.
I made this into 2 slides in the next version of this presentation.

> Note, I'm not attempting to respond to the issues raised on the slides
> since that will happen at the meeting, but merely seeking
> clarification of the issues being raised.

Understood. Thanks for your comments. I think the (hopefully) added
clarity will help.

-- Hal



