[openib-general] Updated IPoIB IETF WG presentation

David M. Brean David.Brean at Sun.COM
Tue Aug 3 01:43:58 PDT 2004


Hello,

Hal Rosenstock wrote:

> Hi David,
> 
> David M. Brean wrote:
> 
>>Hello,
>>
>>Some comments/questions about these slides:
>>
>>* slide 1 - nit: perhaps the title should be "Some Experience with
>>Linux IPoIB Implementations" since the information is coming from
>>Linux developers.
> 
> 
> Good point.
> 
> 
>>* slide 4 - nit: move the first bullet after the bullet containing
>>"single implementation"
> 
> 
> I reordered the bullets as suggested.
> 
> 
>>* slide 6 - nit: the first bullet should be highlighted as the
>>"problem" and the second bullet as the "solution".
> 
> 
> Done.
> 
> 
>>* slides 7 and 8 - In section 5.0 of the I-D, there is text stating that
>>the "broadcast group may be created by the first IPoIB node to be
>>initialized or it can be created administratively before the IPoIB
>>subnet is setup".  The mechanism used to administratively create the
>>group is intentionally beyond the scope of the I-D.  For example, an
>>implementation could enable the fabric (or "network" as you say)
>>administrator to control membership in a partition and therefore make
>>sure that the first node added to that partition creates the broadcast
>>group correctly.  In any case, mentioning the administrative option is
>>kind of a "helpful" hint.  All the IPoIB nodes are free to create the
>>broadcast group, just like they can create any multicast group, as
>>long as the IPoIB node has enough information to specify the necessary
>>parameters as required by the SA interface.  The I-D suggests how to
>>find the necessary parameters for the multicast groups and leaves open
>>how IPoIB nodes obtain that information if they need to create that
>>group.
>>
>>   Are these slides suggesting that the I-D be changed to specify the
>>IPoIB parameters via defaults for the case where the IPoIB node must
>>create the broadcast group?
> 
> 
> From the discussion on the group, it was stated that some may
> have interpreted the spec as requiring the pre-administered groups
> and not supporting the end node creation of a group (even
> the broadcast group if not already present)
> (at least two of them were implemented that way).
> This may not be an issue any more but I have not seen this stated
> explicitly on this email list.
> 

The slides quote text from the I-D that says an IPoIB node should
create the group if it doesn't exist and use the parameters from the
broadcast group.  What additional clarification is needed?

By the way, the I-D is written to be consistent with the language in the
IB specification and that is why JOIN and CREATE are separately
described.  However, JOIN and CREATE can be done in one SA operation and
that operation has been described on this reflector.
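
For concreteness, here is a minimal sketch of what that single
join-or-create operation could look like.  The abbreviated record layout
and the sa_set_mcmember() sender below are hypothetical stand-ins for
illustration, not a real library API:

    /* Sketch of join-or-create: one SubnAdmSet(MCMemberRecord) whose
     * component mask covers the create-required fields.  If the group
     * exists, the SA adds the port; if not, the SA creates the group. */
    #include <stdint.h>

    struct mcmember_rec {          /* abbreviated MCMemberRecord */
        uint8_t  mgid[16];         /* e.g. the broadcast-GID      */
        uint8_t  port_gid[16];     /* joining port's GID          */
        uint32_t qkey;             /* required if the SET creates */
        uint16_t pkey;
        uint8_t  mtu;              /* MTU selector + value        */
        uint8_t  tclass;
        uint32_t sl_flowlabel;     /* SL + FlowLabel (+ HopLimit) */
        uint8_t  join_state;       /* 0x1 = FullMember            */
    };

    /* Hypothetical MAD exchange with the SA; fills *rec from the reply. */
    static int sa_set_mcmember(struct mcmember_rec *rec, uint64_t comp_mask)
    {
        (void)rec; (void)comp_mask;
        return 0;                  /* stub standing in for the real send */
    }

    static int join_or_create(struct mcmember_rec *rec)
    {
        rec->join_state = 0x1;     /* FullMember join, per the I-D */
        /* Naming MGID, port GID, Q_Key, P_Key, MTU, TClass, SL/FlowLabel
         * and JoinState in the mask is what allows the SA to create the
         * group when it is absent: JOIN and CREATE in one SET. */
        return sa_set_mcmember(rec, ~0ULL /* mask for the fields above */);
    }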

> Yes, it might be good (to eliminate the need for explicit configuration)
> to select a specific controlled QKey as a default for the end node case.
> 
> 
>>   [Note, the Q_Key is provided by the broadcast group, so it isn't
>>necessary to distribute it to all IPoIB nodes.]
> 
> 
> Are you referring to "It is RECOMMENDED that a controlled Q_Key be
> used with the high order bit set." for the broadcast group (and all
> other groups
> using the broadcast group parameters) ?
> 
> Aren't there many controlled QKeys, so this still needs configuration
> somewhere?  Either at the SM/SA, or at least at one end node (if all
> the others join rather than create the broadcast group), or at all end
> nodes (if they all attempt to create this group when it is not
> present).
> 

Section 5.0 of the latest I-D says "The join operation (using the
broadcast group) returns the MTU, the Q_Key and other parameters
associated with the broadcast group. The node then associates the
parameters received as a result of the join operation with its IPoIB
interface." and in section 9.1.2 it says "The Q_Key received on joining
the broadcast group MUST be used for all IPoIB communication over the
particular IPoIB link."

So, for a particular IPoIB link there is one Q_Key and there should be
no need for explicit configuration on each IPoIB node except in the case
of the broadcast group creation.  Selection of the Q_Key value is
left to the administrator, but the I-D recommends using one in the
controlled range.  So, I don't think there is a separate Q_Key 
distribution problem in addition to the broadcast group problem 
mentioned in the slides.
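
As a sketch of what "no separate distribution problem" means in
practice: the Q_Key returned by the join can simply be programmed into
the interface's UD QP, and the controlled-range recommendation is a
one-bit check.  This assumes today's libibverbs; QP creation and the
remaining state transitions are omitted, and apply_broadcast_qkey is my
own name:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Controlled Q_Keys are those with the high-order bit set. */
    static int qkey_is_controlled(uint32_t qkey)
    {
        return (qkey & 0x80000000u) != 0;
    }

    /* Program the Q_Key learned from the broadcast-group join into the
     * UD QP, so all sends on this IPoIB link use the one Q_Key (as per
     * section 9.1.2).  For UD QPs, IBV_QP_QKEY is set during the
     * RESET->INIT transition. */
    static int apply_broadcast_qkey(struct ibv_qp *ud_qp,
                                    uint32_t joined_qkey,
                                    uint16_t pkey_index, uint8_t port_num)
    {
        struct ibv_qp_attr attr = {
            .qp_state   = IBV_QPS_INIT,
            .pkey_index = pkey_index,
            .port_num   = port_num,
            .qkey       = joined_qkey,  /* from the MCMemberRecord reply */
        };
        return ibv_modify_qp(ud_qp, &attr,
                             IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                             IBV_QP_PORT  | IBV_QP_QKEY);
    }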

> 
>>* slides 9 and 10 - "Running" may describe a state that is OS-specific
>>and therefore beyond the scope of the I-D (does the Windows network
>>interface support a "running" state?).  However, the I-D does say that
>>an IPoIB link is "formed" only when the broadcast group exists.  The
>>I-D doesn't say anything about operation in a "degraded" mode, for
>>example, when an IPoIB node can't join a multicast group.  Behavior in
>>degraded mode seems like an implementation issue.  It's not clear what
>>you would want to change in the I-D; perhaps you can suggest what you
>>want changed in the presentation.
> 
> 
> I added a bullet on interface state being OS-specific.
> 
> What I was wondering about (due to the implementations not currently
> dealing with the failure modes) was:
> 
> Is the statement "an IPoIB link is "formed" only when the broadcast group
> exists" sufficient for an IPoIB node failing to join the broadcast group ?
> 
> Perhaps it should state "From the IPoIB node perspective, the node is not
> part of the IPoIB link until (at least) the broadcast group is successfully
> joined" as well.
> 

Section 5 describes the IPoIB link setup.  It says that "Every IPoIB
interface MUST "FullMember" join the IB multicast group defined by the
broadcast-GID." and later says "Thus the IPoIB link is formed by
the IPoIB nodes joining the broadcast group."
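
As an aside, the broadcast-GID itself is derivable from the partition,
which is why a node that must create the group can construct it.  A
minimal sketch of that derivation (byte layout per my reading of the
I-D; the helper name is mine):

    #include <stdint.h>
    #include <string.h>

    /* Build the IPv4 broadcast-GID for a partition:
     *   ff12:401b:<P_Key>:0000:0000:0000:ffff:ffff
     * (link-local multicast scope, IPoIB signature 0x401B, the
     * full-membership P_Key in bytes 4-5, all-ones group ID). */
    static void ipoib_broadcast_gid(uint16_t full_pkey, uint8_t gid[16])
    {
        memset(gid, 0, 16);
        gid[0] = 0xff;                        /* multicast prefix        */
        gid[1] = 0x12;                        /* flags/scope: link-local */
        gid[2] = 0x40;                        /* IPoIB signature, high   */
        gid[3] = 0x1b;                        /* IPoIB signature, low    */
        gid[4] = (uint8_t)(full_pkey >> 8);   /* P_Key, membership bit   */
        gid[5] = (uint8_t)(full_pkey & 0xff);
        gid[12] = gid[13] = gid[14] = gid[15] = 0xff;  /* broadcast group */
    }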

I interpreted the problems discussed in the email as being related to
unclear communication behavior when IPoIB is operating in a degraded
mode.  The I-D doesn't attempt to describe that (in my opinion), but I'm
not sure that it needs to.  Perhaps that's vendor value-add.

> 
>>* slide 12 - I recall that during the email discussion:
>>1) a boot-time scenario where the IPoIB nodes had to access the SA to
>>obtain pathrecord information to fill the pathrecord cache and send
>>unicast ARP messages
> 
> 
> I didn't mention this one in the presentation, although it is
> mentioned in the bullet which states
> "Only if node has talked with other node (and cached information); otherwise
> SA interaction is currently needed"
> 

Well, I mention this scenario because the implication is that the
current mechanism does not scale.  I don't recall any comments on the
reflector about performance problems under normal operating conditions.
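
To make the scaling concern concrete, the per-destination flow being
discussed is roughly "consult the path record cache, else round-trip to
the SA before the unicast ARP can even be sent".  A minimal sketch of
that control flow; the cache layout and the query_sa_path() stub are
hypothetical stand-ins, not a real implementation:

    #include <stdint.h>
    #include <string.h>

    struct path_rec {
        uint8_t  dgid[16];
        uint16_t dlid;
        uint8_t  sl;
        int      valid;
    };

    #define CACHE_SLOTS 256
    static struct path_rec cache[CACHE_SLOTS];

    static unsigned slot_for(const uint8_t dgid[16])
    {
        unsigned h = 5381;
        for (int i = 0; i < 16; i++)
            h = h * 33 + dgid[i];
        return h % CACHE_SLOTS;
    }

    /* Stub standing in for a real SubnAdmGet(PathRecord) round trip. */
    static int query_sa_path(const uint8_t dgid[16], struct path_rec *pr)
    {
        memcpy(pr->dgid, dgid, 16);
        pr->dlid = 1;                /* placeholder values            */
        pr->sl   = 0;
        return 0;                    /* -1 would mean SA unreachable  */
    }

    static int resolve_path(const uint8_t dgid[16], struct path_rec *out)
    {
        struct path_rec *pr = &cache[slot_for(dgid)];
        if (pr->valid && memcmp(pr->dgid, dgid, 16) == 0) {
            *out = *pr;              /* cache hit: no SA interaction  */
            return 0;
        }
        if (query_sa_path(dgid, pr) < 0)
            return -1;               /* every miss costs an SA query:
                                      * the boot-time load noted above */
        pr->valid = 1;
        *out = *pr;
        return 0;
    }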

> 
>>2) a SM failover/restart scenario
>>
>>   For #1, the speed at which the IPoIB nodes can begin normal
>>operation depends on the fabric and SA implementation.  I guess the
>>question is whether this is an architecture or implementation
>>problem.  Is it
>>impossible to implement a working system based on the current
>>architecture?  I think the proposed alternative would require changes
>>to the encapsulation scheme plus specifying some defaults such as the
>>SL so that SA queries are eliminated.  Some of that might require
>>input from the IBTA.
>>
>>   For #2, how long is too long for a subnet to operate without
>>successful SA queries?  10 seconds?  20 seconds?
> 
> 
> Don't know. Perhaps there are some on this list with opinions on this.
> 
> 
>>Or is this change suggesting that the subnet should continue
>>operating, perhaps establishing new IP connections (note, this
>>proposal doesn't attempt to fix the situation at the IB transport
>>level), even in the case where no SA exists?  Please clarify in the
>>slides.
> 
> 
> The intent is to continue operation for all IPoIB nodes currently
> on the subnet (in the absence of any changes) during the window
> in which no SM/SA exists.
> 

Yes, but the duration of the window depends on the SM implementation and
fabric configuration.  If you are going to suggest that the protocol be
redesigned, then you need to explain why the architecture is
unimplementable.  [The alternative represents a significant change at
this point and the proposals that I've seen are insufficient.]

-David

> 
>>* slide 13 - An IB CA should perform as well as a "dumb" ethernet NIC
>>with respect to bandwidth and CPU utilization.  If not, someone should
>>look at the overheads in the IB access layer and the CA
>>implementation, right?  The statement "not equivalent to ethernet" is
>>highlighting the
>>lack offload mechanisms in the CA such as checksum, correct?  If so,
>>perhaps that point should be made explicit.
> 
> 
> Another lack of clarity. I did mean "dumb" ethernet and not anything more
> sophisticated with checksum offload, etc. That's a separate issue.
> I made this into 2 slides in the next version of this presentation.
> 
> 
>>Note, I'm not attempting to respond to the issues raised on the slides
>>since that will happen at the meeting, but merely seeking
>>clarification of the issues being raised.
> 
> 
> Understood. Thanks for your comments. I think the (hopefully) added
> clarity will help.
> 
> -- Hal
> 