[openib-general] Updated IPoIB IETF WG presentation

Hal Rosenstock halr at voltaire.com
Tue Aug 3 07:00:09 PDT 2004


David M. Brean wrote:
>> Hal Rosenstock wrote:
>> From the discussion on the group, it was stated that some may
>> have interpreted the spec as requiring the pre-administered groups
>> and not supporting the end node creation of a group (even
>> the broadcast group if not already present)
>> (at least that's the way at least two were implemented).
>> This may not be an issue any more but I have not seen this stated
>> explicitly on this email list.
>>
>
> The slides quote text from the I-D that says that an IPoIB node should
> create the group if it doesn't exist and use parameters from the
> broadcast group.  What additional clarification is needed?

IMO none. As I mentioned, some others in this group were unsure.

> By the way, the I-D is written to be consistent with the language in
> the IB specification and that is why JOIN and CREATE are separately
> described.  However, JOIN and CREATE can be done in one SA operation
> and that operation has been described on this reflector.

Understood.
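
A minimal sketch of the combined JOIN/CREATE behaviour described above, using a toy in-memory table in place of a real SA. The names `ToySA`, `GroupParams` and `create_params` are hypothetical and introduced only for illustration; the sketch does not model the actual SA exchange, and returning the existing parameters to a joiner that offered different ones is an assumption of this toy model.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class GroupParams:
    """Parameters a joiner gets back for a group (illustrative subset)."""
    mtu: int
    q_key: int


class ToySA:
    """Hypothetical in-memory stand-in for the SA's multicast group table."""

    def __init__(self) -> None:
        self._groups: Dict[str, Tuple[GroupParams, Set[str]]] = {}

    def join(self, mgid: str, member: str,
             create_params: Optional[GroupParams] = None) -> GroupParams:
        """Join mgid; create it first if absent and create_params is given."""
        if mgid not in self._groups:
            if create_params is None:
                raise LookupError(f"group {mgid} does not exist")
            self._groups[mgid] = (create_params, set())
        params, members = self._groups[mgid]
        members.add(member)
        # In this toy model a joiner always receives the existing group's
        # parameters, even if it offered different ones for creation.
        return params
```

With this shape, every node can issue the same call at start-up: the first caller creates the broadcast group and all later callers simply join it.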

>> Yes, it might be good (to eliminate the need for explicit
>> configuration) to select a specific controlled QKey as a default
>> for the end node case.
>>
>>
>>>   [Note, Q_Key is provided by broadcast group, so it isn't necessary
>>> to distribute to all IPoIB nodes.]
>>
>>
>> Are you referring to "It is RECOMMENDED that a controlled Q_Key be
>> used with the high order bit set." for the broadcast group (and all
>> other groups using the broadcast group parameters) ?
>>
>> Aren't there many controlled QKeys, so this still needs configuration
>> somewhere: either at the SM/SA, or at at least one end node if all
>> the others join rather than create the broadcast group (or at all end
>> nodes if they all attempt to create this group when not present) ?
>>
>
> Section 5.0 of the latest I-D says "The join operation (using the
> broadcast group) returns the MTU, the Q_Key and other parameters
> associated with the broadcast group. The node then associates the
> parameters received as a result of the join operation with its IPoIB
> interface." and in section 9.1.2 it says "The Q_Key received on
> joining the broadcast group MUST be used for all IPoIB communication
> over the particular IPoIB link."
>
> So, for a particular IPoIB link there is one Q_Key and there should be
> no need for explicit configuration on each IPoIB node except in the
> case of the broadcast group creation.  Selection of the Q_Key value is
> left to the administrator, but the I-D recommends using one in the
> controlled range.  So, I don't think there is a separate Q_Key
> distribution problem in addition to the broadcast group problem
> mentioned in the slides.

Agreed. Was this mentioned somewhere else in the slides ?
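
As a rough illustration of the two points quoted above (the interface takes its Q_Key from the broadcast group join, and a controlled Q_Key has the high order bit set), here is a small sketch. `IPoIBInterface`, `apply_broadcast_join` and the constant name are hypothetical; only the high-order-bit test comes from the quoted text.

```python
from dataclasses import dataclass
from typing import Optional

CONTROLLED_QKEY_BIT = 0x80000000  # high-order bit of the 32-bit Q_Key


def is_controlled_qkey(q_key: int) -> bool:
    """True if the Q_Key has its high-order bit set."""
    return bool(q_key & CONTROLLED_QKEY_BIT)


@dataclass
class IPoIBInterface:
    """Hypothetical per-link interface state."""
    name: str
    mtu: Optional[int] = None
    q_key: Optional[int] = None

    def apply_broadcast_join(self, mtu: int, q_key: int) -> None:
        """Associate the parameters returned by the broadcast join."""
        self.mtu = mtu
        self.q_key = q_key
```

Only the node that creates the broadcast group needs a configured Q_Key; every other node just stores what the join returns.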

>>> * slides 9 and 10 - "Running" may be the description of a state that
>>> is OS specific and beyond the scope of the I-D (does the Windows
>>> network interface support a "running" state?).  However, the I-D
>>> does say that an IPoIB link is "formed" only when the broadcast
>>> group exists.  The I-D doesn't say anything about operation in a
>>> "degraded" mode, for example, when an IPoIB node can't join a
>>> multicast group.  Behavior in degraded mode seems like an
>>> implementation issue.  It's not clear what you would want to change
>>> in the I-D; perhaps you can suggest what you want changed in the
>>> presentation.
>>
>>
>> I added in a bullet on interface state being OS specific.
>>
>> What I was wondering about (due to the implementations not currently
>> dealing with the failure modes) was:
>>
>> Is the statement "an IPoIB link is "formed" only when the broadcast
>> group exists" sufficient for an IPoIB node failing to join the
>> broadcast group ?
>>
>> Perhaps it should state "From the IPoIB node perspective, the node
>> is not part of the IPoIB link until (at least) the broadcast group
>> is successfully joined" as well.
>>
>
> Section 5 describes the IPoIB link setup.  It says that "Every IPoIB
> interface MUST "FullMember" join the IB multicast group defined by the
> broadcast-GID." and later says "Thus the IPoIB link is formed by
> the IPoIB nodes joining the broadcast group."
>
> I interpreted the problems discussed in the email as being related to
> unclear behavior of communication when IPoIB is operating in a
> degraded mode.  The I-D doesn't attempt to describe that (in my
> opinion), but I'm not sure that it needs to.  Perhaps that's vendor
> value add.

There were 2 aspects to the degraded operation. One was the failure to
join a critical group (like the broadcast group, which is covered by
the statement you cite) as opposed to a non-critical one. The other was
whether any other groups are "critical" or whether the broadcast group
is the only one.
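
One way an implementation might encode the distinction above is sketched below. The state names (`NOT_ON_LINK`, `DEGRADED`, `UP`) and the idea of a configurable `critical` set are assumptions made for illustration; the I-D itself only says that the broadcast group must be joined.

```python
from enum import Enum
from typing import Dict, FrozenSet


class LinkState(Enum):
    NOT_ON_LINK = "not on link"
    DEGRADED = "degraded"
    UP = "up"


def link_state(join_ok: Dict[str, bool], broadcast: str,
               critical: FrozenSet[str] = frozenset()) -> LinkState:
    """Classify a node from its per-group join results.

    join_ok maps a group name to whether the join succeeded. The
    broadcast group is always required; 'critical' lists any further
    groups an implementation chooses to treat the same way.
    """
    required = {broadcast} | set(critical)
    if not all(join_ok.get(group, False) for group in required):
        return LinkState.NOT_ON_LINK
    if not all(join_ok.values()):
        return LinkState.DEGRADED
    return LinkState.UP
```

Leaving `critical` empty corresponds to treating the broadcast group as the only critical one.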

>>> * slide 12 - I recall that during the email discussion:
>>> 1) a boot-time scenario where the IPoIB nodes had to access the SA
>>> to obtain pathrecord information to fill the pathrecord cache and
>>> send unicast ARP messages
>>
>>
>> I didn't mention this one in the presentation, although it is
>> covered by the bullet which states
>> "Only if node has talked with other node (and cached information);
>> otherwise SA interaction is currently needed"
>>
>
> Well, I mention this scenario because the implication is that the
> current mechanism does not scale.  I don't recall any comments on the
> reflector about performance problems under normal operating conditions.

I believe boot up has been mentioned by some people on the list. I perceive
this as an SA performance issue in not being able to keep up with the
transaction rate in a large cluster.

Do you think I should add this as a performance concern (not an IPoIB one,
but related to IPoIB) ?
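
To make the boot-time concern concrete, here is a small sketch of a path record cache that counts how often it has to fall back to the SA. `PathCache`, `boot_storm_queries` and the string "path records" are hypothetical stand-ins; no real SA query is performed.

```python
from typing import Callable, Dict


class PathCache:
    """Hypothetical per-node cache of path records keyed by destination."""

    def __init__(self, sa_query: Callable[[str], str]) -> None:
        self._sa_query = sa_query
        self._cache: Dict[str, str] = {}
        self.sa_queries = 0  # number of times the SA had to be asked

    def resolve(self, dst: str) -> str:
        """Return the cached path for dst, querying the SA on a miss."""
        if dst not in self._cache:
            self.sa_queries += 1
            self._cache[dst] = self._sa_query(dst)
        return self._cache[dst]


def boot_storm_queries(nodes: int, peers_per_node: int) -> int:
    """SA queries issued if every node boots with a cold cache."""
    total = 0
    for _ in range(nodes):
        cache = PathCache(lambda dst: f"path-to-{dst}")
        for peer in range(peers_per_node):
            cache.resolve(f"peer-{peer}")
        total += cache.sa_queries
    return total
```

With cold caches the query count grows as nodes times peers (100 nodes each resolving 10 peers gives 1000 queries), whereas a node that has already talked to a peer issues none. The same cache is what lets existing nodes keep communicating during a window with no SM/SA.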

>>> 2) a SM failover/restart scenario
>>>
>>>   For #1, the speed at which the IPoIB nodes can begin normal
>>> operation depends on the fabric and SA implementation.  I guess the
>>> question is whether this is an architecture or implementation
>>> problem.  Is it impossible to implement a working system based on
>>> the current architecture?  I think the proposed alternative would
>>> require changes to the encapsulation scheme plus specifying some
>>> defaults such as the SL so that SA queries are eliminated.  Some of
>>> that might require input from the IBTA.
>>>
>>>   For #2, how long is too long for a subnet to operate without
>>> successful SA queries?  10 seconds?  20 seconds?
>>
>>
>> Don't know. Perhaps there are some on this list with opinions on this.
>>
>>
>>> Or is this change
>>> suggesting that the subnet should continue operating, perhaps
>>> establishing new IP connections (note, this proposal doesn't attempt
>>> to fix the situation at the IB transport level), even in the case
>>> where no SA exists?  Please clarify in the slides.
>>
>>
>> The intent is to continue operation for all IPoIB nodes currently
>> on the subnet (in the absence of any changes) during the window
>> when no SM/SA exists.
>>
>
> Yes, but the duration of the window depends on the SM implementation
> and fabric configuration.  If you are going to suggest that the
> protocol be redesigned, then you need to explain how the architecture
> is unimplementable.  [The alternative represents a significant change
> at this point and the proposals that I've seen are insufficient.]

Agreed. Much more work (and detail) needs to be done here.
I still think it is worth mentioning to plant the seed.

-- Hal



