[ofa-general] SubnAdmGet (6777)

Hal Rosenstock hal.rosenstock at gmail.com
Wed Jun 3 11:08:41 PDT 2009


On Wed, Jun 3, 2009 at 1:11 PM, Bob Ciotti <Bob.Ciotti at nasa.gov> wrote:
> On Wed, Jun 03, 2009 at 06:03:50AM -0500, Eli Dorfman (Voltaire) wrote:
>> Eli Dorfman (Voltaire) wrote:
>> > Hal Rosenstock wrote:
>> >> On Mon, Jun 1, 2009 at 5:36 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>> >>> On Mon, Jun 1, 2009 at 4:27 PM, Sean Hefty <sean.hefty at intel.com> wrote:
>> >>>>> Yes, RMPP is an overhead when the response is a single MAD but is this
>> >>>>> significant ? Anyhow, how can the spec be changed in a way that
>> >>>>> doesn't break existing implementations ?
>> >>>> But the implementations are assuming different things about SubnAdmGet.  The SA
>> >>>> is assuming that the query should fail if multiple records match.  The client
>> >>>> side software (ipoib and rdma_cm) assume that it will obtain a single record
>> >>>> even if multiple paths are present.  So, something needs to change.
>> >>> Seems so.
>> >>>
>> >>>> The spec indicates that value in the request is ignored and NumbPath is 1, not
>> >>>> that NumbPath is completely ignored.
>> >>> For Get, it doesn't say that the matches are pared down to this
>> >>> number as it does for GetTable.
>> >>>
>> >>>>  Also see page 1242 in the SDP annex which
>> >>>> reads: 'NumbPath could be 1 (in which case the SA query may use SubnAdmGet
>> >>>> rather than SubnAdmGetTable)'.
>> >>> SDP annex is not the primary source for this (chapter 15 is) and is
>> >>> inconsistent and no one caught this.
>> >>>
>> >>>> To me, this implies that SubnAdmGet should be
>> >>>> treated as equivalent to SubnAdmGetTable with NumbPath = 1.
>> >>>> It just seems really odd to treat NumbPath differently for PR SubnAdmGet versus
>> >>>> PR SubnAdmGetTable and MPR SubnAdmGetMulti.  Basically, this makes PR SubnAdmGet
>> >>>> useless.
>> >>> when there's a subnet with multiple paths and the requests are not
>> >>> specific enough to use get.
>> >>>
>> >>> Seems like either the queries need to use RMPP, or the spec modified
>> >>> (if that's possible) and the SAs updated.
>> >> I sit corrected :-) Your interpretation of the spec is correct. Also,
>> >> in looking at OpenSM, the intent is as you indicate: it does try to
>> >> only return 1 attribute for get PR. If when returning the response,
>> >> there is more than 1 attribute in the list, it returns the too many
>> >> records error. There must be some code path I don't see right now
>> >> which is doing this. It would be useful to know the details of the
>> >> query (get request) causing this.
>> >>
>> >
>> > This may happen when pr_rcv_get_port_pair_paths() is called several times.
>> > The only case I see is pr_rcv_process_world(), which means the request has a missing or wrong
>> > src and dest port, or the component mask for SGID and DGID is 0.
>>
>> correction - this may happen only when component mask for SGID and DGID is 0.
>
> Here is a mad dump of the offending sequence.
>
> Jun 02 12:43:01 355975 [3DD13940] 0x80 -> SUBNET UP
> Jun 02 12:43:03 484480 [5020B940] 0x20 -> SA MAD dump:
>                                base_ver................0x1
>                                mgmt_class..............0x3
>                                class_ver...............0x2
>                                method..................0x1 (SubnAdmGet)
>                                status..................0x0
>                                resv....................0x0
>                                trans_id................0x2b82ad0a0000
>                                attr_id.................0x11 (NodeRecord)
>                                resv1...................0x0
>                                attr_mod................0x0
>                                rmpp_version............0x0
>                                rmpp_type...............0x0
>                                rmpp_flags..............0x0
>                                rmpp_status.............0x0
>                                seg_num.................0x0
>                                payload_len/new_win.....0x0
>                                sm_key..................0x0000000000000000
>                                attr_offset.............0x0
>                                resv2...................0x0
>                                comp_mask...............0x0000000000000001

Looks like a LID match on NodeRecord is being requested (comp_mask 0x1)
but more than one record matches. Returning an error (status 0x400, too
many records) is correct for this case.
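To make the Get semantics concrete, here is a minimal sketch in Python. It is not OpenSM code: the function name, the record layout, and the LID values are hypothetical, and the status values are the SA class-specific codes as I understand them (0x0400 is what the response dump shows). It only illustrates that a Get with the LID bit set in comp_mask succeeds when exactly one record matches and fails otherwise.

```python
# Illustrative sketch only; names and record layout are hypothetical.
NR_COMPMASK_LID = 1 << 0             # comp_mask 0x1 in the dump: match on LID
SA_STATUS_OK = 0x0000
SA_STATUS_NO_RECORDS = 0x0300        # assumed SA class-specific status
SA_STATUS_TOO_MANY_RECORDS = 0x0400  # status seen in the GetResp dump


def subn_adm_get(records, comp_mask, query):
    """Return (status, record): Get needs exactly one matching record."""
    matches = [
        r for r in records
        # A component only constrains the match when its comp_mask bit is set.
        if not (comp_mask & NR_COMPMASK_LID) or r["lid"] == query["lid"]
    ]
    if not matches:
        return SA_STATUS_NO_RECORDS, None
    if len(matches) > 1:
        return SA_STATUS_TOO_MANY_RECORDS, None
    return SA_STATUS_OK, matches[0]


# Hypothetical SA database in which two NodeRecords share LID 4.
db = [
    {"lid": 4, "node": "a"},
    {"lid": 4, "node": "b"},
    {"lid": 9, "node": "c"},
]
status, rec = subn_adm_get(db, NR_COMPMASK_LID, {"lid": 4})
print(hex(status))
```

A client that can legitimately see several matches would need GetTable (and RMPP) rather than Get.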

Any idea on what component is issuing the gets on NodeRecord ?

By any chance are you using LMC > 0 ?
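For background on that question: with LMC = n, a port answers to 2^n consecutive LIDs starting at its base LID. The sketch below (Python, hypothetical helper name and values) just enumerates that range. Whether LMC is actually what causes the multiple NodeRecord matches here is not established yet.

```python
def lids_for_port(base_lid, lmc):
    """List the LIDs a port answers to for a given base LID and LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit value (0..7)")
    if base_lid & ((1 << lmc) - 1):
        raise ValueError("base LID must have its low LMC bits clear")
    # 2**lmc consecutive LIDs, starting at the base LID.
    return list(range(base_lid, base_lid + (1 << lmc)))


# Hypothetical port with base LID 0x10 and LMC 2.
print(lids_for_port(0x10, 2))
```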

-- Hal

> [19323940] 0x20 -> SA MAD dump:
>                                base_ver................0x1
>                                mgmt_class..............0x3
>                                class_ver...............0x2
>                                method..................0x81 (SubnAdmGetResp)
>                                status..................0x400
>                                resv....................0x0
>                                trans_id................0x2b82ad0a0000
>                                attr_id.................0x11 (NodeRecord)
>                                resv1...................0x0
>                                attr_mod................0x0
>                                rmpp_version............0x0
>                                rmpp_type...............0x0
>                                rmpp_flags..............0x0
>                                rmpp_status.............0x0
>                                seg_num.................0x0
>                                payload_len/new_win.....0x0
>                                sm_key..................0x0000000000000000
>                                attr_offset.............0x0
>                                resv2...................0x0
>                                comp_mask...............0x0000000000000001
>
>
> bob
>

