Re: [ofa-general] Any easy way to specify to the SM to route/zone?

Chris Worley worleys at gmail.com
Mon Apr 13 14:50:49 PDT 2009


On Mon, Apr 13, 2009 at 3:01 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> On Mon, Apr 13, 2009 at 4:09 PM, Chris Worley <worleys at gmail.com> wrote:
>> On Mon, Apr 13, 2009 at 12:52 PM, Hal Rosenstock
>> <hal.rosenstock at gmail.com> wrote:
>>> On Mon, Apr 13, 2009 at 2:26 PM, Chris Worley <worleys at gmail.com> wrote:
>>>> On Mon, Apr 13, 2009 at 11:53 AM, Hal Rosenstock
>>>> <hal.rosenstock at gmail.com> wrote:
>>>>> On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley <worleys at gmail.com> wrote:
>>>>>> On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>>>>>>> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley <worleys at gmail.com> wrote:
>>>>>>>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock
>>>>>>>> <hal.rosenstock at gmail.com> wrote:
>>>>>>>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley <worleys at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> So I need to tell the SM to route specific ports on the server/target
>>>>>>>>>> to specific clients/initiators.
>>>>>>>>>>
>>>>>>>>>> Is there any way to do this?
>>>>>>>>>
>>>>>>>>> Do you mean restrict access between certain clients/servers ?
>>>>>>>>
>>>>>>>> One server w/ 4 QDR boards, 16 clients with one QDR board each.  I
>>>>>>>> want each port on the server routed/zoned to two clients.
>>>>>>>>
>>>>>>>>> If so,
>>>>>>>>> you can do this with partitioning
>>>>>>>>
>>>>>>>> What is partitioning?
>>>>>>>
>>>>>>> A partition is a collection of ports which are allowed to communicate
>>>>>>> together. There are two forms of members: full members which can talk
>>>>>>> to any other member (useful for servers) and limited members which can
>>>>>>> only talk to full members (useful for clients). See the opensm man
>>>>>>> page or partition-config.txt on setting this up for OpenSM.
>>>>>>>
>>>>>>
>>>>>> Let me see if I understand this with a simple example... my port GUIDs
>>>>>> (as reported by ibstat) are for one server (4 QDR ports) and four
>>>>>> clients (one QDR port each):
>>>>>>
>>>>>>
>>>>>> Server A:           Port GUID: 0x0024717124000029
>>>>>> Server B:           Port GUID: 0x002471712400002a
>>>>>> Server C:           Port GUID: 0x0024717127000035
>>>>>> Server D:           Port GUID: 0x0024717127000036
>>>>>>
>>>>>> Client 1:                Port GUID: 0x0002c90300028c01
>>>>>> Client 2:                Port GUID: 0x0002c90300026047
>>>>>> Client 3:                Port GUID: 0x0002c90300026053
>>>>>> Client 4:                Port GUID: 0x0002c9030002603b
>
> Is there a switch in between or just back to back HCA ports ?

Yes, there's a switch; it's not directly connected from port to port.
In the end, there will be 2 or 4 clients per server port (this simple
configuration is just to get me going), so a switch is needed.

>
>>>>>>
>>>>>> Assuming I want a 1:1 (one server port to one client) partitioning, I
>>>>>> would put the following in /etc/ofed/partitions.conf:
>>>>>>
>>>>>> part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01;
>>>>>> part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047;
>>>>>> part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053;
>>>>>> part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b;
>>>>>
>>>>> So you want IPoIB.
>>>>
>>>> I'm doing SRP, so I need IPoIB working.
>>>
>>> SRP needs to query PathRecord with the correct PKey and use the
>>> correct Pkey index for that partition. I'm not sure how that is done
>>> in SRP but first IPoIB needs to be made to work (again).
>>>
>>
>> Okay... I'll set up IPoIB as ipoib.txt suggests, i.e.:
>>
>> echo 0x1 > /sys/class/net/ib0/create_child
>>
>> ... but for now, I'm still not seeing the state go to "up"... I think
>> that's the first problem.
>
> Yes, port state needs to be linkup/active first. I see LinkUp/Armed from below.
>
>>>>>
>>>>>> ... and run w/:
>>>>>>
>>>>>> opensm -r -B -P/etc/ofed/partitions.conf
>>>
>>> Also, do you need to use -r ? It's better not to (reassign LIDs).
>>
>> I'm using it to assure that it just doesn't hang on to the old state,
>> especially since I'm not getting the SM working...
>
> OK.
>
>> I don't want it to
>> assume anything is right about the previous state.
>>
>> I have tried w/ and w/o and don't see a difference.
>>
>> The plan is, once I get it working, to remove the "-r".
>
> That's fine.
>
>>  Or, are you suggesting I not use it?
>>
>>>
>>>>>> Does that sound correct?  It doesn't work
>>>>>
>>>>> What application(s) aren't working ?
>>>>
>>>> ping over IPoIB, for example.
>>>>
>>>> I am seeing the test node in an "initializing" state right now... I
>>>> thought it was "up" before.
>>>
>>> Yes, this has gone "backwards" (not as far along yet...)
>>>
>>
>> I think getting to an "up" state is the first step.
>
> Were the ports getting to LinkUp/Active before partitions were configured ?

Yes, before I started trying to partition, all the nodes could
communicate... except they'd all use just one port on the server and I
couldn't get the throughput I needed.

>
>>>>> Any SM error messages ?
>>>>
>>>> The server has one klogd error coming out continuously:
>>>>
>>>> ib0: multicast join failed for
>>>> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
>>>
>>> IPoIB broadcast group (on the default partition) can't be joined (I'm
>>> presuming due to the current partition setup (e.g. it worked prior to
>>> this, right ?)).
>>>
>>> You need to do some IPoIB configuration relative to partitions as well.
>>> See kernel Documentation/infiniband/ipoib.txt for help with this.
>>>
>>
>> Will do.  As you say, the trick will be getting SRP to use the right
>> P_Key's... but I need to get the IB in an "up" state first.
>>
>> <snip sm output>
>>>>> Which server ?
>>>>
>>>> There's only one server... it has many ports, which I'm trying to
>>>> partition to different clients.  So, in the above, when I say "Server
>>>> A", I mean server port "A".
>>>
>>> I meant which server port is running OpenSM (which GUID is being
>>> used). I see above it is 0x24717124000029
>>
>> That was it.  I've switched to a client as the SM now, as you suggest
>> a stand-alone SM.
>
> So it's no longer a client in the ULP sense, right ?

It is just being used for SM now.

>
>>>
>>>>> You still need the default partition with the SM node being full and
>>>>> the others being limited there (so it's also best to run SM on
>>>>> separate node if possible otherwise you have the potential of any
>>>>> client connecting to it on default partition).
>>>>
>>>> Are you saying to change the partitions.conf file to:
>>>>
>>>> part1=0x1, ipoib: 0x0024717124000029=full, 0x0002c90300028c01;
>>>> part2=0x2, ipoib: 0x002471712400002a=full, 0x0002c90300026047;
>>>> part3=0x3, ipoib: 0x0024717127000035=full, 0x0002c90300026053;
>>>> part4=0x4, ipoib: 0x0024717127000036=full, 0x0002c9030002603b;
>>>
>>> That's part of it.
>>>
>>>> ... (which still doesn't work) in which case I set all the server's
>>>> ports to "full", or should just one be "full" (which didn't work
>>>> either)?
>>>
>>> You also need:
>>> Default=0x7fff: ALL, SELF=FULL;
>>> I would put that first.
>>
>> So, now my /etc/ofed/partitions.conf file looks like:
>>
>> Default=0x7fff: ALL, SELF=FULL;
>> part1=0x1, ipoib: 0x0002c903000292af=full, 0x0002c90300028c01;
>> part2=0x2, ipoib: 0x0002c903000292b0=full, 0x0002c90300026047;
>> part4=0x4, ipoib: 0x0024717124000029=full, 0x0002c9030002603b;
>
>> ... I pulled out the node on partition 3 to use as an SM exclusive
>> node, I also changed the server ports to some of the other IB ports on
>> that machine (port GUIDs as shown by ibstat).  I set the server port
>> GUID's to "full", as I want the client GUIDs to talk to it, but not
>> necessarily each other (as there is only one client GUID on each
>> partition now, it's a moot point).
>>
>> Note that I made-up the partition P_Key's of 1, 2, and 4.
>
> This all looks/sounds fine to me.

:(

>
>> Note that it still doesn't work.  On the stand-alone SM, ibstat looks like:
>>
>> # ibstat
>> CA 'mlx4_0'
>>        CA type: MT26428
>>        Number of ports: 2
>>        Firmware version: 2.6.0
>>        Hardware version: a0
>>        Node GUID: 0x0002c90300026052
>>        System image GUID: 0x0002c90300026055
>>        Port 1:
>>                State: Armed
>>                Physical state: LinkUp
>>                Rate: 10
>>                Base lid: 1
>>                LMC: 0
>>                SM lid: 1
>>                Capability mask: 0x0251086a
>>                Port GUID: 0x0002c90300026053
>>        Port 2:
>>                State: Down
>>                Physical state: Polling
>>                Rate: 10
>>                Base lid: 0
>>                LMC: 0
>>                SM lid: 0
>>                Capability mask: 0x02510868
>>                Port GUID: 0x0002c90300026054
>
> What's at the other end of port 1 ? Would you do smpquery portinfo for
> this HCA port and its peer port ?
>
>> ... On the server, the devices mentioned in the partitions file look like:
>>
>> CA 'mlx4_0'
>>        CA type: MT25418
>>        Number of ports: 2
>>        Firmware version: 2.6.0
>>        Hardware version: a0
>>        Node GUID: 0x0024717124000028
>>        System image GUID: 0x002471712400002b
>>        Port 1:
>>                State: Initializing
>>                Physical state: LinkUp
>>                Rate: 10
>>                Base lid: 0
>>                LMC: 0
>>                SM lid: 0
>>                Capability mask: 0x02510868
>>                Port GUID: 0x0024717124000029
>>        Port 2:
>>                State: Initializing
>>                Physical state: LinkUp
>>                Rate: 10
>>                Base lid: 0
>>                LMC: 0
>>                SM lid: 0
>>                Capability mask: 0x02510868
>>                Port GUID: 0x002471712400002a
>> CA 'mlx4_1'
>>        CA type: MT26428
>>        Number of ports: 2
>>        Firmware version: 2.6.0
>>        Hardware version: a0
>>        Node GUID: 0x0002c903000292ae
>>        System image GUID: 0x0002c903000292b1
>>        Port 1:
>>                State: Initializing
>>                Physical state: LinkUp
>>                Rate: 10
>>                Base lid: 0
>>                LMC: 0
>>                SM lid: 0
>>                Capability mask: 0x02510868
>>                Port GUID: 0x0002c903000292af
>>        Port 2:
>>                State: Initializing
>>                Physical state: LinkUp
>>                Rate: 10
>>                Base lid: 0
>>                LMC: 0
>>                SM lid: 0
>>                Capability mask: 0x02510868
>>                Port GUID: 0x0002c903000292b0
>
> So no SM initialization is occurring there since they are still just in Init.

Correct.  But, the SM is running.

>
>> On one of the clients:
>>
>> # ibstat
>> CA 'mlx4_0'
>>        CA type: MT26428
>>        Number of ports: 2
>>        Firmware version: 2.6.0
>>        Hardware version: a0
>>        Node GUID: 0x0002c90300026046
>>        System image GUID: 0x0002c90300026049
>>        Port 1:
>>                State: Initializing
>>                Physical state: LinkUp
>>                Rate: 10
>>                Base lid: 7
>>                LMC: 0
>>                SM lid: 1
>>                Capability mask: 0x02510868
>>                Port GUID: 0x0002c90300026047
>>        Port 2:
>>                State: Down
>>                Physical state: Polling
>>                Rate: 10
>>                Base lid: 0
>>                LMC: 0
>>                SM lid: 0
>>                Capability mask: 0x02510868
>>                Port GUID: 0x0002c90300026048
>
> Ditto. Down means it's likely a port that is not connected.
>
>> Partition "part2" with P_Key=2 should connect this client's port 0 to
>> the server on port 1 of mlx4_1
>
> Do you really mean port 0 ?

Nope... in this case I have 0x0002c903000292b0 in part2 in my
partitions file, which is port 1, the second port of the adapter.  I'm
hoping to use both ports of all adapters on the server.

>
>>>
>>>> I did have a difficult time understanding the difference between
>>>> "full" and "limited" in the man page.
>>>
>>> On a given partition, full can talk with all other members whereas a
>>> limited member can only talk with full members (not other limited
>>> members).
>>>
>>
>> I think I've got that correctly specified in the above partitions file.
>>
>>>> I've got a captive network, so I don't want any paths I've not
>>>> specified to be allowed.  If that makes any sense.  So, I didn't want
>>>> to put a statement in like:
>>>>
>>>> Default=0x7fff,ipoib:ALL=full;
>>>>
>>>> ... that would let a rogue node slip through the cracks.
>>>
>>> The only one they can talk with is the SM (the way I'm proposing) so
>>> it's best if the SM node could be separate.
>>
>> It's separate now.  The log looks like (in its entirety at startup):
>>
>> Apr 13 13:41:56 182699 [1D71CA30] 0x03 -> OpenSM 3.2.5_20081207
>> Apr 13 13:41:56 182764 [1D71CA30] 0x80 -> OpenSM 3.2.5_20081207
>> Apr 13 13:41:56 183020 [1D71CA30] 0x02 -> osm_vendor_init: 1000 pending umads specified
>> Apr 13 13:41:56 183104 [1D71CA30] 0x80 -> Entering DISCOVERING state
>> Apr 13 13:41:56 193181 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300026053
>> Apr 13 13:41:56 217349 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300026053
>> Apr 13 13:41:57 018570 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x110000123b) -- dropping
>> Apr 13 13:41:57 018586 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
>> Apr 13 13:41:57 018603 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
>>                                Initial path = 0,0
>>                                Return path  = 0,0
>> Apr 13 13:41:57 018608 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
>> Apr 13 13:41:57 018626 [47FCE940] 0x01 -> SMP dump:
>>                                base_ver................0x1
>>                                mgmt_class..............0x81
>>                                class_ver...............0x1
>>                                method..................0x1 (SubnGet)
>>                                D bit...................0x0
>>                                status..................0x0
>>                                hop_ptr.................0x0
>>                                hop_count...............0x1
>>                                trans_id................0x123b
>>                                attr_id.................0x11 (NodeInfo)
>>                                resv....................0x0
>>                                attr_mod................0x0
>>                                m_key...................0x0000000000000000
>>                                dr_slid.................65535
>>                                dr_dlid.................65535
>>
>>                                Initial path: 0,1
>>                                Return path:  0,0
>>                                Reserved:     [0][0][0][0][0][0][0]
>>
>>                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>>
>>                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
>
> This is the first level problem. Some SMA is not responding to a
> NodeInfo query from the SM. Whatever is the next hop from the SM port
> appears not to be responding. You may need to reboot that device or
> otherwise reset it to see if this clears this issue.

After power-cycling the switch, the ports went "active"!  Note that I
didn't restart the SM... I just left it running.
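
To confirm that every linked port really has left Init/Armed, the ibstat
output can be filtered rather than read by eye. This is only a rough
sketch: the function name not_active_ports is made up for illustration,
and it assumes ibstat prints the same "State:" / "Physical state:" layout
shown earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of OFED): reads ibstat-style text on stdin
# and prints each port whose physical link is LinkUp but whose logical
# state is anything other than Active (e.g. Initializing or Armed).
not_active_ports() {
    awk '
        /^CA /           { ca = $2 }                      # e.g. CA mlx4_0 (quotes kept)
        /Port [0-9]+:/   { port = $2; sub(":", "", port) } # "Port 1:" -> 1
        /State:/         { state = $2 }                   # logical state line
        /Physical state:/ {
            if ($3 == "LinkUp" && state != "Active")
                print ca, "port", port, state
        }
    '
}

# Usage (on a node with the diagnostics installed):
#   ibstat | not_active_ports
```

No output means every port with a physical link is Active; ports that are
Down/Polling (nothing plugged in) are deliberately ignored.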

So, on one client... the one corresponding to "part2" in the
partitions file, I put the P_Key into the "create child":

echo 0x2 > /sys/class/net/ib0/create_child

... and did likewise on the host, for ib3 (the second port on the
second adapter):

echo 0x2 > /sys/class/net/ib3/create_child

Still, no ping (the interfaces are set up correctly).
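
One thing that may be worth checking here: the example in ipoib.txt writes
0x8001 to create_child and gets an interface named ib0.8001, i.e. the
P_Key with the high (full-membership) bit set. The sketch below only does
that arithmetic for the 0x2 used above; full_pkey is a made-up helper
name, and whether the driver sets the bit on its own for a bare 0x2 is an
assumption that should be verified against the "pkey" file of the child
interface.

```shell
#!/bin/sh
# Hypothetical helper: turn a 15-bit base P_Key into its full-membership
# form by setting the top bit (0x8000).
full_pkey() {
    printf '0x%04x\n' $(( $1 | 0x8000 ))
}

full_pkey 0x2    # prints 0x8002

# On a real node (names assumed, check what create_child actually made):
#   cat /sys/class/net/ib0.8002/pkey
```

If the child interface shows up under a different name or P_Key than
expected, that would be the first place to look before digging into SRP.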

Thanks,

Chris
<snip>


