Re: [ofa-general] Any easy way to specify to the SM to route/zone?
Hal Rosenstock
hal.rosenstock at gmail.com
Mon Apr 13 14:01:10 PDT 2009
On Mon, Apr 13, 2009 at 4:09 PM, Chris Worley <worleys at gmail.com> wrote:
> On Mon, Apr 13, 2009 at 12:52 PM, Hal Rosenstock
> <hal.rosenstock at gmail.com> wrote:
>> On Mon, Apr 13, 2009 at 2:26 PM, Chris Worley <worleys at gmail.com> wrote:
>>> On Mon, Apr 13, 2009 at 11:53 AM, Hal Rosenstock
>>> <hal.rosenstock at gmail.com> wrote:
>>>> On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley <worleys at gmail.com> wrote:
>>>>> On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>>>>>> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley <worleys at gmail.com> wrote:
>>>>>>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock
>>>>>>> <hal.rosenstock at gmail.com> wrote:
>>>>>>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley <worleys at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> So I need to tell the SM to route specific ports on the server/target
>>>>>>>>> to specific clients/initiators.
>>>>>>>>>
>>>>>>>>> Is there any way to do this?
>>>>>>>>
>>>>>>>> Do you mean restrict access between certain clients/servers ?
>>>>>>>
>>>>>>> One server w/ 4QDR boards, 16 clients with one QDR board. I want each
>>>>>>> port on the server routed/zoned to two clients.
>>>>>>>
>>>>>>>> If so,
>>>>>>>> you can do this with partitioning
>>>>>>>
>>>>>>> What is partitioning?
>>>>>>
>>>>>> A partition is a collection of ports which are allowed to communicate
>>>>>> together. There are two forms of members: full members which can talk
>>>>>> to any other member (useful for servers) and limited members which can
>>>>>> only talk to full members (useful for clients). See the opensm man
>>>>>> page or partition-config.txt on setting this up for OpenSM.
>>>>>>
>>>>>
>>>>> Let me see if I understand this with a simple example... my port GUIDs
>>>>> (as reported by ibstat) are for one server (4 QDR ports) and four
>>>>> clients (one QDR port each):
>>>>>
>>>>>
>>>>> Server A: Port GUID: 0x0024717124000029
>>>>> Server B: Port GUID: 0x002471712400002a
>>>>> Server C: Port GUID: 0x0024717127000035
>>>>> Server D: Port GUID: 0x0024717127000036
>>>>>
>>>>> Client 1: Port GUID: 0x0002c90300028c01
>>>>> Client 2: Port GUID: 0x0002c90300026047
>>>>> Client 3: Port GUID: 0x0002c90300026053
>>>>> Client 4: Port GUID: 0x0002c9030002603b
Is there a switch in between or just back to back HCA ports ?
>>>>>
>>>>> Assuming I want a 1:1 (one server port to one client) partitioning, I
>>>>> would put the following in /etc/ofed/partitions.conf:
>>>>>
>>>>> part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01;
>>>>> part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047;
>>>>> part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053;
>>>>> part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b;
>>>>
>>>> So you want IPoIB.
>>>
>>> I'm doing SRP, so I need IPoIB working.
>>
>> SRP needs to query PathRecord with the correct PKey and use the
>> correct Pkey index for that partition. I'm not sure how that is done
>> in SRP but first IPoIB needs to be made to work (again).
>>
>
> Okay... I'll setup the IPoIB as the ipoib.txt suggests, i.e.:
>
> echo 0x1 > /sys/class/net/ib0/create_child
>
> ... but for now, I'm still not seeing the state go to "up"... I think
> that's the first problem.
Yes, port state needs to be linkup/active first. I see LinkUp/Armed from below.
>>>>
>>>>> ... and run w/:
>>>>>
>>>>> opensm -r -B -P/etc/ofed/partitions.conf
>>
>> Also, do you need to use -r ? It's better not to (reassign LIDs).
>
> I'm using it to assure that it just doesn't hang on to the old state,
> especially since I'm not getting the SM working...
OK.
> I don't want it to
> assume anything is right about the previous state.
>
> I have tried w/ and w/o and don't see a difference.
>
> The plan is, once I get it working, to remove the "-r".
That's fine.
> Or, are you suggesting I not use it?
>
>>
>>>>> Does that sound correct? It doesn't work
>>>>
>>>> What application(s) aren't working ?
>>>
>>> ping over IPoIB, for example.
>>>
>>> I am seeing the test node in an "initializing" state right now... I
>>> thought it was "up" before.
>>
>> Yes, this has gone "backwards" (not as far along yet...)
>>
>
> I think getting to an "up" state is the first step.
Were the ports getting to LinkUp/Active before partitions were configured ?
>>>> Any SM error messages ?
>>>
>>> The server has one klogd error coming out continuously:
>>>
>>> ib0: multicast join failed for
>>> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
>>
>> IPoIB broadcast group (on the default partition) can't be joined (I'm
>> presuming due to the current partition setup (e.g. it worked prior to
>> this, right ?)).
>>
>> You need to do some IPoIB configuration relative to partitions as well.
>> See kernel Documentation/infiniband/ipoib.txt for help with this.
>>
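For reference, the MGID in that klogd message carries the P_Key in its third 16-bit group (the IPoIB IPv4 broadcast group layout from RFC 4391), with the high bit of the P_Key marking full membership. A minimal Python sketch that pulls those fields out; the helper name is made up for illustration:

```python
def decode_ipoib_broadcast_mgid(mgid: str) -> dict:
    """Split an IPoIB broadcast MGID (colon-separated 16-bit groups) into P_Key fields."""
    groups = mgid.split(":")
    if len(groups) != 8:
        raise ValueError(f"expected 8 groups, got {len(groups)}")
    pkey = int(groups[2], 16)          # third group holds the 16-bit P_Key
    return {
        "pkey": pkey,
        "base": pkey & 0x7FFF,         # low 15 bits: partition number
        "full_member": bool(pkey & 0x8000),  # high bit: full membership
    }


# MGID copied from the "multicast join failed" message above.
info = decode_ipoib_broadcast_mgid("ff12:401b:ffff:0000:0000:0000:ffff:ffff")
print(hex(info["pkey"]), hex(info["base"]), info["full_member"])
```

Here the group being joined is the one for the default partition (base 0x7fff), which is consistent with the join failing once the port is no longer a suitable member of that partition.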
>
> Will do. As you say, the trick will be getting SRP to use the right
> P_Key's... but I need to get the IB in an "up" state first.
>
> <snip sm output>
>>>> Which server ?
>>>
>>> There's only one server... it has many ports for which I'm trying to
>>> partition to different clients. So, in the above, when I say "Server
>>> A", I mean server port "A".
>>
>> I meant which server port is running OpenSM (which GUID is being
>> used). I see above it is 0x24717124000029
>
> That was it. I've switched to a client as the SM now, as you suggest
> a stand-alone SM.
So it's no longer a client in the ULP sense, right ?
>>
>>>> You still need the default partition with the SM node being full and
>>>> the others being limited there (so it's also best to run SM on
>>>> separate node if possible otherwise you have the potential of any
>>>> client connecting to it on default partition).
>>>
>>> Are you saying to change the partitions.conf file to:
>>>
>>> part1=0x1, ipoib: 0x0024717124000029=full, 0x0002c90300028c01;
>>> part2=0x2, ipoib: 0x002471712400002a=full, 0x0002c90300026047;
>>> part3=0x3, ipoib: 0x0024717127000035=full, 0x0002c90300026053;
>>> part4=0x4, ipoib: 0x0024717127000036=full, 0x0002c9030002603b;
>>
>> That's part of it.
>>
>>> ... (which still doesn't work) in which case I set all the server's
>>> ports to "full", or should just one be "full" (which didn't work
>>> either)?
>>
>> You also need:
>> Default=0x7fff: ALL, SELF=FULL;
>> I would put that first.
>
> So, now my /etc/ofed/partitions.conf file looks like:
>
> Default=0x7fff: ALL, SELF=FULL;
> part1=0x1, ipoib: 0x0002c903000292af=full, 0x0002c90300028c01;
> part2=0x2, ipoib: 0x0002c903000292b0=full, 0x0002c90300026047;
> part4=0x4, ipoib: 0x0024717124000029=full, 0x0002c9030002603b;
> ... I pulled out the node on partition 3 to use as an SM exclusive
> node, I also changed the server ports to some of the other IB ports on
> that machine (port GUIDs as shown by ibstat). I set the server port
> GUID's to "full", as I want the client GUIDs to talk to it, but not
> necessarily each other (as there is only one client GUID on each
> partition now, it's a moot point).
>
> Note that I made-up the partition P_Key's of 1, 2, and 4.
This all looks/sounds fine to me.
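As a sanity check on a file like this, here is a rough Python sketch that reads only the simple "name=pkey, flags : member, member;" lines used in this thread and answers "can these two port GUIDs talk?" under the full/limited rule. It is not OpenSM's parser: the function names are invented, the ALL and SELF keywords are stored as plain names and not expanded, and it assumes members default to limited unless defmember=full is given.

```python
CONF = """\
Default=0x7fff: ALL, SELF=FULL;
part1=0x1, ipoib: 0x0002c903000292af=full, 0x0002c90300028c01;
part2=0x2, ipoib: 0x0002c903000292b0=full, 0x0002c90300026047;
part4=0x4, ipoib: 0x0024717124000029=full, 0x0002c9030002603b;
"""


def parse_partitions(text):
    """Parse the simple one-line partition definitions shown in this thread."""
    partitions = {}
    for line in text.splitlines():
        line = line.strip().rstrip(";")
        if not line or line.startswith("#"):
            continue
        head, _, body = line.partition(":")
        fields = [f.strip() for f in head.split(",")]
        name, _, pkey = fields[0].partition("=")
        # Assumption: membership is limited unless defmember=full is present.
        default = "full" if "defmember=full" in fields[1:] else "limited"
        members = {}
        for item in body.split(","):
            guid, _, kind = item.strip().partition("=")
            members[guid.strip().lower()] = kind.strip().lower() or default
        partitions[name.strip()] = {"pkey": int(pkey, 16), "members": members}
    return partitions


def can_talk(partitions, a, b):
    """True if a and b share a partition in which at least one is a full member."""
    for part in partitions.values():
        m = part["members"]
        if a in m and b in m and "full" in (m[a], m[b]):
            return True
    return False


parts = parse_partitions(CONF)
print(can_talk(parts, "0x0002c903000292af", "0x0002c90300028c01"))  # server port and its client
print(can_talk(parts, "0x0002c90300028c01", "0x0002c90300026047"))  # two clients
```

GUIDs are compared as lowercase strings, so they need to be written the same way in the file and in the query.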
> Note that it still doesn't work. On the stand-alone SM, ibstat looks like:
>
> # ibstat
> CA 'mlx4_0'
> CA type: MT26428
> Number of ports: 2
> Firmware version: 2.6.0
> Hardware version: a0
> Node GUID: 0x0002c90300026052
> System image GUID: 0x0002c90300026055
> Port 1:
> State: Armed
> Physical state: LinkUp
> Rate: 10
> Base lid: 1
> LMC: 0
> SM lid: 1
> Capability mask: 0x0251086a
> Port GUID: 0x0002c90300026053
> Port 2:
> State: Down
> Physical state: Polling
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510868
> Port GUID: 0x0002c90300026054
What's at the other end of port 1 ? Would you do smpquery portinfo for
this HCA port and its peer port ?
> ... On the server, the devices mentioned in the partitions file look like:
>
> CA 'mlx4_0'
> CA type: MT25418
> Number of ports: 2
> Firmware version: 2.6.0
> Hardware version: a0
> Node GUID: 0x0024717124000028
> System image GUID: 0x002471712400002b
> Port 1:
> State: Initializing
> Physical state: LinkUp
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510868
> Port GUID: 0x0024717124000029
> Port 2:
> State: Initializing
> Physical state: LinkUp
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510868
> Port GUID: 0x002471712400002a
> CA 'mlx4_1'
> CA type: MT26428
> Number of ports: 2
> Firmware version: 2.6.0
> Hardware version: a0
> Node GUID: 0x0002c903000292ae
> System image GUID: 0x0002c903000292b1
> Port 1:
> State: Initializing
> Physical state: LinkUp
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510868
> Port GUID: 0x0002c903000292af
> Port 2:
> State: Initializing
> Physical state: LinkUp
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510868
> Port GUID: 0x0002c903000292b0
So no SM initialization is occurring there since they are still just in Init.
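If it helps to check many nodes quickly, the following Python sketch picks out ports that are physically LinkUp but have not reached Active. It only understands the ibstat text layout pasted in this thread; the function names are illustrative, not part of any IB tool.

```python
import re

SAMPLE = """\
CA 'mlx4_0'
CA type: MT26428
Number of ports: 2
Port 1:
State: Armed
Physical state: LinkUp
Port GUID: 0x0002c90300026053
Port 2:
State: Down
Physical state: Polling
Port GUID: 0x0002c90300026054
"""


def port_states(ibstat_text):
    """Collect State / Physical state per port from ibstat-style text."""
    ports, ca, current = [], None, None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            ca = m.group(1)
            continue
        m = re.match(r"Port (\d+):$", line)  # "Port GUID: ..." does not match
        if m:
            current = {"ca": ca, "port": int(m.group(1))}
            ports.append(current)
            continue
        if current is None:
            continue
        key, sep, value = line.partition(":")
        if sep and key in ("State", "Physical state"):
            current[key] = value.strip()
    return ports


def stuck_ports(ports):
    """Ports with a physical link that the SM has not brought to Active."""
    return [p for p in ports
            if p.get("Physical state") == "LinkUp" and p.get("State") != "Active"]


stuck = stuck_ports(port_states(SAMPLE))
for p in stuck:
    print(p["ca"], p["port"], p["State"])
```

A port reported as Down/Polling is skipped on purpose, since that usually just means nothing is connected.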
> On one of the clients:
>
> # ibstat
> CA 'mlx4_0'
> CA type: MT26428
> Number of ports: 2
> Firmware version: 2.6.0
> Hardware version: a0
> Node GUID: 0x0002c90300026046
> System image GUID: 0x0002c90300026049
> Port 1:
> State: Initializing
> Physical state: LinkUp
> Rate: 10
> Base lid: 7
> LMC: 0
> SM lid: 1
> Capability mask: 0x02510868
> Port GUID: 0x0002c90300026047
> Port 2:
> State: Down
> Physical state: Polling
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x02510868
> Port GUID: 0x0002c90300026048
Ditto. Down means it's likely a port that is not connected.
> Partition "part2" with P_Key=2 should connect this client's port 0 to
> the server on port 1 of mlx4_1
Do you really mean port 0 ?
>>
>>> I did have a difficult time understanding the difference between
>>> "full" and "limited" in the man page.
>>
>> On a given partition, full can talk with all other members whereas a
>> limited member can only talk with full members (not other limited
>> members).
>>
>
> I think I've got that correctly specified in the above partitions file.
>
>>> I've got a captive network, so I don't want any paths I've not
>>> specified to be allowed. If that makes any sense. So, I didn't want
>>> to put a statement in like:
>>>
>>> Default=0x7fff,ipoib:ALL=full;
>>>
>>> ... that would let a rogue node slip through the cracks.
>>
>> The only one they can talk with is the SM (the way I'm proposing) so
>> it's best if the SM node could be separate.
>
> It's separate now. The log looks like (in its entirety at startup):
>
> Apr 13 13:41:56 182699 [1D71CA30] 0x03 -> OpenSM 3.2.5_20081207
> Apr 13 13:41:56 182764 [1D71CA30] 0x80 -> OpenSM 3.2.5_20081207
> Apr 13 13:41:56 183020 [1D71CA30] 0x02 -> osm_vendor_init: 1000
> pending umads specified
> Apr 13 13:41:56 183104 [1D71CA30] 0x80 -> Entering DISCOVERING state
> Apr 13 13:41:56 193181 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to
> port 0x2c90300026053
> Apr 13 13:41:56 217349 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to
> port 0x2c90300026053
> Apr 13 13:41:57 018570 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x110000123b)
> -- dropping
> Apr 13 13:41:57 018586 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:41:57 018603 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:41:57 018608 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:41:57 018626 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x123b
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
This is the first level problem. Some SMA is not responding to a
NodeInfo query from the SM. Whatever is the next hop from the SM port
appears not to be responding. You may need to reboot that device or
otherwise reset it to see if this clears this issue.
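To see at a glance which attribute keeps timing out in a long log, a small Python sketch like the one below counts the ERR 5409 entries per attribute. The regular expression is written against the message text quoted in this mail only, the attribute name table contains just the one ID the SMP dump itself names, and the function name is made up.

```python
import re
from collections import Counter

ATTR_NAMES = {0x11: "NodeInfo"}  # name taken from the SMP dump in this thread

PATTERN = re.compile(
    r"ERR 5409:\s+send completed with error\s+"
    r"\(method=(0x[0-9a-fA-F]+)\s+attr=(0x[0-9a-fA-F]+)\s+trans_id=(0x[0-9a-fA-F]+)\)"
)

SAMPLE = """\
Apr 13 13:41:57 018570 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
send completed with error (method=0x1 attr=0x11 trans_id=0x110000123b)
-- dropping
Apr 13 13:42:06 986336 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
send completed with error (method=0x1 attr=0x11 trans_id=0x1100001242)
-- dropping
"""


def failed_sends(log_text):
    """Count failed sends per attribute; \\s+ in the pattern tolerates line wraps."""
    counts = Counter()
    for _method, attr, _trans_id in PATTERN.findall(log_text):
        attr_id = int(attr, 16)
        counts[ATTR_NAMES.get(attr_id, attr)] += 1
    return counts


counts = failed_sends(SAMPLE)
print(dict(counts))
```

If the log is pasted from a quoted mail, the leading "> " markers would have to be stripped first for the pattern to match across wrapped lines.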
-- Hal
> Apr 13 13:41:57 018681 [475CD940] 0x80 -> Entering MASTER state
> Apr 13 13:41:57 019791 [475CD940] 0x80 -> SUBNET UP
> Apr 13 13:42:06 986336 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x1100001242)
> -- dropping
> Apr 13 13:42:06 986349 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:42:06 986355 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:42:06 986360 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:42:06 986376 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x1242
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:42:06 986708 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:42:16 990103 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x1100001246)
> -- dropping
> Apr 13 13:42:16 990114 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:42:16 990120 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:42:16 990125 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:42:16 990141 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x1246
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:42:16 990475 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:42:26 990871 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x110000124a)
> -- dropping
> Apr 13 13:42:26 990884 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:42:26 990890 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:42:26 990895 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:42:26 990912 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x124a
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:42:26 991227 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:42:36 993638 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x110000124e)
> -- dropping
> Apr 13 13:42:36 993649 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:42:36 993655 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:42:36 993660 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:42:36 993676 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x124e
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:42:36 993996 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:42:46 996409 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x1100001252)
> -- dropping
> Apr 13 13:42:46 996420 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:42:46 996426 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:42:46 996431 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:42:46 996449 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x1252
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:42:46 996800 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:42:56 999180 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x1100001256)
> -- dropping
> Apr 13 13:42:56 999192 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:42:56 999198 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:42:56 999203 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:42:56 999220 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x1256
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:42:56 999553 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:43:07 001949 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x110000125a)
> -- dropping
> Apr 13 13:43:07 001963 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:43:07 001969 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:43:07 001975 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:43:07 001992 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x125a
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:43:07 002384 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:43:17 004713 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x110000125e)
> -- dropping
> Apr 13 13:43:17 004727 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:43:17 004733 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:43:17 004738 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:43:17 004755 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x125e
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:43:17 005140 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:43:27 007482 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x1100001262)
> -- dropping
> Apr 13 13:43:27 007497 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:43:27 007503 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:43:27 007508 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:43:27 007524 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x1262
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:43:27 007958 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:43:37 010250 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x1100001266)
> -- dropping
> Apr 13 13:43:37 010264 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:43:37 010270 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:43:37 010275 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:43:37 010292 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x1266
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> Apr 13 13:43:37 010716 [475CD940] 0x02 -> SUBNET UP
> Apr 13 13:43:47 013017 [47FCE940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x110000126a)
> -- dropping
> Apr 13 13:43:47 013029 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 13:43:47 013035 [47FCE940] 0x01 -> Received SMP on a 1 hop path:
> Initial path = 0,0
> Return path = 0,0
> Apr 13 13:43:47 013059 [47FCE940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 13:43:47 013077 [47FCE940] 0x01 -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x1 (SubnGet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x1
> trans_id................0x126a
> attr_id.................0x11 (NodeInfo)
> resv....................0x0
> attr_mod................0x0
> m_key...................0x0000000000000000
> dr_slid.................65535
> dr_dlid.................65535
>
> Initial path: 0,1
> Return path: 0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00
>
>>
>> In order for SA portion of SM to work, SM node must be a full member
>> of the default partition and other nodes must be at least limited
>> members (so their queries will be responded to). IPoIB is not needed
>> on that partition.
>
> I think I've got the partition file specified correctly... but then
> again obviously not, as it doesn't work.
>
> Thanks,
>
> Chris
>