Re: [ofa-general] Any easy way to specify to the SM to route/zone?

Hal Rosenstock hal.rosenstock at gmail.com
Mon Apr 13 11:52:08 PDT 2009


On Mon, Apr 13, 2009 at 2:26 PM, Chris Worley <worleys at gmail.com> wrote:
> On Mon, Apr 13, 2009 at 11:53 AM, Hal Rosenstock
> <hal.rosenstock at gmail.com> wrote:
>> On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley <worleys at gmail.com> wrote:
>>> On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>>>> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley <worleys at gmail.com> wrote:
>>>>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock
>>>>> <hal.rosenstock at gmail.com> wrote:
>>>>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley <worleys at gmail.com> wrote:
>>>>>>>
>>>>>>> So I need to tell the SM to route specific ports on the server/target
>>>>>>> to specific clients/initiators.
>>>>>>>
>>>>>>> Is there any way to do this?
>>>>>>
>>>>>> Do you mean restrict access between certain clients/servers ?
>>>>>
>>>>> One server w/ 4QDR boards, 16 clients with one QDR board.  I want each
>>>>> port on the server routed/zoned to two clients.
>>>>>
>>>>>> If so,
>>>>>> you can do this with partitioning
>>>>>
>>>>> What is partitioning?
>>>>
>>>> A partition is a collection of ports which are allowed to communicate
>>>> together. There are two forms of members: full members which can talk
>>>> to any other member (useful for servers) and limited members which can
>>>> only talk to full members (useful for clients). See the opensm man
>>>> page or partition-config.txt on setting this up for OpenSM.
>>>>
>>>
>>> Let me see if I understand this with a simple example... my port GUIDs
>>> (as reported by ibstat) are for one server (4 QDR ports) and four
>>> clients (one QDR port each):
>>>
>>>
>>> Server A:           Port GUID: 0x0024717124000029
>>> Server B:           Port GUID: 0x002471712400002a
>>> Server C:           Port GUID: 0x0024717127000035
>>> Server D:           Port GUID: 0x0024717127000036
>>>
>>> Client 1:                Port GUID: 0x0002c90300028c01
>>> Client 2:                Port GUID: 0x0002c90300026047
>>> Client 3:                Port GUID: 0x0002c90300026053
>>> Client 4:                Port GUID: 0x0002c9030002603b
>>>
>>> Assuming I want a 1:1 (one server port to one client) partitioning, I
>>> would put the following in /etc/ofed/partitions.conf:
>>>
>>> part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01;
>>> part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047;
>>> part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053;
>>> part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b;
>>
>> So you want IPoIB.
>
> I'm doing SRP, so I need IPoIB working.

SRP needs to query PathRecord with the correct P_Key and use the
correct P_Key index for that partition. I'm not sure how that is done
in SRP, but first IPoIB needs to be made to work (again).

>>
>>> ... and run w/:
>>>
>>> opensm -r -B -P/etc/ofed/partitions.conf

Also, do you need to use -r (reassign LIDs)? It's better not to.

>>> Does that sound correct?  It doesn't work
>>
>> What application(s) aren't working ?
>
> ping over IPoIB, for example.
>
> I am seeing the test node in an "initializing" state right now... I
> thought it was "up" before.

Yes, this has gone "backwards" (not as far along yet...)

>> Any SM error messages ?
>
> The server has one klogd error coming out continuously:
>
> ib0: multicast join failed for
> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22

The IPoIB broadcast group (on the default partition) can't be joined.
I'm presuming that is due to the current partition setup (i.e., it
worked prior to this, right?).

You need to do some IPoIB configuration relative to partitions as well.
See kernel Documentation/infiniband/ipoib.txt for help with this.
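
As a rough illustration of what that configuration involves: the kernel
document describes creating a per-partition child interface by writing a
P_Key to the parent interface's create_child file in sysfs. The sketch
below only computes the values involved and does not touch the system.
The helper name ipoib_child is made up for this example, and it assumes
the full-membership bit (0x8000) is set in the P_Key used for the child
interface.

```python
def ipoib_child(parent: str, partition: int):
    """Return (pkey, child interface name, sysfs file) for a partition.

    Assumption: the child interface uses the P_Key with the
    full-membership bit (0x8000) set, and is named <parent>.<pkey hex>.
    """
    pkey = (partition & 0x7FFF) | 0x8000
    name = f"{parent}.{pkey:04x}"
    sysfs_file = f"/sys/class/net/{parent}/create_child"
    return pkey, name, sysfs_file


if __name__ == "__main__":
    pkey, name, sysfs_file = ipoib_child("ib0", 0x1)
    # Shows the value one would write and where, without writing it.
    print(f"write {pkey:#06x} to {sysfs_file} to get interface {name}")
```

For partition 0x1 on ib0 this yields P_Key 0x8001 and an interface named
ib0.8001; the IP address for that partition would then be configured on
the child interface rather than on ib0 itself.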

> OpenSM is seeing "lid out of range", "send completed with error",
> "Failed to find source physical port for trap"
> Opensm's log looks like:
>
> Apr 13 12:03:43 556996 [21085350] 0x03 -> OpenSM 3.2.2
> Apr 13 12:03:43 557061 [21085350] 0x80 -> OpenSM 3.2.2
> Apr 13 12:03:43 557556 [21085350] 0x02 -> osm_vendor_init: 1000
> pending umads specified
> Apr 13 12:03:43 557659 [21085350] 0x80 -> Entering DISCOVERING state
> Apr 13 12:03:43 605573 [21085350] 0x02 -> osm_vendor_bind: Binding to
> port 0x24717124000029
> Apr 13 12:03:43 636142 [21085350] 0x02 -> osm_vendor_bind: Binding to
> port 0x24717124000029
> Apr 13 12:03:44 437076 [4863C940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x520000123b)
> -- dropping
> Apr 13 12:03:44 437104 [4863C940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 12:03:44 437126 [4863C940] 0x01 -> Received SMP on a 1 hop path:
>                                Initial path = 0,0
>                                Return path  = 0,0
> Apr 13 12:03:44 437135 [4863C940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 12:03:44 437179 [4863C940] 0x01 -> SMP dump:
>                                base_ver................0x1
>                                mgmt_class..............0x81
>                                class_ver...............0x1
>                                method..................0x1 (SubnGet)
>                                D bit...................0x0
>                                status..................0x0
>                                hop_ptr.................0x0
>                                hop_count...............0x1
>                                trans_id................0x123b
>                                attr_id.................0x11 (NodeInfo)
>                                resv....................0x0
>                                attr_mod................0x0
>                                m_key...................0x0000000000000000
>                                dr_slid.................65535
>                                dr_dlid.................65535
>
>                                Initial path: 0,1
>                                Return path:  0,0
>                                Reserved:     [0][0][0][0][0][0][0]
>
>                                00 00 00 00 00 00 00 00   00 00 00 00
> 00 00 00 00
>
>                                00 00 00 00 00 00 00 00   00 00 00 00
> 00 00 00 00
>
>                                00 00 00 00 00 00 00 00   00 00 00 00
> 00 00 00 00
>
>                                00 00 00 00 00 00 00 00   00 00 00 00
> 00 00 00 00
>
> Apr 13 12:03:44 437218 [47C3B940] 0x80 -> Entering MASTER state
> Apr 13 12:03:44 437409 [47C3B940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0
> GID:0xfe80000000000000,0x0024717124000029
> Apr 13 12:03:44 437458 [47C3B940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0
> GID:0xfe80000000000000,0x0024717124000029
> Apr 13 12:03:44 437514 [47C3B940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0
> GID:0xfe80000000000000,0x0024717124000029
> Apr 13 12:03:44 437558 [47C3B940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0
> GID:0xfe80000000000000,0x0024717124000029
> Apr 13 12:03:44 437612 [47C3B940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0
> GID:0xfe80000000000000,0x0024717124000029
> Apr 13 12:03:44 437653 [47C3B940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0
> GID:0xfe80000000000000,0x0024717124000029
> Apr 13 12:03:44 437707 [47C3B940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0
> GID:0xfe80000000000000,0x0024717124000029
> Apr 13 12:03:44 437748 [47C3B940] 0x02 -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0
> GID:0xfe80000000000000,0x0024717124000029
> Apr 13 12:03:44 443077 [47C3B940] 0x80 -> SUBNET UP
> Apr 13 12:03:44 891932 [42232940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:03:44 891951 [42232940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:03:44 891959 [42232940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:03:45 184124 [44035940] 0x01 -> __osm_mcmr_rcv_join_mgrp:
> ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> 0xff12401bffff0000 : 0x00000000ffffffff from port
> 0x0024717124000029 (MT25408)
>
> ...
>
> Apr 13 12:04:04 852289 [43634940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:04 852306 [43634940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:04 852314 [43634940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:04 852363 [43634940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3804: Received trap 20 times
> consecutively
> Apr 13 12:04:05 850307 [44035940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:05 850327 [44035940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:05 850334 [44035940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:06 848327 [44A36940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:06 848340 [44A36940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:06 848348 [44A36940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:07 846349 [45437940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:07 846365 [45437940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:07 846373 [45437940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:08 844372 [45E38940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:08 844391 [45E38940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:08 844398 [45E38940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:09 842394 [46839940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:09 842414 [46839940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:09 842421 [46839940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:10 840400 [42232940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:10 840414 [42232940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:10 840421 [42232940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:11 838419 [42C33940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:11 838432 [42C33940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:11 838440 [42C33940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:12 836435 [43634940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:12 836467 [43634940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:12 836476 [43634940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:13 834459 [45437940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:04:13 834479 [45437940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:04:13 834487 [45437940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:04:14 364185 [4863C940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x5200001266)
> -- dropping
> Apr 13 12:04:14 364211 [4863C940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
>
> ...
>
> Apr 13 12:19:51 971642 [453B6940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:19:51 971658 [453B6940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:51 971666 [453B6940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:19:52 969658 [45DB7940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190
> Apr 13 12:19:52 969671 [45DB7940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:52 969679 [45DB7940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:19:53 967681 [467B8940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:19:53 967696 [467B8940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:53 967704 [467B8940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:19:54 965697 [471B9940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190
> Apr 13 12:19:54 965710 [471B9940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:54 965717 [471B9940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:19:55 963717 [42BB2940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:19:55 963735 [42BB2940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:55 963743 [42BB2940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:19:56 961736 [435B3940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190
> Apr 13 12:19:56 961749 [435B3940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:56 961779 [435B3940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:19:57 959748 [43FB4940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:19:57 959771 [43FB4940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:57 959779 [43FB4940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:19:58 957770 [449B5940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190
> Apr 13 12:19:58 957788 [449B5940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:58 957795 [449B5940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:19:59 955793 [453B6940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f
> Apr 13 12:19:59 955806 [453B6940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:19:59 955813 [453B6940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:20:00 491524 [45DB7940] 0x01 -> __osm_mcmr_rcv_join_mgrp:
> ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> 0xff12401bffff0000 : 0x00000000ffffffff from port
> 0x0024717124000029 (MT25408 IOSAN Fusion-IO)
> Apr 13 12:20:00 953808 [42BB2940] 0x01 ->
> __osm_trap_rcv_process_request: Received Generic Notice type:0x02
> num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190
> Apr 13 12:20:00 953822 [42BB2940] 0x01 -> osm_get_physp_by_mad_addr:
> ERR 7503: Lid is out of range: 10
> Apr 13 12:20:00 953830 [42BB2940] 0x01 ->
> __osm_trap_rcv_process_request: ERR 3809: Failed to find source
> physical port for trap
> Apr 13 12:20:01 424318 [48FBC940] 0x01 -> umad_receiver: ERR 5409:
> send completed with error (method=0x1 attr=0x11 trans_id=0x5500001311)
> -- dropping
> Apr 13 12:20:01 424345 [48FBC940] 0x01 -> umad_receiver: ERR 5411: DR
> SMP Hop Ptr: 0x0
> Apr 13 12:20:01 424356 [48FBC940] 0x01 -> Received SMP on a 1 hop path:
>                                Initial path = 0,0
>                                Return path  = 0,0
> Apr 13 12:20:01 424366 [48FBC940] 0x01 ->
> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
> (IB_TIMEOUT)
> Apr 13 12:20:01 424410 [48FBC940] 0x01 -> SMP dump:
>                                base_ver................0x1
>                                mgmt_class..............0x81
>                                class_ver...............0x1
>                                method..................0x1 (SubnGet)
>                                D bit...................0x0
>                                status..................0x0
>                                hop_ptr.................0x0
>                                hop_count...............0x1
>                                trans_id................0x1311
>                                attr_id.................0x11 (NodeInfo)
>                                resv....................0x0
>                                attr_mod................0x0
>                                m_key...................0x0000000000000000
>                                dr_slid.................65535
>                                dr_dlid.................65535
>
>                                Initial path: 0,1
>                                Return path:  0,0
>                                Reserved:     [0][0][0][0][0][0][0]
>
>                                00 00 00 00 00 00 00 00   00 00 00 00
> 00 00 00 00
>
>                                00 00 00 00 00 00 00 00   00 00 00 00
> 00 00 00 00
>
>                                00 00 00 00 00 00 00 00   00 00 00 00
> 00 00 00 00
>
>                                00 00 00 00 00 00 00 00   00 00 00 00
> 00 00 00 00
>
>> Any end
>> node messages pertaining to IB ?
>
> Nothing I can see.
>
>>
>>> (I restarted ib on the
>>> clients), although ibstat shows the links up.  What am I getting
>>> wrong?  The opensmd is running on the server.
>>
>> Which server ?
>
> There's only one server... it has many ports for which I'm trying to
> partition to different clients.  So, in the above, when I say "Server
> A", I mean server port "A".

I meant which server port is running OpenSM (which GUID is being
used). I see above it is 0x24717124000029.

>> You still need the default partition with the SM node being full and
>> the others being limited there (so it's also best to run SM on
>> separate node if possible otherwise you have the potential of any
>> client connecting to it on default partition).
>
> Are you saying to change the partitions.conf file to:
>
> part1=0x1, ipoib: 0x0024717124000029=full, 0x0002c90300028c01;
> part2=0x2, ipoib: 0x002471712400002a=full, 0x0002c90300026047;
> part3=0x3, ipoib: 0x0024717127000035=full, 0x0002c90300026053;
> part4=0x4, ipoib: 0x0024717127000036=full, 0x0002c9030002603b;

That's part of it.

> ... (which still doesn't work) in which case I set all the server's
> ports to "full", or should just one be "full" (which didn't work
> either)?

You also need:
Default=0x7fff: ALL, SELF=FULL;
I would put that first.
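
Put together, the file would look something like what the sketch below
prints. This is only an illustration assembled from the lines quoted in
this thread: the GUIDs are the ones from the ibstat output above, the
function name build_partition_conf is made up, and the clients are left
without an explicit membership flag exactly as in the quoted file.

```python
# (server port GUID, client port GUID) pairs from the ibstat output above.
PAIRS = [
    ("0x0024717124000029", "0x0002c90300028c01"),
    ("0x002471712400002a", "0x0002c90300026047"),
    ("0x0024717127000035", "0x0002c90300026053"),
    ("0x0024717127000036", "0x0002c9030002603b"),
]


def build_partition_conf(pairs):
    """Return partitions.conf text with the default partition first."""
    # Default partition: every port is a member, the SM port is full.
    lines = ["Default=0x7fff: ALL, SELF=FULL;"]
    # One IPoIB partition per server port / client pair.
    for n, (server, client) in enumerate(pairs, start=1):
        lines.append(f"part{n}=0x{n:x}, ipoib: {server}=full, {client};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_partition_conf(PAIRS), end="")
```

The output can be redirected to /etc/ofed/partitions.conf and passed to
opensm with -P as before.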

> I did have a difficult time understanding the difference between
> "full" and "limited" in the man page.

On a given partition, full can talk with all other members whereas a
limited member can only talk with full members (not other limited
members).
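
A minimal sketch of that rule, assuming the usual 16-bit P_Key layout in
which the top bit marks full membership and the low 15 bits identify the
partition. The function name can_communicate is made up for
illustration; it is not an OpenSM or verbs call.

```python
FULL_MEMBER_BIT = 0x8000   # top bit of the P_Key: set for full members
PARTITION_MASK = 0x7FFF    # low 15 bits: which partition the key names


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Return True if ports holding these two P_Keys may talk."""
    same_partition = (pkey_a & PARTITION_MASK) == (pkey_b & PARTITION_MASK)
    at_least_one_full = bool((pkey_a | pkey_b) & FULL_MEMBER_BIT)
    return same_partition and at_least_one_full


if __name__ == "__main__":
    server, client = 0x8001, 0x0001   # full and limited member of 0x1
    print(can_communicate(server, client))   # full <-> limited
    print(can_communicate(client, client))   # limited <-> limited
    print(can_communicate(server, 0x8002))   # different partitions
```

So with the server port as a full member and each client as a limited
member, a client can reach the server port but not another client that
happens to share the partition.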

> I've got a captive network, so I don't want any paths I've not
> specified to be allowed.  If that makes any sense.  So, I didn't want
> to put a statement in like:
>
> Default=0x7fff,ipoib:ALL=full;
>
> ... that would let a rogue node slip through the cracks.

The only one they can talk with is the SM (the way I'm proposing) so
it's best if the SM node could be separate.

In order for the SA portion of the SM to work, the SM node must be a
full member of the default partition and the other nodes must be at
least limited members (so their queries will be responded to). IPoIB is
not needed on that partition.

-- Hal

> Thanks,
>
> Chris
>>
>> -- Hal
>>
>>> Thanks,
>>>
>>> Chris
>>>
>>
>


