[Users] Troubles with ibsim

Tim Miller btmiller at helix.nih.gov
Thu Feb 15 14:17:44 PST 2018


Hi Hal,

Thanks - oddly enough, when I switched to ibsim 0.7 compiled from git 
and opensm from the latest vanilla openfabrics.org release, everything 
seems to work fine (opensm reports "SUBNET UP", and the simulated 
switches appear to have valid LFTs). Could this default have changed 
between the two versions of opensm?
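For anyone reproducing this: the defaults may differ between opensm builds, so it can help to set the `hm_unhealthy_ports_checks` option Hal suggests explicitly instead of relying on the default. Below is a minimal Python sketch for reading and overriding that option. It assumes only that opensm.conf uses one `name value` pair per line with `#` comments. The helper names `get_option` and `set_option` are hypothetical, and the sketch makes no assumption about where opensm.conf lives on a given system.

```python
import re

def get_option(conf_text, key):
    """Return the value of the first uncommented `key value` line, or None."""
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        parts = stripped.split(None, 1)
        if parts[0] == key:
            return parts[1].strip() if len(parts) > 1 else ""
    return None

def set_option(conf_text, key, value):
    """Rewrite every uncommented `key ...` line as `key value`, or append one."""
    # Anchor on the exact key followed by whitespace/end so longer keys don't match.
    pattern = re.compile(r"^\s*" + re.escape(key) + r"(\s|$)")
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = f"{key} {value}"
            replaced = True
    if not replaced:
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"
```

For example, `set_option(text, "hm_unhealthy_ports_checks", "FALSE")` turns an existing `hm_unhealthy_ports_checks TRUE` line into `FALSE` and leaves commented-out lines alone.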

Regards,
Tim

On 02/15/2018 05:13 PM, Hal Rosenstock wrote:
> In opensm.conf, try setting
>
> hm_unhealthy_ports_checks FALSE
>
> it defaults to TRUE
>
>
> On Thu, Feb 15, 2018 at 4:22 PM, Tim Miller <btmiller at helix.nih.gov> wrote:
>
>     Hi Hal,
>
>     I'd have to see about sending you the whole ibnetdiscover file. I
>     could probably strip down the topology for debugging. We're
>     running MOFED 4.2-1.0.0.
>
>     Here's the result of a dump of that switch GUID from the
>     simulator console; it appears to consider port 0 to be the
>     SMA port:
>
>     sim> dump "S-7cfe900300b49890"
>     # Net status - Thu Feb 15 16:04:23 2018
>
>     Switch 36 "S-7cfe900300b49890"  nodeguid 7cfe900300b49890 sysimgguid 7cfe900300b49890
>     #       linearcap 30720 FDBtop 0 portchange 1
>     7cfe900300b49890        [0]     "Sma Port"[0]    lid 1309 lmc 0 smlid 0  4x  2.5G Active/LinkUp
>     7cfe900300b49890        [1]     "H-24be05ffffa8e4c0"[1]   4x FDR10 Init/LinkUp
>     7cfe900300b49890        [2]     "H-24be05ffffa8f410"[1]   4x FDR10 Init/LinkUp
>     7cfe900300b49890        [3]     "H-24be05ffffa82580"[1]   4x FDR10 Init/LinkUp
>     ...
>     7cfe900300b49890        [36]    "H-24be05ffffa85500"[1]   4x FDR10 Init/LinkUp
>     #  dumped 1 nodes
>
>     Regards,
>     Tim
>
>     On 02/15/2018 04:01 PM, Hal Rosenstock wrote:
>
>         Are you sure you're using MLNX ibsim with MLNX OpenSM? It
>         looks to me like MLNX ibsim supports 0xff17, so the message
>         "process_packet: no one to handle pkt: class 0x81, attr
>         0xff17" shouldn't come out.
>
>         Can you send me the ibnetdiscover file that you use as input
>         to ibsim? Maybe the real problem is:
>         osm_hm_set_by_physp: Remote port of 0x7cfe900300b49890[0]
>         couldn't be found
>         That looks like a lookup of the remote port of switch port 0,
>         which seems odd to me: switch port 0 has no peer port, so
>         OpenSM shouldn't be looking for one. Which MLNX OpenSM
>         version are you running?
>
>
>         On Thu, Feb 15, 2018 at 3:50 PM, Tim Miller
>         <btmiller at helix.nih.gov> wrote:
>
>             Hi Hal,
>
>             Thanks for looking into this. You're indeed correct that
>             I'm using an MLNX OFED ibsim (and opensm for that matter).
>             I could try running both from a vanilla OpenFabrics release
>             and see if I have any better luck; let me go ahead and try
>             that...
>
>             Regards,
>             Tim
>
>             On 02/15/2018 03:39 PM, Hal Rosenstock wrote:
>
>                 Just checked. MLNX OFED ibsim supports the 0xff17 attribute.
>
>                 On Thu, Feb 15, 2018 at 3:36 PM, Hal Rosenstock
>                 <hal.rosenstock at gmail.com> wrote:
>
>                     Hi Tim,
>
>                     Attribute ID 0xff17 is in the vendor-specific range
>                     for SM attributes and is not supported by (at least)
>                     the upstream ibsim.
>
>                     I think you are using MLNX OpenSM, rather than
>                     upstream or OFED OpenSM, with the upstream ibsim.
>                     I'm not sure whether MLNX ibsim supports the
>                     additional vendor-specific SM attributes.
>
>                     Can you work with upstream or OFED OpenSM, or only
>                     with MLNX OpenSM? If only MLNX OpenSM, I'll try to
>                     find out whether MLNX OFED ibsim supports the
>                     additional attributes needed to run MLNX OpenSM.
>
>                     -- Hal
>
>
>                     On Thu, Feb 15, 2018 at 11:50 AM, Tim Miller
>                     <btmiller at helix.nih.gov> wrote:
>
>                         I am attempting to use ibsim to test some possible
>                         configuration changes in our routing, but I am
>                         running into some difficulties. I can get the
>                         simulator started, but opensm fails to discover
>                         the fabric in the simulated environment. It
>                         discovers the switch to which the host running
>                         opensm is connected, but it can't discover any
>                         further than that. In the opensm log, I see:
>
>                         Feb 14 16:31:50 047307 [AD332700] 0x04 -> ni_rcv_process_new: Discovered new Switch node,
>                           GUID 0x7cfe900300b49890, TID 0x1239
>                         Feb 14 16:31:50 047821 [AD533700] 0x04 -> nd_rcv_process_nd: Node 0x7cfe900300b49890
>                           Description = SwitchIB Mellanox Technologies
>                         Feb 14 16:31:50 047847 [B5974700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
>                                                 Method 0x1, Attr 0xFF17, TID 0x123b
>                         Feb 14 16:31:50 047866 [B5974700] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,1, Return path  = 0,0
>                         Feb 14 16:31:50 047893 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(GeneralInfo), attr_mod 0x4, TID 0x123b
>                         Feb 14 16:31:50 047913 [B5974700] 0x04 -> osm_hm_set_by_physp: Remote port of 0x7cfe900300b49890[0] couldn't be found
>                         Feb 14 16:31:50 047921 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0xFF17 (GeneralInfo); Possible mis-set mkey?
>                         Feb 14 16:31:50 047927 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: Error during initialization: got General Info time out from node 0x7cfe900300b49890
>
>                         And in the simulator console, I see messages of
>                         the form:
>
>                         ibwarn: [32331] process_packet: no one to handle pkt: class 0x81, attr 0xff17
>
>                         Looking at the output of the "dump" command from
>                         within the console, it shows that all ports are in
>                         Init/LinkUp, except for the SMA port, which is in
>                         state Active/LinkUp.
>
>                         Does anyone have any idea what I might be doing
>                         wrong here?
>
>                         Thanks,
>                         Tim
>
>                         -- 
>                         Tim Miller
>                         NIH HPC systems staff
>                         301-827-5261
>                         https://hpc.nih.gov
>
>                         _______________________________________________
>                         Users mailing list
>                         Users at lists.openfabrics.org
>                         http://lists.openfabrics.org/mailman/listinfo/users
>
>
>
>
>
>

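The symptom described in this thread is that every external switch port sits in Init/LinkUp: the link is trained, but the SM never moved the port to Active. Only the SMA port is Active. That state can be checked mechanically from the console `dump` output. This is a rough Python sketch. It assumes each port line looks like the ones quoted above once the mail client's line wrapping is undone, ending in `<LogicalState>/<PhysState>`. The regex and the function names `parse_ports` and `summarize` are hypothetical and are not part of ibsim.

```python
import re
from collections import Counter

# Assumed port-line shape, taken from the dump quoted in this thread:
#   <guid>  [<port>]  "<peer>"[<peer port>] ... <LogicalState>/<PhysState>
# Header and comment lines ("Switch ...", "# ...") do not match.
PORT_LINE = re.compile(r'^\s*([0-9a-fA-F]+)\s+\[(\d+)\].*\s(\w+)/(\w+)\s*$')

def parse_ports(dump_text):
    """Return (port, logical_state, phys_state) for each port line."""
    ports = []
    for line in dump_text.splitlines():
        m = PORT_LINE.match(line)
        if m:
            ports.append((int(m.group(2)), m.group(3), m.group(4)))
    return ports

def summarize(dump_text):
    """Count logical states of external ports (port > 0) and list non-Active ones."""
    external = [p for p in parse_ports(dump_text) if p[0] != 0]
    counts = Counter(state for _, state, _ in external)
    not_active = [port for port, state, _ in external if state != "Active"]
    return counts, not_active
```

If discovery and port configuration succeed (as Tim reports with the vanilla opensm), `not_active` for each switch should be empty once the SM reports "SUBNET UP".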
-- 
Tim Miller
NIH HPC systems staff
301-827-5261
https://hpc.nih.gov
