[Users] Troubles with ibsim

Hal Rosenstock hal.rosenstock at gmail.com
Thu Feb 15 14:27:27 PST 2018


The option I mentioned is only in MLNX OpenSM (and not in upstream OpenSM),
so it is a no-op on upstream OpenSM. It was intended to bypass the routine that
was indicating the failure (osm_hm_set_by_physp).

Anyhow, glad this got you past the roadblock.

On Thu, Feb 15, 2018 at 5:17 PM, Tim Miller <btmiller at helix.nih.gov> wrote:

> Hi Hal,
>
> Thanks - oddly enough, when I switched to ibsim 0.7 compiled from git and
> opensm from the latest vanilla openfabrics.org release, everything seems
> to work fine (opensm reports "SUBNET UP", and the simulated switches appear
> to have valid LFTs). Could this default have changed between the two
> versions of opensm?
>
> Regards,
> Tim
>
> On 02/15/2018 05:13 PM, Hal Rosenstock wrote:
>
>> In opensm.conf, try setting
>>
>> hm_unhealthy_ports_checks FALSE
>>
>> it defaults to TRUE
>>
>>
>> On Thu, Feb 15, 2018 at 4:22 PM, Tim Miller <btmiller at helix.nih.gov> wrote:
>>
>>     Hi Hal,
>>
>>     I'd have to see about sending you the whole ibnetdiscover file. I
>>     could probably strip down the topology for debugging. We're
>>     running MOFED 4.2-1.0.0.
>>
>>     Here's the result of a dump of that switch GUID from the
>>     simulator console; it appears to consider port 0 to be
>>     the SMA port:
>>
>>     sim> dump "S-7cfe900300b49890"
>>     # Net status - Thu Feb 15 16:04:23 2018
>>
>>     Switch 36 "S-7cfe900300b49890"  nodeguid 7cfe900300b49890 sysimgguid 7cfe900300b49890
>>     #       linearcap 30720 FDBtop 0 portchange 1
>>     7cfe900300b49890        [0]     "Sma Port"[0]    lid 1309 lmc 0 smlid 0  4x  2.5G Active/LinkUp
>>     7cfe900300b49890        [1]     "H-24be05ffffa8e4c0"[1]   4x FDR10 Init/LinkUp
>>     7cfe900300b49890        [2]     "H-24be05ffffa8f410"[1]   4x FDR10 Init/LinkUp
>>     7cfe900300b49890        [3]     "H-24be05ffffa82580"[1]   4x FDR10 Init/LinkUp
>>     ...
>>     7cfe900300b49890        [36]    "H-24be05ffffa85500"[1]   4x FDR10 Init/LinkUp
>>     #  dumped 1 nodes
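The following minimal Python sketch can help scan a large dump like this one for ports that OpenSM never activated. It assumes the `dump` port-line format shown above: the GUID, then `[port]`, then the quoted peer name with `[peer port]`, and finally a trailing `LogicalState/PhysicalState` field. The helper names `port_states` and `ports_not_active` are hypothetical and are not part of ibsim. External ports left in Init are consistent with OpenSM never completing its sweep, because the SM is what moves ports on to Armed and Active.

```python
import re

# Hypothetical parser for port lines in the ibsim "dump" format quoted above:
#   <guid> [<port>] "<peer name>"[<peer port>] ... <LogicalState>/<PhysState>
PORT_RE = re.compile(
    r'^\s*[0-9a-f]{16}\s+\[(?P<port>\d+)\]\s+'
    r'"(?P<peer>[^"]*)"\[\d+\].*?'
    r'(?P<state>\w+)/\w+\s*$'
)


def port_states(dump_text):
    """Map port number -> (peer name, logical state) for each port line."""
    states = {}
    for line in dump_text.splitlines():
        m = PORT_RE.match(line)
        if m:  # header and comment lines do not match and are skipped
            states[int(m.group("port"))] = (m.group("peer"), m.group("state"))
    return states


def ports_not_active(states):
    """External ports (port 0 is the switch's SMA port) not yet Active."""
    return sorted(p for p, (_, state) in states.items()
                  if p != 0 and state != "Active")
```

Applied to the dump above, `ports_not_active` would list every external port from 1 through 36, matching the observation that only the SMA port is Active.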
>>
>>     Regards,
>>     Tim
>>
>>     On 02/15/2018 04:01 PM, Hal Rosenstock wrote:
>>
>>         Are you sure you're using MLNX ibsim with MLNX OpenSM? It
>>         looks to me like MLNX ibsim supports 0xff17, so the message
>>         "process_packet: no one to handle pkt: class 0x81, attr
>>         0xff17" shouldn't appear.
>>
>>         Can you send me your ibnetdiscover file that is used as input
>>         to ibsim? Maybe the real problem is:
>>         osm_hm_set_by_physp: Remote port of 0x7cfe900300b49890[0]
>>         couldn't be found
>>         That refers to the remote port of some switch's port 0, which
>>         looks odd to me, as switch port 0 has no peer port and OpenSM
>>         shouldn't be looking for one. Which MLNX OpenSM version?
>>
>>
>>         On Thu, Feb 15, 2018 at 3:50 PM, Tim Miller
>>         <btmiller at helix.nih.gov> wrote:
>>
>>             Hi Hal,
>>
>>             Thanks for looking into this. You're indeed correct that I'm using
>>             an MLNX OFED ibsim (and opensm for that matter). I could try
>>             running both from a vanilla OpenFabrics release and see if I have
>>             any better luck; let me go ahead and try that...
>>
>>             Regards,
>>             Tim
>>
>>             On 02/15/2018 03:39 PM, Hal Rosenstock wrote:
>>
>>                 Just checked. MLNX OFED ibsim supports 0xff17 attribute.
>>
>>                 On Thu, Feb 15, 2018 at 3:36 PM, Hal Rosenstock
>>                 <hal.rosenstock at gmail.com> wrote:
>>
>>                     Hi Tim,
>>
>>                     Attribute ID 0xff17 is in the vendor-specific range for SM
>>                     attributes and is not supported by (at least) the upstream
>>                     ibsim.
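The following small Python sketch makes the class/attribute pair in the simulator warning (`class 0x81, attr 0xff17`) easier to read. It relies on the IBA convention that management class 0x01 is LID-routed subnet management and 0x81 is directed-route subnet management. As an assumption, it treats SM attribute IDs 0xFF00-0xFFFF as the vendor-specific range. `describe_smp` is a hypothetical helper and is not part of ibsim or OpenSM.

```python
# Subnet management class values (IBA): 0x01 = LID-routed, 0x81 = directed route.
MGMT_CLASSES = {
    0x01: "SubnMgmt LID-routed",
    0x81: "SubnMgmt directed-route",
}

# Assumption: SM attribute IDs 0xFF00-0xFFFF are reserved for vendor-specific use.
VENDOR_SM_ATTRS = range(0xFF00, 0x10000)


def describe_smp(mgmt_class: int, attr_id: int) -> str:
    """Hypothetical helper: label an SMP's management class and attribute ID."""
    cls = MGMT_CLASSES.get(mgmt_class, f"class 0x{mgmt_class:02x}")
    kind = "vendor-specific" if attr_id in VENDOR_SM_ATTRS else "not vendor-specific"
    return f"{cls}, attr 0x{attr_id:04X} ({kind})"


print(describe_smp(0x81, 0xFF17))
# -> SubnMgmt directed-route, attr 0xFF17 (vendor-specific)
```

A simulator that only implements the standard SM attributes has no handler for an ID in the vendor range, which fits the "no one to handle pkt" warning.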
>>
>>                     I think you are using MLNX OpenSM rather than upstream or OFED
>>                     OpenSM with the upstream ibsim. I'm not sure if MLNX ibsim
>>                     supports the additional vendor specific SM attributes or not.
>>
>>                     Can you work with upstream or OFED OpenSM, or only with MLNX
>>                     OpenSM? If only MLNX OpenSM, I'll try to find out whether the
>>                     MLNX OFED ibsim supports the additional attributes needed to
>>                     run MLNX OpenSM.
>>
>>                     -- Hal
>>
>>
>>                     On Thu, Feb 15, 2018 at 11:50 AM, Tim Miller
>>                     <btmiller at helix.nih.gov> wrote:
>>
>>                         I am attempting to use ibsim to test some possible
>>                         configuration changes in our routing, but I am running into
>>                         some difficulties. I can get the simulator started, but opensm
>>                         fails to discover the fabric in the simulated environment. It
>>                         discovers the switch to which the host running opensm is
>>                         connected, but it can't discover any further than that. In the
>>                         opensm log, I see:
>>
>>                         Feb 14 16:31:50 047307 [AD332700] 0x04 -> ni_rcv_process_new: Discovered new Switch node,
>>                           GUID 0x7cfe900300b49890, TID 0x1239
>>                         Feb 14 16:31:50 047821 [AD533700] 0x04 -> nd_rcv_process_nd: Node 0x7cfe900300b49890
>>                           Description = SwitchIB Mellanox Technologies
>>                         Feb 14 16:31:50 047847 [B5974700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
>>                                                 Method 0x1, Attr 0xFF17, TID 0x123b
>>                         Feb 14 16:31:50 047866 [B5974700] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,1, Return path  = 0,0
>>                         Feb 14 16:31:50 047893 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(GeneralInfo), attr_mod 0x4, TID 0x123b
>>                         Feb 14 16:31:50 047913 [B5974700] 0x04 -> osm_hm_set_by_physp: Remote port of 0x7cfe900300b49890[0] couldn't be found
>>                         Feb 14 16:31:50 047921 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0xFF17 (GeneralInfo); Possible mis-set mkey?
>>                         Feb 14 16:31:50 047927 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: Error during initialization: got General Info time out from node 0x7cfe900300b49890
>>
>>                         And in the simulator console, I see messages of the form:
>>
>>                         ibwarn: [32331] process_packet: no one to handle pkt: class 0x81, attr 0xff17
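When comparing the two sides of a problem like this, it can help to pull the failing attribute IDs out of both the OpenSM log and the simulator console and then intersect them. The following is a minimal Python sketch. It assumes the log and warning formats quoted above, with the email line wrapping removed. The regexes and function names are illustrative and are not part of OpenSM or ibsim.

```python
import re

# Assumed formats, taken from the messages quoted above:
#   ... 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0xFF17 ...
#   ibwarn: [32331] process_packet: no one to handle pkt: class 0x81, attr 0xff17
ERR_RE = re.compile(r"->\s*(?P<func>\w+):\s*ERR\s+(?P<code>[0-9A-Fa-f]{4}):")
TIMEOUT_RE = re.compile(r"Timeout while getting attribute 0x(?P<attr>[0-9A-Fa-f]+)")
UNHANDLED_RE = re.compile(
    r"no one to handle pkt:\s*class 0x(?P<cls>[0-9A-Fa-f]+),\s*attr 0x(?P<attr>[0-9A-Fa-f]+)"
)


def opensm_errors(log_text):
    """(function, ERR code) pairs in the order they appear in the OpenSM log."""
    return [(m["func"], m["code"]) for m in ERR_RE.finditer(log_text)]


def timed_out_attrs(log_text):
    """Attribute IDs that OpenSM reports timing out on."""
    return {int(m["attr"], 16) for m in TIMEOUT_RE.finditer(log_text)}


def unhandled_by_sim(sim_text):
    """(class, attribute) pairs the simulator had no handler for."""
    return {(int(m["cls"], 16), int(m["attr"], 16))
            for m in UNHANDLED_RE.finditer(sim_text)}


def unanswered_attrs(log_text, sim_text):
    """Attributes that timed out in OpenSM and were also unhandled by ibsim."""
    sim_attrs = {attr for _, attr in unhandled_by_sim(sim_text)}
    return timed_out_attrs(log_text) & sim_attrs
```

For the excerpts in this message, the intersection is just 0xFF17 (GeneralInfo), which points at a simulator/SM attribute mismatch rather than an M_Key problem.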
>>
>>                         The output of the "dump" command within the console shows
>>                         that all ports are in Init/LinkUp except for the SMA port,
>>                         which is in state Active/LinkUp.
>>
>>                         Does anyone have any idea what I might be doing wrong here?
>>
>>                         Thanks,
>>                         Tim
>>
>>                         --
>>                         Tim Miller
>>                         NIH HPC systems staff
>>                         301-827-5261
>>                         https://hpc.nih.gov
>>
>>                         _______________________________________________
>>                         Users mailing list
>>                         Users at lists.openfabrics.org
>>                         http://lists.openfabrics.org/mailman/listinfo/users
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>

