[Users] Troubles with ibsim
Hal Rosenstock
hal.rosenstock at gmail.com
Thu Feb 15 14:27:27 PST 2018
The option I mentioned exists only in MLNX OpenSM (not in upstream OpenSM),
so it is a no-op on upstream OpenSM. It was intended to bypass the routine that
was reporting the failure (osm_hm_set_by_physp).
Anyhow, glad this got you past the roadblock.
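
For anyone who wants to confirm whether the option is actually set in a
given opensm.conf, here is a minimal sketch in Python. It assumes the usual
"name value" line layout with # comments. The helper name read_conf_option
and the sample text are made up for illustration and are not part of OpenSM.

```python
def read_conf_option(conf_text, name):
    """Return the value given for `name` in opensm.conf-style text.

    Returns None when the option is not present. Assumes one
    "name value" pair per line and '#' starting a comment.
    """
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if parts[0] == name:
            return parts[1].strip() if len(parts) > 1 else ""
    return None


# Hypothetical config text, only for demonstration.
sample = """
# health-monitor check (honored by MLNX OpenSM only)
hm_unhealthy_ports_checks FALSE
"""

print(read_conf_option(sample, "hm_unhealthy_ports_checks"))
```

A None result just means the option is not in the file, in which case
MLNX OpenSM uses its default (TRUE) and upstream OpenSM ignores it either way.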
On Thu, Feb 15, 2018 at 5:17 PM, Tim Miller <btmiller at helix.nih.gov> wrote:
> Hi Hal,
>
> Thanks - oddly enough, when I switched to ibsim 0.7 compiled from git and
> opensm from the latest vanilla openfabrics.org release, everything seems
> to work fine (opensm reports "SUBNET UP", and the simulated switches appear
> to have valid LFTs). Could this default have changed between the two
> versions of opensm?
>
> Regards,
> Tim
>
> On 02/15/2018 05:13 PM, Hal Rosenstock wrote:
>
>> In opensm.conf, try setting
>>
>> hm_unhealthy_ports_checks FALSE
>>
>> it defaults to TRUE
>>
>>
>> On Thu, Feb 15, 2018 at 4:22 PM, Tim Miller <btmiller at helix.nih.gov> wrote:
>>
>> Hi Hal,
>>
>> I'd have to see about sending you the whole netdiscover file. I
>> could probably strip down the topology for debugging. We're
>> running MOFED 4.2-1.0.0.
>>
>> Here's the result of a dump of that switch GUID from the
>> simulator console - it appears that it's considering port 0 to be
>> the SMA port:
>>
>> sim> dump "S-7cfe900300b49890"
>> # Net status - Thu Feb 15 16:04:23 2018
>>
>> Switch 36 "S-7cfe900300b49890" nodeguid 7cfe900300b49890 sysimgguid 7cfe900300b49890
>> # linearcap 30720 FDBtop 0 portchange 1
>> 7cfe900300b49890 [0] "Sma Port"[0] lid 1309 lmc 0 smlid 0 4x 2.5G Active/LinkUp
>> 7cfe900300b49890 [1] "H-24be05ffffa8e4c0"[1] 4x FDR10 Init/LinkUp
>> 7cfe900300b49890 [2] "H-24be05ffffa8f410"[1] 4x FDR10 Init/LinkUp
>> 7cfe900300b49890 [3] "H-24be05ffffa82580"[1] 4x FDR10 Init/LinkUp
>> ...
>> 7cfe900300b49890 [36] "H-24be05ffffa85500"[1] 4x FDR10 Init/LinkUp
>> # dumped 1 nodes
>>
>> Regards,
>> Tim
>>
>> On 02/15/2018 04:01 PM, Hal Rosenstock wrote:
>>
>> Are you sure you're using MLNX ibsim with MLNX OpenSM ? It
>> looks like MLNX ibsim supports 0xff17 to me so the message
>> "process_packet: no one to handle pkt: class 0x81, attr
>> 0xff17" shouldn't come out.
>>
>> Can you send me your ibnetdiscover file that is used as input
>> to ibsim ? Maybe the real problem is:
>> osm_hm_set_by_physp: Remote port of 0x7cfe900300b49890[0]
>> couldn't be found
>> That looks like a search for the remote port of some switch's port 0,
>> which looks odd to me: switch port 0 has no peer port, so it shouldn't
>> be looking for one. Which MLNX OpenSM version?
>>
>>
>> On Thu, Feb 15, 2018 at 3:50 PM, Tim Miller <btmiller at helix.nih.gov> wrote:
>>
>> Hi Hal,
>>
>> Thanks for looking into this. You're indeed correct that I'm using
>> an MLNX OFED ibsim (and opensm for that matter). I could try
>> running both from a vanilla OpenFabrics release and see if I have
>> any better luck; let me go ahead and try that...
>>
>> Regards,
>> Tim
>>
>> On 02/15/2018 03:39 PM, Hal Rosenstock wrote:
>>
>> Just checked. MLNX OFED ibsim supports 0xff17 attribute.
>>
>> On Thu, Feb 15, 2018 at 3:36 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>>
>> Hi Tim,
>>
>> Attribute ID 0xff17 is in the vendor specific range for SM
>> attributes and not supported with (at least) the upstream ibsim.
>>
>> I think you are using MLNX OpenSM rather than upstream or OFED
>> OpenSM with the upstream ibsim. I'm not sure if MLNX ibsim
>> supports the additional vendor specific SM attributes or not.
>>
>> Can you work with some upstream or OFED OpenSM, or only MLNX OpenSM?
>> If not, I'll try to find out whether the MLNX OFED ibsim
>> supports the additional attributes for running MLNX OpenSM.
>>
>> -- Hal
>>
>>
>> On Thu, Feb 15, 2018 at 11:50 AM, Tim Miller <btmiller at helix.nih.gov> wrote:
>>
>> I am attempting to use ibsim to test some possible
>> configuration changes in our routing, but I am running into
>> some difficulties. I can get the simulator started, but opensm
>> fails to discover the fabric in the simulated environment. It
>> discovers the switch to which the host running opensm is
>> connected, but it can't discover any further than that. In the
>> opensm log, I see:
>>
>> Feb 14 16:31:50 047307 [AD332700] 0x04 -> ni_rcv_process_new: Discovered new Switch node, GUID 0x7cfe900300b49890, TID 0x1239
>> Feb 14 16:31:50 047821 [AD533700] 0x04 -> nd_rcv_process_nd: Node 0x7cfe900300b49890 Description = SwitchIB Mellanox Technologies
>> Feb 14 16:31:50 047847 [B5974700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping Method 0x1, Attr 0xFF17, TID 0x123b
>> Feb 14 16:31:50 047866 [B5974700] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,1, Return path = 0,0
>> Feb 14 16:31:50 047893 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(GeneralInfo), attr_mod 0x4, TID 0x123b
>> Feb 14 16:31:50 047913 [B5974700] 0x04 -> osm_hm_set_by_physp: Remote port of 0x7cfe900300b49890[0] couldn't be found
>> Feb 14 16:31:50 047921 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0xFF17 (GeneralInfo); Possible mis-set mkey?
>> Feb 14 16:31:50 047927 [B5974700] 0x01 -> sm_mad_ctrl_send_err_cb: Error during initialization: got General Info time out from node 0x7cfe900300b49890
>>
>> And in the simulator console, I see messages of the form:
>>
>> ibwarn: [32331] process_packet: no one to handle pkt: class 0x81, attr 0xff17
>>
>> Looking at the output of the "dump" command from within the
>> console, it shows that all ports are in Init/LinkUp, except
>> for the SMA port, which is in state Active/LinkUp.
>>
>> Does anyone have any idea what I might be doing wrong here?
>>
>> Thanks,
>> Tim
>>
>> -- Tim Miller
>> NIH HPC systems staff
>> 301-827-5261
>>
>> https://hpc.nih.gov
>>
>> _______________________________________________
>> Users mailing list
>> Users at lists.openfabrics.org
>> http://lists.openfabrics.org/mailman/listinfo/users
>>
>>
>>
>>
>> -- Tim Miller
>> NIH HPC systems staff
>> 301-827-5261
>> https://hpc.nih.gov
>>
>>
>>
>> -- Tim Miller
>> NIH HPC systems staff
>> 301-827-5261
>> https://hpc.nih.gov
>>
>>
>>
> --
> Tim Miller
> NIH HPC systems staff
> 301-827-5261
> https://hpc.nih.gov
>
>