[openib-general] OpenSM (again)

Roland Fehrenbacher rf at q-leap.de
Tue Apr 12 09:46:59 PDT 2005


>>>>> "Hal" == Hal Rosenstock <halr at voltaire.com> writes:

    >> Problem: When I reboot all the 40 nodes (apart from the one
    >> opensm is running on), the network is non-functional (no pings go
    >> through, even though ports show status "Active") for quite a
    >> while (more than 10 minutes) after all the nodes have come
    >> up. It then recovers without intervention. Is this normal?
    >> Single node reboots don't affect the network operation. osm Log
    >> file is appended.

    Hal> Can you describe your topology ? Is it the following: the SM
    Hal> is connected to a switch/or switches with the 40 nodes
    Hal> connected off these switches ?

Yes, the 40 nodes are connected to a single 144 port switch.

    Hal> I'll respond to the log (and these questions) in a separate
    Hal> email response.

    >> Question 1: Can I run opensm in a master slave configuration?

    Hal> Yes. Others are doing this.

    >> I noticed that there is a priority commandline option, but am
    >> not sure how to apply this.

    Hal> SM election goes by highest priority, then lowest GUID. So
    Hal> if you don't care which SM is the master, then you don't need
    Hal> to do anything. If you want a specific order (and it is not in
    Hal> GUID order) then you need to specify priority.
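
The election rule quoted above (highest priority wins, ties go to the lowest GUID) can be sketched as follows. This is only an illustration of the ordering, not OpenSM code: the SubnetManager type, the elect_master function, and the server names are hypothetical, and the 0..15 priority range is assumed from the values used in this thread. The GUIDs are the two port GUIDs that appear in the logs below.

```python
from typing import NamedTuple, Sequence


class SubnetManager(NamedTuple):
    """Hypothetical record for one SM candidate (not an OpenSM type)."""
    name: str
    priority: int   # assumed range 0..15, higher is preferred
    port_guid: int  # 64-bit port GUID, lower wins a priority tie


def elect_master(candidates: Sequence[SubnetManager]) -> SubnetManager:
    """Pick the master: highest priority first, then lowest GUID."""
    if not candidates:
        raise ValueError("no subnet managers to choose from")
    for sm in candidates:
        if not 0 <= sm.priority <= 15:
            raise ValueError(f"priority out of range for {sm.name}")
    # Negating the priority lets a single min() express
    # "largest priority, then smallest GUID".
    return min(candidates, key=lambda sm: (-sm.priority, sm.port_guid))


if __name__ == "__main__":
    # Illustrative names; GUIDs taken from the log excerpts in this mail.
    server_a = SubnetManager("server-a", 0, 0x0002C902004013C2)
    server_b = SubnetManager("server-b", 15, 0x0002C9020040133A)
    print(elect_master([server_a, server_b]).name)
```

With equal priorities the candidate with the numerically lower GUID is chosen, which is why no priority setting is needed when GUID order is acceptable.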

Ok. I tried this, specifying priority 0 on one server and priority 15
on another one. I assume the priority 15 one will be the master.
If I first start the priority 0 opensm and then the priority 15 one,
things look normal. Log excerpts:

priority 0 server

Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e
Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a

priority 15 server

Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.

When I kill the priority 15 server, however, the priority 0 server runs
amok with continuous log messages like:

Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.

I assume, then, that the handover to the priority 0 opensm didn't
work. For additional information: this test was done on a
point-to-point connection between 2 adapters.

Roland



