[openib-general] OpenSM (again)
Roland Fehrenbacher
rf at q-leap.de
Tue Apr 12 09:46:59 PDT 2005
>>>>> "Hal" == Hal Rosenstock <halr at voltaire.com> writes:
>> Problem: When I reboot all 40 nodes (apart from the one
>> opensm is running on), the network is non-functional (no pings go
>> through, even though ports show status "Active") for quite a
>> while (more than 10 minutes) after all the nodes have come
>> up. It then recovers without intervention. Is this normal?
>> Single node reboots don't affect the network operation. osm Log
>> file is appended.
Hal> Can you describe your topology ? Is it the following: the SM
Hal> is connected to a switch/or switches with the 40 nodes
Hal> connected off these switches ?
Yes, the 40 nodes are connected to a single 144 port switch.
Hal> I'll respond to the log (and these questions) in a separate
Hal> email response.
>> Question 1: Can I run opensm in a master slave configuration?
Hal> Yes. Others are doing this.
>> I noticed that there is a priority commandline option, but am
>> not sure how to apply this.
Hal> SM election occurs per high priority low GUID. So if you
Hal> don't care which SM is the master then you don't need to do
Hal> anything. If you want a specific order (and it is not in GUID
Hal> order) then you need to specify priority.
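The election rule Hal describes (highest priority wins, lowest GUID breaks ties) can be sketched as below. This is only an illustration of the comparison, not OpenSM code: the `SubnetManager` type and `elect_master` function are hypothetical names, the 0..15 priority range is an assumption, and the two GUIDs are the port GUIDs that appear in the log excerpts further down.

```python
from typing import Iterable, NamedTuple


class SubnetManager(NamedTuple):
    """Hypothetical record for one SM candidate (not an OpenSM structure)."""
    name: str
    priority: int  # assumed range 0..15; a higher value wins
    guid: int      # port GUID; the lower value wins when priorities are equal


def elect_master(candidates: Iterable[SubnetManager]) -> SubnetManager:
    """Pick the master per "high priority, low GUID"."""
    # Negating the priority lets a single min() express
    # "highest priority first, then lowest GUID".
    return min(candidates, key=lambda sm: (-sm.priority, sm.guid))


# Port GUIDs taken from the log excerpts in this message.
sm_a = SubnetManager("priority 0 server", 0, 0x0002C902004013C2)
sm_b = SubnetManager("priority 15 server", 15, 0x0002C9020040133A)

print(elect_master([sm_a, sm_b]).name)
```

Under this rule, if both SMs ran with the same priority, the one with the numerically lower GUID would be chosen, which is why no priority needs to be given when the order does not matter.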
Ok. I tried this, specifying priority 0 on one server and priority 15
on another one. I assume the priority 15 server will be the master.
If I first start the priority 0 opensm, and then the priority 15 one,
things look normal. Log excerpts:
priority 0 server
Apr 12 18:41:06 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:41:06 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [4000] -> osm_vendor_bind: Binding to port 0x2c902004013c2.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000011
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c902004013c2
Apr 12 18:41:06 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000d
Apr 12 18:41:06 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
Apr 12 18:42:25 [18007] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0002 TID:0x000000000000000e
Apr 12 18:42:25 [18007] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0002 GID:0xfe80000000000000,0x0002c9020040133a
priority 15 server
Apr 12 18:42:25 [4000] -> OpenSM Rev:openib-1.0.0
Apr 12 18:42:25 [4000] -> osm_opensm_init: Forcing single threaded dispatcher.
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [4000] -> osm_vendor_bind: Binding to port 0x2c9020040133a.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
Apr 12 18:42:25 [18007] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:Received an Invalid Delete Request.
When I kill the priority 15 server, however, the priority 0 server runs
amok with continuous log messages like:
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
Apr 12 18:44:28 [2400A] -> umad_receiver: send completed with error(method=1 attr=20) -- dropping.
I assume that the handover to the priority 0 opensm hasn't worked
then. For additional information: This test was done on a
point-to-point connection between 2 adapters.
Roland