[ofa-general] Lost in-service traps during Open SM migration
Sasha Khapyorsky
sashak at voltaire.com
Wed Jul 25 15:02:04 PDT 2007
Hi Lan,
On 09:57 Wed 25 Jul , lbt wrote:
> Hello,
>
> I have been seeing a problem where a subscriber for in-service traps is not
> notified when the port of the master OpenSM is restored (i.e. causing an
> SM migration).
>
> I have an IB subnet with 2 nodes running OpenSM, with different priorities of
> course (OpenSM Rev:openib-2.0.5). I also have another node on the subnet
> that has subscribed for the forwarding of any IB_SA_GENERIC_TRAP_NUM_IN_SVC
> trap events. I've been doing cable pull tests on the IB ports, to check if
> the in-service handler I have subscribed gets invoked when I restore the
> cable. I've noticed that everything works as expected (i.e. my in-service
> handler is invoked) whenever I restore the cable on the lower priority SM IB
> port without ever touching the master SM port. But if I cause an SM
> migration, by restoring the port of the higher priority SM, the in-service
> trap does not get generated as expected on a cable restore.
>
> Steps to Reproduce:
> 1) Start with port to higher priority SM disconnected.
> 2) restore port cable on the higher priority SM
> --> This causes an SM Migration as expected, SM's migration happens okay
> --> I expected the restoration of the higher priority SM port to also
> trigger an in-service trap and notify subscribers, but that doesn't
> occur
>
> I have collected debug logs from both OpenSM instances, and it appears that
> the reason is:
> 1) in-service traps are generated based on what ports are added on the
> Master SM's new_ports_list, but these traps are generated only after LID
> assignment
> 2) when the higher priority SM port is restored, the restored port gets
> added to the lower priority SM's new_ports_list (since it's still the Master
> SM at that point in time)
> 3) the handover of Master SM from lower priority to higher priority SM
> occurs (before LID assignment, and thus before traps get a chance to be
> generated for those ports on new_ports_list)
> 4) the higher priority SM is now Master SM, but it has an empty
> new_ports_list, so no trap generated either
>
> Does this look like a legitimate OpenSM bug? Any feedback would be much
> appreciated, and if I can help further in any way please let me know.
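If I read your four steps right, the sequence can be sketched as the toy
model below. Everything in it (struct toy_sm and the toy_* functions) is
made up for illustration and is not OpenSM code; the only assumption it
encodes is the one from your description, that the pending new_ports_list
stays with the old master on handover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NEW_PORTS 8

/* Hypothetical stand-in for one SM instance; not an OpenSM structure. */
struct toy_sm {
    int is_master;
    uint64_t new_ports[MAX_NEW_PORTS];
    size_t n_new_ports;
};

/* Runs after LID assignment. Returns how many in-service traps would be
 * reported; only the current master reports anything. */
static size_t toy_report_new_ports(struct toy_sm *sm)
{
    size_t reported = 0;

    if (sm->is_master) {
        reported = sm->n_new_ports;
        sm->n_new_ports = 0;
    }
    return reported;
}

/* Handover: mastership moves, the pending list does not (assumption taken
 * from the steps described above). */
static void toy_handover(struct toy_sm *from, struct toy_sm *to)
{
    assert(from->is_master && !to->is_master);
    from->is_master = 0;
    to->is_master = 1;
}

/* Replays steps 2-4 and returns the total number of traps reported. */
static size_t toy_replay_migration(void)
{
    struct toy_sm low = { .is_master = 1 };
    struct toy_sm high = { .is_master = 0 };

    /* Step 2: the restored port lands on the lower priority SM's list. */
    low.new_ports[low.n_new_ports++] = 0x00504501483e0000ULL;
    /* Step 3: handover happens before the list is reported. */
    toy_handover(&low, &high);
    /* Step 4: neither SM reports the port. */
    return toy_report_new_ports(&low) + toy_report_new_ports(&high);
}
```

In this model toy_replay_migration() reports no traps, while a master that
keeps its role and calls toy_report_new_ports() reports the pending port.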
As far as I know, when OpenSM (even one as old as 2.0.5) becomes master it
requests that clients reregister their SA-related state (by setting the
ClientReregister bit in PortInfo).
Probably your port doesn't support this (you can verify by checking
PortInfo:CapabilityMask - use 'smpquery portinfo <client-port-lid>'), or
maybe your host stack doesn't do reregistration?
Anyway, you could trace this in the OpenSM code, in osm_lid_mgr.c
__osm_lid_mgr_set_physp_pi(), to see whether the client reregistration bit
is set (with ib_port_info_set_client_rereg()) or not. Then we will know more
about this problem.
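
To make the two checks concrete, here is a minimal standalone sketch of the
bit tests involved. The helper names are hypothetical and this is not the
OpenSM source. It assumes that IsClientReregistrationSupported is bit 25 of
CapabilityMask (in host byte order, as smpquery prints it) and that
ClientReregister is the top bit of the byte that also carries SubnetTimeout
in its low 5 bits. Please check both against ib_types.h in your tree.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: bit 25 of PortInfo:CapabilityMask (host byte order) is
 * IsClientReregistrationSupported. */
#define CAP_MASK_CLIENT_REREG_SUPPORTED 0x02000000u

/* Assumption: ClientReregister is the top bit of the byte whose low
 * 5 bits hold SubnetTimeout. */
#define CLIENT_REREG_BIT 0x80u

/* Hypothetical helper: does the port advertise reregistration support? */
static int port_supports_client_rereg(uint32_t cap_mask_host_order)
{
    return (cap_mask_host_order & CAP_MASK_CLIENT_REREG_SUPPORTED) != 0;
}

/* Hypothetical helper: returns the SubnetTimeout byte with the
 * ClientReregister bit set or cleared, keeping the 5 timeout bits. */
static uint8_t set_client_rereg(uint8_t subnet_timeout_byte,
                                uint8_t client_rereg)
{
    assert(client_rereg <= 1);
    return (uint8_t)((subnet_timeout_byte & 0x1Fu) |
                     (uint8_t)(client_rereg << 7));
}

/* Hypothetical helper: is ClientReregister set in the byte sent out? */
static int client_rereg_requested(uint8_t subnet_timeout_byte)
{
    return (subnet_timeout_byte & CLIENT_REREG_BIT) != 0;
}
```

For example, a made-up CapabilityMask of 0x02510868 would pass the first
check and 0x00510868 would not. Logging the result of the last helper next
to the ib_port_info_set_client_rereg() call would show whether the new
master actually asks your client port to reregister.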
Sasha
>
>
> Subset of logs from lower priority SM during the cable restore of higher
> priority SM port:
> ### Jul 18 14:31:56 614522 [41401960] -> __osm_trap_rcv_process_request:
> Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A
> TID:0x00000016000012e1
> ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: Received
> signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE
> ### 14:31:56 ******************** INITIATING HEAVY SWEEP
> **********************
> ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: Received
> signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state
> OSM_SM_STATE_SWEEP_HEAVY_SELF
> Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding port
> GUID:0x00504501483e0000 to new_ports_list
> Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received signal
> OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received signal
> OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> 14:31:56 ********************* HEAVY SWEEP COMPLETE ***********************
> Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received
> signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER
> ### 14:31:56 ******************** ENTERING SM STANDBY STATE *******************
>
> Subset of logs from higher priority SM during the cable restore of higher
> priority SM port:
>
> Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [
> Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received
> signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state
> IB_SMINFO_STATE_DISCOVERING
> Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state
> Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg:
> ******************** ENTERING SM MASTER STATE ********************
> Jul 18 14:32:03 009014 [41401960] -> __osm_state_mgr_set_sm_lid_done_msg:
> **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
> Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg
> ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
> Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports: [
> ----> no in-service traps are generated and notices forwarded because there
> are no ports on this list
> Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports: ]
>
>
> Thanks!
> Lan