[ofa-general] Lost in-service traps during Open SM migration

lbt transter at gmail.com
Thu Jul 26 12:37:10 PDT 2007


Thanks for the suggestion Sasha!

Our host stack does receive a reregistration notice and resubscribes all
handlers at that point. At the time of the SM migration, our stack prints
some informational messages that confirm this:
Jul 18 14:31:09 localhost kernel: Event IB_EVENT_CLIENT_REREGISTER occurred on port 1
Jul 18 14:31:09 localhost kernel: OpemSM migrated, old SM LID=1 new SM LID=8

I also confirmed in the SM logs that, after the migration, the higher
priority SM receives a subscription request for the in-service trap:
Jul 18 14:32:13 103550 [41E02960] -> osm_infr_rcv_process_set_method: Subscribe Request with QPN: 0x000001
Jul 18 14:32:13 103554 [41E02960] -> osm_infr_get_by_rec: [
Jul 18 14:32:13 103558 [41E02960] -> __dump_all_informs: [
Jul 18 14:32:13 103562 [41E02960] -> InformInfo dump:
                                gid.....................0x0000000000000000 : 0x0000000000000000
                                lid_range_begin.........0xFFFF
                                lid_range_end...........0x0
                                is_generic..............0x1
                                subscribe...............0x0
                                trap_type...............0x3
                                trap_num................64
                                qpn.....................0x000001
                                resp_time_val...........0x0
                                node_type...............0x000004
Jul 18 14:32:13 103569 [41E02960] -> __dump_all_informs: ]
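
For anyone comparing many of these dumps, a small parser can help pick out the fields of interest. This is only a sketch: the "name.....value" pattern is inferred from the excerpt above, not from the OpenSM source, and parse_inform_dump and SAMPLE are hypothetical names made up for illustration.

```python
import re

# Assumed field layout: lowercase name, a run of dots, then the value.
# For the gid line only the first half (before " : ") is captured.
FIELD_RE = re.compile(r"^\s*([a-z_]+)\.{2,}(\S+)")


def parse_inform_dump(text):
    """Return a dict of field name -> int for each field line in the dump."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            name, raw = m.groups()
            # Base 0 accepts both the hex (0x...) and decimal values seen above.
            fields[name] = int(raw, 0)
    return fields


# Field lines copied from the dump above.
SAMPLE = """\
lid_range_begin.........0xFFFF
lid_range_end...........0x0
is_generic..............0x1
subscribe...............0x0
trap_type...............0x3
trap_num................64
qpn.....................0x000001
resp_time_val...........0x0
node_type...............0x000004
"""

fields = parse_inform_dump(SAMPLE)
print(fields["trap_num"], hex(fields["lid_range_begin"]), fields["subscribe"])
```

Timestamped log lines are skipped because they do not start with a lowercase field name, so the whole log excerpt can be passed in unchanged.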

It may be a problem if the resubscription of the in-service handler occurs
after the in-service notice was forwarded, but I think the real problem is
that no notice is ever forwarded for the restored port of the higher
priority SM. Perhaps neither SM (lower priority or higher priority)
generates an in-service trap because of the timing gap between when the
restored port is detected and "marked" (i.e. added to new_ports_list) and
when in-service traps are generated for new ports. During SM migration, the
lower priority SM detects the new port, but the higher priority SM does the
trap generation, and it doesn't realize that its own port is a new port, so
it doesn't generate a trap for it.
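
To make the suspected timing gap concrete, here is a toy model of the sequence. It is a sketch under assumptions, not OpenSM code: ToySM, sweep, report_new_ports and simulate_restore are hypothetical names, and the model assumes (as I suspect above) that the new master does not treat its own port as new.

```python
class ToySM:
    """Toy stand-in for a subnet manager; not OpenSM's real data structures."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority
        self.new_ports = []   # models new_ports_list
        self.traps_sent = []  # GUIDs an in-service trap was generated for

    def sweep(self, discovered_guids):
        # Discovery: newly seen ports are queued for later reporting.
        self.new_ports.extend(discovered_guids)

    def report_new_ports(self):
        # Assumed to run only on the master, after LID assignment.
        self.traps_sent.extend(self.new_ports)
        self.new_ports.clear()


def simulate_restore(restored_guid):
    low = ToySM("low", priority=1)
    high = ToySM("high", priority=2)

    # 1. The lower priority SM is still master; its sweep finds the port.
    low.sweep([restored_guid])

    # 2. It hands over before LID assignment, so report_new_ports()
    #    never runs on it and the queued port is never reported.

    # 3. The higher priority SM becomes master with an empty list
    #    (assumption: it does not add its own port as new).
    high.report_new_ports()
    return low, high


# GUID taken from the lower priority SM's log.
low, high = simulate_restore(0x00504501483E0000)
print("traps from low:", low.traps_sent)
print("traps from high:", high.traps_sent)
print("still queued on low:", [hex(g) for g in low.new_ports])
```

In this model the restored port stays queued on the SM that went to standby, and neither SM generates a trap for it, which matches what the subscriber observes.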

Our host stack executes some functions when a port is restored (in our
in-service subscription handler).
Am I not supposed to receive an in-service trap for a restored port that
happens to belong to the Master SM, and instead execute these actions on
the client reregistration event?

Thanks again for your help!
Lan



On 7/25/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> Hi Lan,
>
> On 09:57 Wed 25 Jul     , lbt wrote:
> >  Hello,
> >
> >  I have been seeing a problem where a subscriber for in-service traps is
> >  not getting informed when the port of master openSM is restored (i.e.
> >  causing an SM migration).
> >
> >  I have an IB subnet with 2 nodes running OpenSM, different priorities of
> >  course (OpenSM Rev:openib-2.0.5). I also have another node on the subnet
> >  that has subscribed for the forwarding of any IB_SA_GENERIC_TRAP_NUM_IN_SVC
> >  trap events. I've been doing cable pull tests on the IB ports, to check if
> >  the in-service handler I have subscribed gets invoked when I restore the
> >  cable. I've noticed that everything works as expected (i.e. my in-service
> >  handler is invoked) whenever I restore the cable on the lower priority SM
> >  IB port without ever touching the master SM port. But if I cause an SM
> >  migration, by restoring the port of the higher priority SM, the in-service
> >  trap does not get generated as expected on a cable restore.
> >
> >  Steps to Reproduce:
> >  1) Start with port to higher priority SM disconnected.
> >  2) restore port cable on the higher priority SM
> >  --> This causes an SM Migration as expected, SM's migration happens okay
> >  --> I expected the restoration of the higher priority SM to also
> >  trigger an in-service trap and notify subscribers, but it doesn't
> >  occur
> >
> >  I have collected debug messages log for both open SM's, and it appears
> >  that the reason is because:
> >  1) in-service traps are generated based on what ports are added on the
> >  Master SM's new_ports_list, but these traps are generated only after LID
> >  assignment
> >  2) when the higher priority SM port is restored, the restored port gets
> >  added to the lower priority SM's new_ports_list (since it's still the
> >  Master SM at that point in time)
> >  3) the handover of Master SM from lower priority to higher priority SM
> >  occurs (before LID assignment and thus a chance for traps get generated
> >  for those ports on new_ports_list)
> >  4) the higher priority SM is now Master SM, but it has an empty
> >  new_ports_list, so no trap generated either
> >
> >  Does this look like a legitimate Open SM bug? Any feedback would be much
> >  appreciated, and if I can help further in any way please let me know.
>
> As far as I know when OpenSM (even old like 2.0.5) becomes master it
> requests client to reregister SA related stuff (by setting this bit in
> PortInfo).
>
> Probably your port doesn't support this (you could verify by seeing
> PortInfo:CapabilityMask - use 'smpquery portinfo <client-port-lid>') or
> maybe your host stack doesn't do reregistration?
>
> Anyway you could track in the OpenSM code, in osm_lid_mgr.c
> __osm_lid_mgr_set_physp_pi(), whether the client reregistration bit is set
> (with ib_port_info_set_client_rereg()) or not. Then we will know more
> about this problem.
>
> Sasha
>
> >
> >
> >  Subset of logs from lower priority SM during the cable restore of higher
> >  priority SM port:
> >  ### Jul 18 14:31:56 614522 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A TID:0x00000016000012e1
> >  ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE
> >  ### 14:31:56 ******************** INITIATING HEAVY SWEEP **********************
> >  ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SELF
> >  Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding port GUID:0x00504501483e0000 to new_ports_list
> >  Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> >  Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> >  14:31:56 ********************* HEAVY SWEEP COMPLETE ***********************
> >  Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER###
> >  14:31:56 ******************** ENTERING SM STANDBY STATE *******************
> >
> >  Subset of logs from higher priority SM during the cable restore of higher
> >  priority SM port:
> >
> >  Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [
> >  Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state IB_SMINFO_STATE_DISCOVERING
> >  Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state
> >  Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg: ******************** ENTERING SM MASTER STATE ********************
> >  Jul 18 14:32:03 009014 [41401960] -> __osm_state_mgr_set_sm_lid_done_msg: **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
> >  Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
> >  Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports: [
> >  ----> no in-service traps are generated and notices forwarded because
> >  there are no ports on this list
> >  Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports: ]
> >
> >
> >  Thanks!
> >  Lan
>