[ofa-general] Lost in-service traps during Open SM migration

lbt transter at gmail.com
Fri Jul 27 07:47:25 PDT 2007


Hi Sasha,

Yes, the problem seems to appear only when there is an SM migration. I
receive in-service notices for other ports, as long as there is no SM
migration occurring.
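
To make the suspected timing gap concrete, here is a toy Python model of the sequence from my original report (quoted below). It is only a sketch, not OpenSM code: the class `ToySM` and its method names are made up for illustration, and it assumes new_ports_list is not transferred during handover, which is what the logs suggest.

```python
from dataclasses import dataclass, field

IN_SERVICE_TRAP = 64  # trap_num shown in the InformInfo dump quoted below


@dataclass
class ToySM:
    """Hypothetical stand-in for one subnet manager; not the OpenSM data model."""
    name: str
    is_master: bool = False
    new_ports_list: list = field(default_factory=list)
    traps_sent: list = field(default_factory=list)

    def sweep_found_port(self, guid: int) -> None:
        # Only the current master sweeps the subnet and queues new ports.
        if self.is_master:
            self.new_ports_list.append(guid)

    def hand_over_to(self, other: "ToySM") -> None:
        # Assumption: mastership moves, but new_ports_list stays behind.
        self.is_master = False
        other.is_master = True

    def report_new_ports(self) -> None:
        # Runs after LID assignment, and only on the master.
        if not self.is_master:
            return
        for guid in self.new_ports_list:
            self.traps_sent.append((IN_SERVICE_TRAP, guid))
        self.new_ports_list.clear()


def simulate(handover: bool) -> list:
    """Return every in-service trap generated by either SM."""
    low = ToySM("low-priority", is_master=True)
    high = ToySM("high-priority")
    restored_guid = 0x00504501483E0000  # GUID from the lower priority SM log

    low.sweep_found_port(restored_guid)  # queued on the old master
    if handover:
        low.hand_over_to(high)           # happens before LID assignment
    low.report_new_ports()
    high.report_new_ports()
    return low.traps_sent + high.traps_sent


if __name__ == "__main__":
    print("with handover:   ", simulate(handover=True))
    print("without handover:", simulate(handover=False))
```

With the handover in between, neither SM generates a trap: the old master holds the port on its list but is no longer master, and the new master has an empty list.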

Thanks,
Lan

On 7/26/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
>
> On 12:37 Thu 26 Jul     , lbt wrote:
> >  Thanks for the suggestion Sasha!
> >
> >  Our host stack does receive a reregistration notice and does resubscribe
> >  all handlers at that point in time. At the time of the SM migration, our
> >  stack prints out some informational messages to confirm this:
> >  Jul 18 14:31:09 localhost kernel: Event IB_EVENT_CLIENT_REREGISTER occurred on port 1
> >  Jul 18 14:31:09 localhost kernel: OpemSM migrated, old SM LID=1 new SM LID=8
> >
> >  And also confirmed in the SM logs that after the migration, the higher
> >  priority SM is getting a subscription request for in-service trap:
> >  Jul 18 14:32:13 103550 [41E02960] -> osm_infr_rcv_process_set_method: Subscribe Request with QPN: 0x000001
> >  Jul 18 14:32:13 103554 [41E02960] -> osm_infr_get_by_rec: [
> >  Jul 18 14:32:13 103558 [41E02960] -> __dump_all_informs: [
> >  Jul 18 14:32:13 103562 [41E02960] -> InformInfo dump:
> >                                 gid.....................0x0000000000000000 : 0x0000000000000000
> >                                 lid_range_begin.........0xFFFF
> >                                 lid_range_end...........0x0
> >                                 is_generic..............0x1
> >                                 subscribe...............0x0
> >                                 trap_type...............0x3
> >                                 trap_num................64
> >                                 qpn.....................0x000001
> >                                 resp_time_val...........0x0
> >                                 node_type...............0x000004
> >  Jul 18 14:32:13 103569 [41E02960] -> __dump_all_informs: ]
> >
> >  It may be a problem if the resubscription of the in-service handler occurs
> >  after the in-service notice was forwarded, but I think the problem is that
> >  there is never a notice that is forwarded for the higher priority SM port
> >  that is restored.
>
> And after OpenSM migration, did you receive in-service notices for
> other ports? Does the problem happen only at migration time?
>
> >  Perhaps neither SM (the lower priority or the higher
> >  priority one) generates an in-service trap because of the timing gap
> >  between when the restored port is detected and "marked" (i.e. added to
> >  new_ports_list) and when in-service traps are generated for new ports.
> >  During SM migration, the lower priority SM detects the new port, but the
> >  higher priority SM does the trap generation (but it doesn't realize that
> >  its own port is a new port and thus doesn't generate a trap for it).
> >
> >  Our host stack executes some functions when a port is restored (in our
> >  in-service subscription handler).
> >  Am I not supposed to receive an in-service trap for a restored port that
> >  happens to be the Master SM,
>
> Yes, I guess you are.
>
> >  and instead  execute these actions with a
> >  client reregistration event?
>
> Client reregistration request is not suitable here - SM can ask for
> client reregistration at any time (in practice OpenSM now does it only
> when it enters MASTER state, but it is also optional).
>
> Sasha
>
> >
> >  Thanks again for your help!
> >  Lan
> >
> >
> >
> >  On 7/25/07, Sasha Khapyorsky <sashak at voltaire.com> wrote:
> > >
> > > Hi Lan,
> > >
> > > On 09:57 Wed 25 Jul     , lbt wrote:
> > > >  Hello,
> > > >
> > > > >  I have been seeing a problem where a subscriber for in-service traps is not
> > > > >  getting informed when the port of the master OpenSM is restored (i.e. causing an
> > > > >  SM migration).
> > > > >
> > > > >  I have an IB subnet with 2 nodes running OpenSM, with different priorities of
> > > > >  course (OpenSM Rev:openib-2.0.5). I also have another node on the subnet
> > > > >  that has subscribed for the forwarding of any IB_SA_GENERIC_TRAP_NUM_IN_SVC
> > > > >  trap events. I've been doing cable pull tests on the IB ports, to check if
> > > > >  the in-service handler I have subscribed gets invoked when I restore the
> > > > >  cable. I've noticed that everything works as expected (i.e. my in-service
> > > > >  handler is invoked) whenever I restore the cable on the lower priority SM IB
> > > > >  port without ever touching the master SM port. But if I cause an SM
> > > > >  migration, by restoring the port of the higher priority SM, the in-service
> > > > >  trap does not get generated as expected on a cable restore.
> > > >
> > > >  Steps to Reproduce:
> > > >  1) Start with port to higher priority SM disconnected.
> > > >  2) restore port cable on the higher priority SM
> > > > >  --> This causes an SM migration as expected; the SM migration happens okay
> > > > >  --> I expected the restoration of the higher priority SM to also
> > > > >  trigger an in-service trap and notify subscribers, but it doesn't
> > > > >  occur
> > > >
> > > > >  I have collected debug message logs for both OpenSMs, and it appears that
> > > > >  the reason is:
> > > > >  1) in-service traps are generated based on what ports are added to the
> > > > >  Master SM's new_ports_list, but these traps are generated only after LID
> > > > >  assignment
> > > > >  2) when the higher priority SM port is restored, the restored port gets
> > > > >  added to the lower priority SM's new_ports_list (since it's still the Master
> > > > >  SM at that point in time)
> > > > >  3) the handover of Master SM from lower priority to higher priority SM
> > > > >  occurs (before LID assignment, and thus before traps have a chance to get
> > > > >  generated for those ports on new_ports_list)
> > > > >  4) the higher priority SM is now Master SM, but it has an empty
> > > > >  new_ports_list, so no trap is generated either
> > > > >
> > > > >  Does this look like a legitimate OpenSM bug? Any feedback would be much
> > > > >  appreciated, and if I can help further in any way please let me know.
> > >
> > > As far as I know when OpenSM (even old like 2.0.5) becomes master it
> > > requests client to reregister SA related stuff (by setting this bit in
> > > PortInfo).
> > >
> > > > Probably your port doesn't support this (you could verify by seeing
> > > > PortInfo:CapabilityMask - use 'smpquery portinfo <client-port-lid>') or
> > > > maybe your host stack doesn't do reregistration?
> > > >
> > > > Anyway you could track this in the OpenSM code in osm_lid_mgr.c
> > > > __osm_lid_mgr_set_physp_pi() whether client reregistration bit is set
> > > > (with ib_port_info_set_client_rereg()) or not. Then we will know more
> > > > about this problem.
> > >
> > > Sasha
> > >
> > > >
> > > >
> > > > >  Subset of logs from lower priority SM during the cable restore of higher
> > > > >  priority SM port:
> > > > >  ### Jul 18 14:31:56 614522 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x03 num:128 Producer:2 from LID:0x000A TID:0x00000016000012e1
> > > > >  ### Jul 18 14:31:56 614823 [41401960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE
> > > > >  ### 14:31:56 ******************** INITIATING HEAVY SWEEP **********************
> > > > >  ### Jul 18 14:31:56 616887 [42803960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SELF
> > > > >  Jul 18 14:31:56 626078 [42803960] -> __osm_ni_rcv_process_new: Adding port GUID:0x00504501483e0000 to new_ports_list
> > > > >  Jul 18 14:31:56 626524 [42803960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_CHANGE_DETECTED in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> > > > >  Jul 18 14:31:56 632630 [41E02960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_HEAVY_SUBNET
> > > > >  14:31:56 ********************* HEAVY SWEEP COMPLETE ***********************
> > > > >  Jul 18 14:31:56 632773 [41E02960] -> osm_sm_state_mgr_process: Received signal OSM_SM_SIGNAL_HANDOVER_SENT in state IB_SMINFO_STATE_MASTER
> > > > >  ### 14:31:56 ******************** ENTERING SM STANDBY STATE *******************
> > > >
> > > > >  Subset of logs from higher priority SM during the cable restore of higher
> > > > >  priority SM port:
> > > > >
> > > > >  Jul 18 14:32:02 995600 [41401960] -> osm_sm_state_mgr_process: [
> > > > >  Jul 18 14:32:02 995605 [41401960] -> osm_sm_state_mgr_process: Received signal OSM_SM_SIGNAL_DISCOVERY_COMPLETED in state IB_SMINFO_STATE_DISCOVERING
> > > > >  Jul 18 14:32:02 995609 [41401960] -> Entering MASTER state
> > > > >  Jul 18 14:32:02 995888 [41401960] -> __osm_sm_state_mgr_master_msg: ******************** ENTERING SM MASTER STATE ********************
> > > > >  Jul 18 14:32:03 009014 [41401960] -> __osm_state_mgr_set_sm_lid_done_msg: **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG *****
> > > > >  Jul 18 14:32:03 024047 [41E02960] -> __osm_state_mgr_lid_assign_msg ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG *****
> > > > >  Jul 18 14:32:03 024052 [41E02960] -> __osm_state_mgr_report_new_ports: [
> > > > >  ----> no in-service traps are generated and notices forwarded because there
> > > > >  are no ports on this list
> > > > >  Jul 18 14:32:03 024057 [41E02960] -> __osm_state_mgr_report_new_ports: ]
> > > >
> > > >
> > > >  Thanks!
> > > >  Lan
> > >
> > >
>
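
For the PortInfo:CapabilityMask check Sasha suggests in the quoted thread above, a small helper like the following could decode the value reported by 'smpquery portinfo'. It is only a sketch: it assumes that bit 25 of CapabilityMask is IsClientReregistrationSupported, the function name is made up, and the sample mask value is hypothetical rather than taken from my ports.

```python
# Assumption: bit 25 of PortInfo:CapabilityMask is
# IsClientReregistrationSupported.
CLIENT_REREG_SUPPORTED = 1 << 25


def supports_client_rereg(capability_mask: int) -> bool:
    """Return True if the client reregistration capability bit is set."""
    return bool(capability_mask & CLIENT_REREG_SUPPORTED)


if __name__ == "__main__":
    # Hypothetical value, copied by hand from a portinfo query.
    sample_mask = int("0x02510a68", 16)
    print(hex(sample_mask), supports_client_rereg(sample_mask))
```

If the bit is clear, the port would not be expected to act on a reregistration request from the SM at all.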