[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?

Leonid Keller leonid at mellanox.co.il
Mon Mar 9 02:18:33 PDT 2009


Applied in 2019. 

> -----Original Message-----
> From: Leonid Keller 
> Sent: Thursday, March 05, 2009 4:16 PM
> To: 'Smith, Stan'; 'Anatoly Greenblatt'
> Cc: ofw at lists.openfabrics.org
> Subject: RE: ibbus disable on HCA0 erroneously removes all 
> IPoIB instances; including IPoIB ports on HCA1 ?
> 
> Find attached a patch that removes the registration of the HCA with IBAL.
> 
> It should have been done anyway independent of reported problems. 
> Wrt the problems:
> 
> 1.  (Stan) "ibbus disable on HCA0 erroneously removes all 
> IPoIB instances; including IPoIB ports on HCA1 ?"
> This patch seems to solve this problem when working 
> without WinVerbs & WinMad.
> With the Win* drivers one can get a crash when playing disable/enable 
> with MLX4_HCA.
> I believe it isn't related to the patch. I'll describe it 
> in another thread.
> 
> 2. (Anatoly) "winof 2.0.2: crash in ibbus.sys when running 
> whql tests on mlx4 hca"
> I don't think this patch will cause MLX4_HCA to pass pnpdtest.
> But maybe the crash will go away. I don't know exactly which 
> case of pnpdtest caused the crash.
> Anatoly, could you try it, having applied the patch ?
> 
> 
> > -----Original Message-----
> > From: Smith, Stan [mailto:stan.smith at intel.com]
> > Sent: Saturday, February 21, 2009 1:26 AM
> > To: Leonid Keller
> > Cc: ofw at lists.openfabrics.org
> > Subject: RE: ibbus disable on HCA0 erroneously removes all IPoIB 
> > instances; including IPoIB ports on HCA1 ?
> > 
> > Hello Leonid,
> >   Thanks for taking the time to consider this curious ibbus/ipoib 
> > problem.
> > 
> > I understand what you are saying w.r.t. flows, although I'm confused 
> > in that disabling the 1st HCA used to only disable the 1st two IPoIB 
> > instances, not all 4. So I have to ask myself, what has changed to 
> > induce this new behavior? WHQL patches perhaps?
> > 
> > The previous disable HCA-0 behavior, granted it was incorrect for 
> > other reasons, disabled IPoIB instance 0 & 1, correctly leaving IPoIB 
> > instances 2 & 3 alone. The problem was the user accessible IBAL device 
> > was bound to HCA-0, hence IBAL access was disabled when HCA-0 was 
> > disabled, even though HCA-1 was alive and well; vstat stopped working.
> > Upon discovery we talked about implementing a Control Device Object 
> > which the user accessible IBAL device would be bound to, thus allowing 
> > HCAs to come and go (disable, enable) without breaking user-mode 
> > access to IBAL (provided there was at least one HCA enabled).
> > 
> > With the current HCA-0 disable behavior, a Control Device Object for 
> > the user-accessible IBAL device is not required as disabling either 
> > HCA device requires a reboot.
> > 
> > About your comment 'And the right solution is to remove this mechanism 
> > at all!', are you suggesting a conversion to KMDF PNP framework?
> > 
> > Thanks,
> > 
> > Stan.
> > 
> > 
> > Leonid Keller wrote:
> > > >  > Any ideas on the reasons why the 2nd port_mgr_port_remove() call 
> > > >  > was invoked?
> > > > My guess is that the problem is created by the HCA's mechanism of 
> > > > registration for IBAL arrival.
> > > And the right solution is to remove this mechanism at all!
> > >
> > > Here is my theory:
> > > > As a reminder: IBAL historically sat under ROOT and could be 
> > > > loaded after the HCA driver.
> > > > So the HCA driver registered with the OS for notification of the 
> > > > arrival of the IBAL low interface.
> > > > And this code still works!
> > > > What happens in this case?
> > > > Two HCA devices get started and make the above registration.
> > > > When the first IBAL instance is started, both HCA devices get 
> > > > notification about it and register themselves with this instance.
> > > > When we disable the first HCA, both HCAs get notification about the 
> > > > removal of IBAL (in __pnp_notify_ifc) and each deregisters its HCA 
> > > > from IBAL, which removes all IPoIB devices.
> > > >
> > > > So we now have two device-removal flows, which run 
> > > > simultaneously.
> > > > 1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
> > > > ibbus!cl_pnp
> > > >   ibbus!__remove
> > > >     ibbus!cl_do_remove                // Pass the IRP down
> > > >       mlx4_hca!cl_pnp                 // IRP_MN_REMOVE_DEVICE
> > > >         mlx4_hca!__remove
> > > >           mlx4_hca!cl_do_remove
> > > >             mlx4_hca!hca_release_resources
> > > >               mlx4_hca!__hca_release_resources
> > > >                 mlx4_hca!__hca_deregister
> > > >                   ibbus!ib_deregister_ca
> > > >
> > > > 2. A flow caused by the notification on IBAL interface removal:
> > > > mlx4_hca!__pnp_notify_ifc
> > > >   ibbus!ib_deregister_ca
> > >
> > > > Just FYI: the flow of ib_deregister_ca:
> > > > ib_deregister_ca
> > > >   destroying_ci_ca                    // from p_ci_ca->obj.pfn_destroy
> > > >     sync_destroy_obj
> > > >       destroy_obj
> > > >         destroying_ci_ca
> > > >           pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
> > > >             cl_async_proc_queue
> > > > And in another thread:
> > > >   __pnp_process_remove_ca
> > > >     __pnp_process_remove_port
> > > >       __pnp_notify_user
> > > >         port_mgr_pnp_cb
> > > >           port_mgr_port_remove
> > >
> > >
> > >> -----Original Message-----
> > >> From: Smith, Stan [mailto:stan.smith at intel.com]
> > >> Sent: Saturday, February 07, 2009 3:05 AM
> > >> To: Leonid Keller
> > >> Cc: ofw at lists.openfabrics.org
> > >> Subject: ibbus disable on HCA0 erroneously removes all IPoIB 
> > >> instances; including IPoIB ports on HCA1 ?
> > >>
> > >> Hello,
> > >>   Recently I discovered some bad HCA disable behavior
> > which used to
> > >> work correctly?
> > >>
> > >> Has the disable behavior for HCA0 been changed recently
> > such that all
> > >> existing IPoIB instances for all HCAs are removed?
> > >>
> > >> Details:
> > >>
> > > >> For an x86 system using svn.1932 mthca.sys & ibbus.sys with two Mx
> > > >> MT23108 HCAs (1 port active, one port disconnected per HCA), no WSD
> > > >> or WinOF install, just bare mthca, ibbus & IPoIB.
> > >>
> > >> When both HCAs are enabled there are 4 IPoIB instances.
> > >>
> > > >> When the 1st HCA as seen by PNP (HCA0 for discussion
> > > >> purposes) is disabled, all 4 IPoIB instances are removed from the 
> > > >> device manager view along with the expected HCA0 disabled.
> > > >> The 2nd HCA (HCA1) is still enabled with no IPoIB instances shown by
> > > >> the device manager.
> > >>
> > > >> The expected behavior when disabling HCA0 should be the 1st two IPoIB
> > > >> instances [0 & 1] would be removed from the device manager view, with
> > > >> the 2nd two IPoIB instances [3 & 4] remaining.
> > > >> This is the case which exposes the ibbus bug where vstat no longer 
> > > >> works because \Devices\ibal has been removed as it's bound to the 1st
> > > >> PNP seen HCA which is now disabled.
> > >>
> > >> If you reverse the disable order, such that HCA1 is 
> disabled while 
> > >> HCA0 remains enabled, the expected IPoIB instances [3 & 4] are 
> > >> removed; while instances [0 & 1] remain.
> > >>
> > > >> The problem occurs when cl_pnp() calls
> > > >> ibbus::port_mgr_pnp_cb() to remove the IPoIB instances for HCA1; the
> > > >> previous call to ibbus::port_mgr_pnp_cb() for HCA0 is correct.
> > >>
> > >> fdo_query_remove() [
> > >> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0) 
> > >> bfi-0 CA 8025000002c90200 fdo_query_remove() ]
> > >> __query_remove() ]
> > >> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420, ext 
> > >> FAB160E8, status 0
> > >> cl_pnp(): returned with status 0
> > >> cl_pnp() ]
> > >> port_mgr_pnp_cb() [
> > >> port_mgr_pnp_cb() ]
> > >> port_mgr_pnp_cb() [
> > >> port_mgr_port_remove() [
> > >> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
> > >> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8, ext 
> > >> 81DA3C90, present 0, missing 0
> > >> port_mgr_port_remove() ]
> > >> port_mgr_pnp_cb() ]
> > >> port_mgr_pnp_cb() [
> > >> port_mgr_port_remove() [
> > >> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
> > >> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350, ext 
> > >> 81DA3408, present 0, missing 0
> > >> port_mgr_port_remove() ]
> > >> port_mgr_pnp_cb() ]
> > >> iou_mgr_pnp_cb() [
> > >> iou_mgr_iou_remove() [
> > >> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
> > > >> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0, missing 1 .
> > > >> iou_mgr_iou_remove() ]
> > >> iou_mgr_pnp_cb() ]
> > >>
> > > >> XXX - this PNP call for HCA1 should not have occurred when disabling 
> > > >> HCA0.
> > >>
> > >> port_mgr_port_remove() [
> > >> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
> > >> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8, ext 
> > >> FED75E90, present 0, missing 0
> > >> port_mgr_port_remove() ]
> > >> port_mgr_pnp_cb() ]
> > >> port_mgr_pnp_cb() [
> > >> port_mgr_port_remove() [
> > >> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
> > >> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8, ext 
> > >> FF881E90, present 0, missing 0
> > >> port_mgr_port_remove() ]
> > >> port_mgr_pnp_cb() ]
> > >> iou_mgr_pnp_cb() [
> > >> iou_mgr_iou_remove() [
> > >> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
> > > >> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0, missing 1 .
> > > >> iou_mgr_iou_remove() ]
> > >> iou_mgr_pnp_cb() ]
> > >>
> > >> XXX end of badness...
> > >>
> > > >> Any ideas on the reasons why the 2nd port_mgr_port_remove() call was
> > > >> invoked?
> > >> Is there some binding between HCA1 IPoIB ports and HCA0?
> > >>
> > >> Thanks,
> > >>
> > >> Stan.
> > 
> > 
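
For readers following the thread, the theory quoted above (both HCAs
registered for the IBAL interface notification, so disabling one HCA makes
both deregister) can be sketched as a toy model. This is only an
illustration of the reasoning, not ibbus or mlx4_hca code: the names Ibal,
disable_hca and run are hypothetical, and the two-ports-per-HCA layout is
taken from the test setup described in the thread.

```python
class Ibal:
    """Toy stand-in for IBAL (hypothetical, not the real driver)."""

    def __init__(self):
        self.ports = {}      # HCA name -> list of its IPoIB instance names
        self.listeners = []  # HCAs registered for IBAL interface notifications

    def register_ca(self, hca, port_count=2):
        # Each HCA port gets one IPoIB instance, as in the reported setup.
        self.ports[hca] = [f"{hca}-ipoib{p}" for p in range(1, port_count + 1)]

    def deregister_ca(self, hca):
        # Models the effect of ib_deregister_ca: that HCA's IPoIB ports go away.
        self.ports.pop(hca, None)

    def ipoib_instances(self):
        return sorted(p for ports in self.ports.values() for p in ports)


def disable_hca(ibal, hca):
    # Flow 1: the normal PnP remove path deregisters only the disabled HCA.
    ibal.deregister_ca(hca)
    # Flow 2: the interface-removal notification reaches every listener,
    # and each listener deregisters itself from IBAL.
    for listener in list(ibal.listeners):
        ibal.deregister_ca(listener)


def run(with_notification):
    ibal = Ibal()
    for hca in ("HCA0", "HCA1"):
        ibal.register_ca(hca)
        if with_notification:
            ibal.listeners.append(hca)  # the registration the patch removes
    disable_hca(ibal, "HCA0")
    return ibal.ipoib_instances()


if __name__ == "__main__":
    print("with notification registration:   ", run(True))
    print("without notification registration:", run(False))
```

In this model, keeping the notification registration leaves no IPoIB
instances after HCA0 is disabled, while dropping it leaves the two
instances that belong to HCA1, which matches the behavior Stan expected.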