[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?
Leonid Keller
leonid at mellanox.co.il
Mon Mar 9 02:18:33 PDT 2009
Applied in rev. 2019.
> -----Original Message-----
> From: Leonid Keller
> Sent: Thursday, March 05, 2009 4:16 PM
> To: 'Smith, Stan'; 'Anatoly Greenblatt'
> Cc: ofw at lists.openfabrics.org
> Subject: RE: ibbus disable on HCA0 erroneously removes all
> IPoIB instances; including IPoIB ports on HCA1 ?
>
> Please find attached a patch that removes the HCA's registration with IBAL.
>
> It should have been done anyway, independently of the reported problems.
> W.r.t. the problems:
>
> 1. (Stan) "ibbus disable on HCA0 erroneously removes all
> IPoIB instances; including IPoIB ports on HCA1 ?"
> This patch seems to solve this problem when working
> without WinVerbs & WinMad.
> With the Win* drivers one can get a crash when repeatedly
> disabling/enabling MLX4_HCA.
> I believe it isn't related to the patch. I'll describe it
> in another thread.
>
> 2. (Anatoly) "winof 2.0.2: crash in ibbus.sys when running
> whql tests on mlx4 hca"
> I don't think this patch will cause MLX4_HCA to pass pnpdtest,
> but maybe the crash will go away. I don't know exactly which
> pnpdtest case caused the crash.
> Anatoly, could you try it with the patch applied?
>
>
> > -----Original Message-----
> > From: Smith, Stan [mailto:stan.smith at intel.com]
> > Sent: Saturday, February 21, 2009 1:26 AM
> > To: Leonid Keller
> > Cc: ofw at lists.openfabrics.org
> > Subject: RE: ibbus disable on HCA0 erroneously removes all IPoIB
> > instances; including IPoIB ports on HCA1 ?
> >
> > Hello Leonid,
> > Thanks for taking the time to consider this curious ibbus/ipoib
> > problem.
> >
> > I understand what you are saying w.r.t. flows, although I'm confused
> > in that disabling the 1st HCA used to only disable the 1st two IPoIB
> > instances, not all 4. So I have to ask myself, what has changed to
> > induce this new behavior? WHQL patches perhaps?
> >
> > The previous disable HCA-0 behavior, granted it was incorrect for
> > other reasons, disabled IPoIB instance 0 & 1, correctly leaving IPoIB
> > instances 2 & 3 alone. The problem was the user accessible IBAL device
> > was bound to HCA-0, hence IBAL access was disabled when HCA-0 was
> > disabled, even though HCA-1 was alive and well; vstat stopped working.
> > Upon discovery we talked about implementing a Control Device Object
> > which the user accessible IBAL device would be bound to, thus allowing
> > HCAs to come and go (disable, enable) without breaking user-mode
> > access to IBAL (provided there was at least one HCA enabled).
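
[Editor's sketch] The difference between the two bindings described above can be modeled in a few lines of user-space C. This is only a toy illustration under stated assumptions: the type and function names (ibal_access_model_t, user_access_bound_to_hca0, user_access_via_control_device) are hypothetical and are not part of ibbus or IBAL.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_HCA 2   /* two HCAs, as in the setup discussed in this thread */

/* Hypothetical toy state: which HCAs are currently enabled. */
typedef struct {
    bool hca_enabled[MODEL_MAX_HCA];
} ibal_access_model_t;

/* Binding described above as the problem: the user-accessible IBAL device
 * is tied to HCA-0, so user-mode access follows HCA-0 only. */
static bool user_access_bound_to_hca0(const ibal_access_model_t *m)
{
    return m->hca_enabled[0];
}

/* Proposed binding: the user-accessible device hangs off a control device
 * object, so access works as long as at least one HCA is enabled. */
static bool user_access_via_control_device(const ibal_access_model_t *m)
{
    for (int i = 0; i < MODEL_MAX_HCA; i++) {
        if (m->hca_enabled[i]) {
            return true;
        }
    }
    return false;
}
```

With HCA-0 disabled and HCA-1 enabled, the first predicate reports no access (the vstat failure), while the second still reports access.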
> >
> > With the current HCA-0 disable behavior, a Control Device Object for
> > the user-accessible IBAL device is not required as disabling either
> > HCA device requires a reboot.
> >
> > About your comment 'And the right solution is to remove this mechanism
> > at all!', are you suggesting a conversion to KMDF PNP framework?
> >
> > Thanks,
> >
> > Stan.
> >
> >
> > Leonid Keller wrote:
> > > > Any ideas on the reasons why the 2nd port_mgr_port_remove() call
> > > > was invoked?
> > > My guess is that the problem is created by the HCA's mechanism of
> > > registration for IBAL arrival.
> > > And the right solution is to remove this mechanism at all!
> > >
> > > Here is my theory:
> > > As a reminder: IBAL historically sat under ROOT and could be
> > > loaded after the HCA driver.
> > > So the HCA driver registered with the OS for notification of the
> > > arrival of the IBAL low interface.
> > > And this code still works!
> > > What happens in this case?
> > > Two HCA devices get started and make the above registration.
> > > When the first IBAL instance is started, both HCA devices get
> > > notified about it and register themselves with this instance.
> > > When we disable the first HCA, both HCAs get notified about the
> > > removal of IBAL (in __pnp_notify_ifc) and deregister the HCA
> > > from IBAL, which removes all IPoIB devices.
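
[Editor's sketch] The theory above can be illustrated with a toy user-space model in C. It assumes two HCAs with two IPoIB ports each, as in Stan's setup. All names here (hca_model_t, model_deregister_ca, and so on) are hypothetical and are not the real ibbus or mlx4_hca API; the sketch only shows why an unscoped notification removes every IPoIB device, while the plain PnP remove path touches one HCA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HCA       2
#define PORTS_PER_HCA 2

/* Hypothetical per-HCA state for the model. */
typedef struct {
    bool registered_with_ibal;
    bool ipoib_present[PORTS_PER_HCA];
} hca_model_t;

/* Both HCAs started and registered; four IPoIB devices exist. */
static void model_init(hca_model_t hcas[MAX_HCA])
{
    for (size_t i = 0; i < MAX_HCA; i++) {
        hcas[i].registered_with_ibal = true;
        for (size_t p = 0; p < PORTS_PER_HCA; p++) {
            hcas[i].ipoib_present[p] = true;
        }
    }
}

/* Deregistering an HCA from IBAL removes that HCA's IPoIB devices. */
static void model_deregister_ca(hca_model_t *hca)
{
    hca->registered_with_ibal = false;
    for (size_t p = 0; p < PORTS_PER_HCA; p++) {
        hca->ipoib_present[p] = false;
    }
}

/* With the arrival/removal registration in place: the interface-removal
 * notification reaches every HCA that registered, not only the one
 * being disabled, so every HCA deregisters. */
static void model_disable_with_notification(hca_model_t hcas[MAX_HCA], size_t which)
{
    (void)which; /* the notification is not scoped to a single HCA */
    for (size_t i = 0; i < MAX_HCA; i++) {
        model_deregister_ca(&hcas[i]);
    }
}

/* Without that registration: only the normal PnP remove flow runs. */
static void model_disable_pnp_only(hca_model_t hcas[MAX_HCA], size_t which)
{
    model_deregister_ca(&hcas[which]);
}

static int model_count_ipoib(const hca_model_t hcas[MAX_HCA])
{
    int count = 0;
    for (size_t i = 0; i < MAX_HCA; i++) {
        for (size_t p = 0; p < PORTS_PER_HCA; p++) {
            count += hcas[i].ipoib_present[p] ? 1 : 0;
        }
    }
    return count;
}
```

In this model, disabling HCA 0 via model_disable_with_notification leaves no IPoIB devices, whereas model_disable_pnp_only leaves the two that belong to HCA 1.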
> > >
> > > So we now have two device-removal flows, which run
> > > simultaneously.
> > > 1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
> > >     ibbus!cl_pnp
> > >     ibbus!__remove
> > >     ibbus!cl_do_remove                  // Pass the IRP down
> > >     mlx4_hca!cl_pnp                     // IRP_MN_REMOVE_DEVICE
> > >     mlx4_hca!__remove
> > >     mlx4_hca!cl_do_remove
> > >     mlx4_hca!hca_release_resources
> > >     mlx4_hca!__hca_release_resources
> > >     mlx4_hca!__hca_deregister
> > >     ibbus!ib_deregister_ca
> > >
> > > 2. A flow caused by the notification of the IBAL interface removal:
> > >     mlx4_hca!__pnp_notify_ifc
> > >     ibbus!ib_deregister_ca
> > >
> > > Just FYI: the flow of ib_deregister_ca:
> > >     ib_deregister_ca
> > >     destroying_ci_ca                    // from p_ci_ca->obj.pfn_destroy
> > >     sync_destroy_obj
> > >     destroy_obj
> > >     destroying_ci_ca
> > >     pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
> > >     cl_async_proc_queue
> > > And in another thread:
> > >     __pnp_process_remove_ca
> > >     __pnp_process_remove_port
> > >     __pnp_notify_user
> > >     port_mgr_pnp_cb
> > >     port_mgr_port_remove
> > >
> > >
> > >> -----Original Message-----
> > >> From: Smith, Stan [mailto:stan.smith at intel.com]
> > >> Sent: Saturday, February 07, 2009 3:05 AM
> > >> To: Leonid Keller
> > >> Cc: ofw at lists.openfabrics.org
> > >> Subject: ibbus disable on HCA0 erroneously removes all IPoIB
> > >> instances; including IPoIB ports on HCA1 ?
> > >>
> > >> Hello,
> > >> Recently I discovered some bad HCA disable behavior which used to
> > >> work correctly?
> > >>
> > >> Has the disable behavior for HCA0 been changed recently such that all
> > >> existing IPoIB instances for all HCAs are removed?
> > >>
> > >> Details:
> > >>
> > >> For an x86 system using svn.1932 mthca.sys & ibbus.sys with two Mx
> > >> MT23108 HCAs (1 port active, one port disconnected per HCA), no WSD
> > >> or WinOF install, just bare mthca, ibbus & IPoIB.
> > >>
> > >> When both HCAs are enabled there are 4 IPoIB instances.
> > >>
> > >> When the 1st HCA as seen by PNP (HCA0 for discussion purposes) is
> > >> disabled, all 4 IPoIB instances are removed from the device manager
> > >> view along with the expected HCA0 disabled.
> > >> The 2nd HCA (HCA1) is still enabled with no IPoIB instances shown by
> > >> the device manager.
> > >>
> > >> The expected behavior when disabling HCA0 should be the 1st two IPoIB
> > >> instances [0 & 1] would be removed from the device manager view, with
> > >> the 2nd two IPoIB instances [2 & 3] remaining.
> > >> This is the case which exposes the ibbus bug where vstat no longer
> > >> works because \Devices\ibal has been removed as it's bound to the 1st
> > >> PNP seen HCA which is now disabled.
> > >>
> > >> If you reverse the disable order, such that HCA1 is disabled while
> > >> HCA0 remains enabled, the expected IPoIB instances [2 & 3] are
> > >> removed; while instances [0 & 1] remain.
> > >>
> > >> The problem occurs when cl_pnp() calls ibbus::port_mgr_pnp_cb() to
> > >> remove the IPoIB instances for HCA1; the previous call to
> > >> ibbus::port_mgr_pnp_cb() for HCA0 is correct.
> > >>
> > >> fdo_query_remove() [
> > >> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0)
> > >> bfi-0 CA 8025000002c90200 fdo_query_remove() ]
> > >> __query_remove() ]
> > >> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420, ext
> > >> FAB160E8, status 0
> > >> cl_pnp(): returned with status 0
> > >> cl_pnp() ]
> > >> port_mgr_pnp_cb() [
> > >> port_mgr_pnp_cb() ]
> > >> port_mgr_pnp_cb() [
> > >> port_mgr_port_remove() [
> > >> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
> > >> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8, ext
> > >> 81DA3C90, present 0, missing 0
> > >> port_mgr_port_remove() ]
> > >> port_mgr_pnp_cb() ]
> > >> port_mgr_pnp_cb() [
> > >> port_mgr_port_remove() [
> > >> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
> > >> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350, ext
> > >> 81DA3408, present 0, missing 0
> > >> port_mgr_port_remove() ]
> > >> port_mgr_pnp_cb() ]
> > >> iou_mgr_pnp_cb() [
> > >> iou_mgr_iou_remove() [
> > >> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
> > >> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0, missing 1 .
> > >> iou_mgr_iou_remove() ]
> > >> iou_mgr_pnp_cb() ]
> > >>
> > >> XXX - this PNP call for HCA1 should not have occurred when disabling
> > >> HCA0.
> > >>
> > >> port_mgr_port_remove() [
> > >> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
> > >> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8, ext
> > >> FED75E90, present 0, missing 0
> > >> port_mgr_port_remove() ]
> > >> port_mgr_pnp_cb() ]
> > >> port_mgr_pnp_cb() [
> > >> port_mgr_port_remove() [
> > >> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
> > >> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8, ext
> > >> FF881E90, present 0, missing 0
> > >> port_mgr_port_remove() ]
> > >> port_mgr_pnp_cb() ]
> > >> iou_mgr_pnp_cb() [
> > >> iou_mgr_iou_remove() [
> > >> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
> > >> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0, missing 1 .
> > >> iou_mgr_iou_remove() ]
> > >> iou_mgr_pnp_cb() ]
> > >>
> > >> XXX end of badness...
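
[Editor's sketch] One way to spot this symptom in a captured trace is to count how many distinct CA GUIDs had ports removed during a single disable. The C helper below is a hypothetical illustration, not part of any WinOF tool. It assumes trace lines shaped like the "bfi-N ca_guid 0x... port_num P ..." lines in the log above, and it tracks at most MAX_TRACKED_GUIDS GUIDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_TRACKED_GUIDS 8   /* arbitrary small bound for this sketch */

/* Returns the number of distinct ca_guid values that appear on
 * "bfi-N ca_guid 0x<hex> port_num P" lines, i.e. how many CAs had at
 * least one port removed. Lines in any other shape are skipped. */
static size_t count_cas_with_port_removal(const char *const lines[], size_t n_lines)
{
    unsigned long long seen[MAX_TRACKED_GUIDS];
    size_t n_seen = 0;

    for (size_t i = 0; i < n_lines; i++) {
        int bfi = 0;
        int port = 0;
        unsigned long long guid = 0;

        /* All three fields must match; IOU lines have no port_num. */
        if (sscanf(lines[i], "bfi-%d ca_guid 0x%llx port_num %d",
                   &bfi, &guid, &port) != 3) {
            continue;
        }

        size_t j;
        for (j = 0; j < n_seen; j++) {
            if (seen[j] == guid) {
                break;
            }
        }
        if (j == n_seen && n_seen < MAX_TRACKED_GUIDS) {
            seen[n_seen++] = guid;
        }
    }
    return n_seen;
}
```

A result greater than 1 for a trace taken while disabling a single HCA would indicate the behavior reported here.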
> > >>
> > >> Any ideas on the reasons why the 2nd port_mgr_port_remove() call was
> > >> invoked?
> > >> Is there some binding between HCA1 IPoIB ports and HCA0?
> > >>
> > >> Thanks,
> > >>
> > >> Stan.
> >
> >