[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?
Leonid Keller
leonid at mellanox.co.il
Thu Mar 5 06:15:33 PST 2009
Please find attached a patch that removes the HCA's registration with IBAL.
This should have been done anyway, independent of the reported problems.
Regarding the problems:
1. (Stan) "ibbus disable on HCA0 erroneously removes all IPoIB
instances; including IPoIB ports on HCA1 ?"
This patch appears to solve this problem when working without
WinVerbs and WinMad.
With the Win* drivers one can still get a crash when repeatedly
disabling and enabling MLX4_HCA.
I believe it is not related to the patch. I'll describe it in another
thread.
2. (Anatoly) "winof 2.0.2: crash in ibbus.sys when running whql
testsonmlx4hca"
I don't think this patch will cause MLX4_HCA to pass pnpdtest.
But maybe the crash will go away. I don't know exactly which pnpdtest
case caused the crash.
Anatoly, could you try it with the patch applied?
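
For readers who want the theory quoted below in compact form, here is a toy Python model of it. It is not driver code: the class and function names (Ibal, disable_hca, simulate) are invented for illustration and are not ibbus or mlx4_hca APIs. It only assumes, as the theory does, that every HCA subscribed to the IBAL interface notification deregisters itself when that notification fires, and that deregistering an HCA removes its IPoIB instances.

```python
# Toy model only: the names below are hypothetical, not ibbus/mlx4_hca APIs.


class Ibal:
    """Stand-in for the IBAL instance that tracks registered HCAs."""

    def __init__(self):
        self.ports = {}  # HCA name -> list of IPoIB instance names

    def register_ca(self, hca, num_ports=2):
        # Each registered HCA gets one IPoIB instance per port.
        self.ports[hca] = [f"{hca}-ipoib{n}" for n in range(1, num_ports + 1)]

    def deregister_ca(self, hca):
        # Deregistering an HCA removes its IPoIB instances; safe to repeat.
        self.ports.pop(hca, None)

    def ipoib_instances(self):
        return sorted(p for plist in self.ports.values() for p in plist)


def disable_hca(ibal, target, all_hcas, ifc_notification):
    """Disable one HCA, optionally with the interface-notification flow."""
    if ifc_notification:
        # Flow 2: every HCA subscribed to the IBAL interface notification
        # is told the interface is going away and deregisters itself.
        for hca in all_hcas:
            ibal.deregister_ca(hca)
    # Flow 1: the normal PnP remove path deregisters only the target HCA.
    ibal.deregister_ca(target)


def simulate(ifc_notification):
    """Register two 2-port HCAs, disable HCA0, return surviving instances."""
    ibal = Ibal()
    hcas = ["HCA0", "HCA1"]
    for hca in hcas:
        ibal.register_ca(hca)
    disable_hca(ibal, "HCA0", hcas, ifc_notification)
    return ibal.ipoib_instances()


if __name__ == "__main__":
    print("with notification flow:   ", simulate(True))
    print("without notification flow:", simulate(False))
```

With the notification flow enabled the model leaves no IPoIB instances at all, which matches the reported behavior; without it, only the two HCA1 instances survive, which is the expected result of disabling HCA0.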
> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com]
> Sent: Saturday, February 21, 2009 1:26 AM
> To: Leonid Keller
> Cc: ofw at lists.openfabrics.org
> Subject: RE: ibbus disable on HCA0 erroneously removes all
> IPoIB instances; including IPoIB ports on HCA1 ?
>
> Hello Leonid,
> Thanks for taking the time to consider this curious
> ibbus/ipoib problem.
>
> I understand what you are saying w.r.t. flows, although I'm
> confused in that disabling the 1st HCA used to only disable
> the 1st two IPoIB instances, not all 4. So I have to ask
> myself, what has changed to induce this new behavior? WHQL
> patches perhaps?
>
> The previous disable HCA-0 behavior, granted it was incorrect
> for other reasons, disabled IPoIB instance 0 & 1, correctly
> leaving IPoIB instances 2 & 3 alone. The problem was the user
> accessible IBAL device was bound to HCA-0, hence IBAL access
> was disabled when HCA-0 was disabled, even though HCA-1 was
> alive and well; vstat stopped working.
> Upon discovery we talked about implementing a Control Device
> Object which the user accessible IBAL device would be bound
> to, thus allowing HCAs to come and go (disable, enable)
> without breaking user-mode access to IBAL (provided there was
> at least one HCA enabled).
>
> With the current HCA-0 disable behavior, a Control Device
> Object for the user-accessible IBAL device is not required as
> disabling either HCA device requires a reboot.
>
> About your comment 'And the right solution is to remove this
> mechanism at all!', are you suggesting a conversion to KMDF
> PNP framework?
>
> Thanks,
>
> Stan.
>
>
> Leonid Keller wrote:
> > > Any ideas on the reasons why the 2nd port_mgr_port_remove() call
> > > was invoked?
> > My guess is that the problem is created by the HCA's mechanism of
> > registering for IBAL arrival.
> > And the right solution is to remove this mechanism altogether!
> >
> > Here is my theory:
> > As a reminder: historically IBAL sat under ROOT and could be
> > loaded after the HCA driver.
> > So the HCA driver registered with the OS for notification of the
> > arrival of the IBAL low interface.
> > And this code still works!
> > What happens in this case?
> > Two HCA devices get started and each makes the above registration.
> > When the first IBAL instance is started, both HCA devices are
> > notified about it and register themselves with this instance.
> > When we disable the first HCA, both HCAs are notified about the
> > removal of IBAL (in __pnp_notify_ifc) and deregister their HCA from
> > IBAL, which removes all IPoIB devices.
> >
> > So we now have two device-removal flows, which run
> > simultaneously.
> > 1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
> > ibbus!cl_pnp
> > ibbus!__remove
> > ibbus!cl_do_remove // Pass the IRP down
> > mlx4_hca!cl_pnp // IRP_MN_REMOVE_DEVICE
> > mlx4_hca!__remove
> > mlx4_hca!cl_do_remove
> > mlx4_hca!hca_release_resources
> > mlx4_hca!__hca_release_resources
> > mlx4_hca!__hca_deregister
> > ibbus!ib_deregister_ca
> >
> > 2. A flow, caused by the notification on IBAL interface remove:
> > mlx4_hca!__pnp_notify_ifc
> > ibbus!ib_deregister_ca
> >
> > Just FYI: the flow of ib_deregister_ca:
> > ib_deregister_ca
> > destroying_ci_ca // from p_ci_ca->obj.pfn_destroy
> > sync_destroy_obj
> > destroy_obj
> > destroying_ci_ca
> > pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
> > cl_async_proc_queue
> > And in other thread:
> > __pnp_process_remove_ca
> > __pnp_process_remove_port
> > __pnp_notify_user
> > port_mgr_pnp_cb
> > port_mgr_port_remove
> >
> >
> >> -----Original Message-----
> >> From: Smith, Stan [mailto:stan.smith at intel.com]
> >> Sent: Saturday, February 07, 2009 3:05 AM
> >> To: Leonid Keller
> >> Cc: ofw at lists.openfabrics.org
> >> Subject: ibbus disable on HCA0 erroneously removes all IPoIB
> >> instances; including IPoIB ports on HCA1 ?
> >>
> >> Hello,
> >> Recently I discovered some bad HCA disable behavior which used to
> >> work correctly?
> >>
> >> Has the disable behavior for HCA0 been changed recently such that all
> >> existing IPoIB instances for all HCAs are removed?
> >>
> >> Details:
> >>
> >> For an x86 system using svn.1932 mthca.sys & ibbus.sys with two Mx
> >> MT23108 HCAs (1 port active, one port disconnected per HCA), no WSD
> >> or WinOF install, just bare mthca, ibbus & IPoIB.
> >>
> >> When both HCAs are enabled there are 4 IPoIB instances.
> >>
> >> When the 1st HCA as seen by PNP (HCA0 for discussion purposes) is
> >> disabled, all 4 IPoIB instances are removed from the device manager
> >> view along with the expected HCA0 disabled.
> >> The 2nd HCA (HCA1) is still enabled with no IPoIB instances shown by
> >> the device manager.
> >>
> >> The expected behavior when disabling HCA0 should be the 1st two IPoIB
> >> instances [0 & 1] would be removed from the device manager view, with
> >> the 2nd two IPoIB instances [3 & 4] remaining.
> >> This is the case which exposes the ibbus bug where vstat no longer
> >> works because \Devices\ibal has been removed as it's bound to the 1st
> >> PNP seen HCA which is now disabled.
> >>
> >> If you reverse the disable order, such that HCA1 is disabled while
> >> HCA0 remains enabled, the expected IPoIB instances [3 & 4] are
> >> removed; while instances [0 & 1] remain.
> >>
> >> The problem occurs when cl_pnp() calls ibbus::port_mgr_pnp_cb() to
> >> remove the IPoIB instances for HCA1; the previous call to
> >> ibbus::port_mgr_pnp_cb() for HCA0 is correct.
> >>
> >> fdo_query_remove() [
> >> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0)
> >> bfi-0 CA 8025000002c90200 fdo_query_remove() ]
> >> __query_remove() ]
> >> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420, ext
> >> FAB160E8, status 0
> >> cl_pnp(): returned with status 0
> >> cl_pnp() ]
> >> port_mgr_pnp_cb() [
> >> port_mgr_pnp_cb() ]
> >> port_mgr_pnp_cb() [
> >> port_mgr_port_remove() [
> >> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
> >> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8, ext
> >> 81DA3C90, present 0, missing 0
> >> port_mgr_port_remove() ]
> >> port_mgr_pnp_cb() ]
> >> port_mgr_pnp_cb() [
> >> port_mgr_port_remove() [
> >> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
> >> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350, ext
> >> 81DA3408, present 0, missing 0
> >> port_mgr_port_remove() ]
> >> port_mgr_pnp_cb() ]
> >> iou_mgr_pnp_cb() [
> >> iou_mgr_iou_remove() [
> >> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
> >> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0, missing 1 .
> >> iou_mgr_iou_remove() ]
> >> iou_mgr_pnp_cb() ]
> >>
> >> XXX - this PNP call for HCA1 should not have occurred when disabling
> >> HCA0.
> >>
> >> port_mgr_port_remove() [
> >> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
> >> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8, ext
> >> FED75E90, present 0, missing 0
> >> port_mgr_port_remove() ]
> >> port_mgr_pnp_cb() ]
> >> port_mgr_pnp_cb() [
> >> port_mgr_port_remove() [
> >> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
> >> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8, ext
> >> FF881E90, present 0, missing 0
> >> port_mgr_port_remove() ]
> >> port_mgr_pnp_cb() ]
> >> iou_mgr_pnp_cb() [
> >> iou_mgr_iou_remove() [
> >> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
> >> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0, missing 1 .
> >> iou_mgr_iou_remove() ]
> >> iou_mgr_pnp_cb() ]
> >>
> >> XXX end of badness...
> >>
> >> Any ideas on the reasons why the 2nd port_mgr_port_remove() call was
> >> invoked?
> >> Is there some binding between HCA1 IPoIB ports and HCA0?
> >>
> >> Thanks,
> >>
> >> Stan.
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: remove_registration.patch
Type: application/octet-stream
Size: 71154 bytes
Desc: remove_registration.patch
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20090305/4a1a1d4c/attachment.obj>
-------------- next part --------------
An embedded message was scrubbed...
From: "Fab Tillier" <ftillier at windows.microsoft.com>
Subject: RE: [ofw] RE: winof 2.0.2: crash in ibbus.sys when running whql testsonmlx4hca
Date: Mon, 23 Feb 2009 20:48:19 +0200
Size: 13407
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20090305/4a1a1d4c/attachment.mht>