[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?

Leonid Keller leonid at mellanox.co.il
Thu Mar 5 06:15:33 PST 2009


Find attached a patch that removes the registration of the HCA with IBAL.

It should have been done anyway, independent of the reported problems.
Regarding the problems:

1.  (Stan) "ibbus disable on HCA0 erroneously removes all IPoIB
instances; including IPoIB ports on HCA1 ?"
This patch seems to solve this problem when working without
WinVerbs and WinMad.
With the Win* drivers one can get a crash when repeatedly disabling and
enabling MLX4_HCA.
I believe it is not related to the patch. I'll describe it in another
thread.

2. (Anatoly) "winof 2.0.2: crash in ibbus.sys when running whql
tests on mlx4 hca"
I don't think this patch will cause MLX4_HCA to pass pnpdtest,
but maybe the crash will go away. I don't know exactly which pnpdtest
case caused the crash.
Anatoly, could you try it with the patch applied?
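
To illustrate the theory quoted below (every HCA is notified when the
IBAL interface goes away, so every HCA deregisters), here is a minimal
user-mode sketch in C. It is only a model under stated assumptions: the
names hca_model, model_deregister_ca, model_disable_hca and
ipoib_left_after_disabling are hypothetical and do not exist in ibbus or
mlx4_hca, and the two flows run one after the other here rather than
simultaneously.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_HCAS      2
#define PORTS_PER_HCA 2

/* Hypothetical model of one HCA; none of these names come from the drivers. */
typedef struct {
    bool registered_with_ibal; /* set on "register", cleared on "deregister" */
    bool listens_for_ifc;      /* models the IBAL interface notification registration */
    int  ipoib_instances;      /* IPoIB devices currently present for this HCA */
} hca_model;

static hca_model hcas[NUM_HCAS];

/* Models the effect of ib_deregister_ca: the CA's IPoIB devices go away. */
static void model_deregister_ca(hca_model *h)
{
    if (!h->registered_with_ibal)
        return; /* already deregistered by the other flow */
    h->registered_with_ibal = false;
    h->ipoib_instances = 0;
}

/* Start both HCAs, each with one IPoIB instance per port. */
static void model_start_all(bool with_notification)
{
    for (int i = 0; i < NUM_HCAS; i++) {
        hcas[i].registered_with_ibal = true;
        hcas[i].listens_for_ifc = with_notification;
        hcas[i].ipoib_instances = PORTS_PER_HCA;
    }
}

/* Disable one HCA and run both removal flows described in the quoted mail. */
static void model_disable_hca(int idx)
{
    assert(idx >= 0 && idx < NUM_HCAS);

    /* Flow 2: the interface-removal notification reaches every listener,
     * not only the HCA that is being disabled. */
    for (int i = 0; i < NUM_HCAS; i++) {
        if (hcas[i].listens_for_ifc)
            model_deregister_ca(&hcas[i]);
    }

    /* Flow 1: the normal remove path of the disabled HCA itself. */
    model_deregister_ca(&hcas[idx]);
}

/* Returns how many IPoIB instances survive after disabling HCA idx. */
int ipoib_left_after_disabling(int idx, bool with_notification)
{
    int total = 0;

    model_start_all(with_notification);
    model_disable_hca(idx);
    for (int i = 0; i < NUM_HCAS; i++)
        total += hcas[i].ipoib_instances;
    return total;
}
```

In this model, ipoib_left_after_disabling(0, true) returns 0 (all four
instances are gone), while ipoib_left_after_disabling(0, false) returns 2
(only the instances of the disabled HCA are removed), which is the
behavior expected once the notification registration is removed.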


> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com] 
> Sent: Saturday, February 21, 2009 1:26 AM
> To: Leonid Keller
> Cc: ofw at lists.openfabrics.org
> Subject: RE: ibbus disable on HCA0 erroneously removes all 
> IPoIB instances; including IPoIB ports on HCA1 ?
> 
> Hello Leonid,
>   Thanks for taking the time to consider this curious 
> ibbus/ipoib problem.
> 
> I understand what you are saying w.r.t. flows, although I'm 
> confused in that disabling the 1st HCA used to only disable 
> the 1st two IPoIB instances, not all 4. So I have to ask 
> myself, what has changed to induce this new behavior? WHQL 
> patches perhaps?
> 
> The previous disable HCA-0 behavior, granted it was incorrect 
> for other reasons, disabled IPoIB instance 0 & 1, correctly 
> leaving IPoIB instances 2 & 3 alone. The problem was the user 
> accessible IBAL device was bound to HCA-0, hence IBAL access 
> was disabled when HCA-0 was disabled, even though HCA-1 was 
> alive and well; vstat stopped working.
> Upon discovery we talked about implementing a Control Device 
> Object which the user accessible IBAL device would be bound 
> to, thus allowing HCAs to come and go (disable, enable) 
> without breaking user-mode access to IBAL (provided there was 
> at least one HCA enabled).
> 
> With the current HCA-0 disable behavior, a Control Device 
> Object for the user-accessible IBAL device is not required as 
> disabling either HCA device requires a reboot.
> 
> About your comment 'And the right solution is to remove this 
> mechanism at all!', are you suggesting a conversion to KMDF 
> PNP framework?
> 
> Thanks,
> 
> Stan.
> 
> 
> Leonid Keller wrote:
> >> Any ideas on the reasons why the 2nd port_mgr_port_remove() call
> >> was invoked?
> > My guess is that the problem is created by the HCA's mechanism of
> > registration for IBAL arrival.
> > And the right solution is to remove this mechanism at all!
> >
> > Here is my theory:
> > As a reminder: IBAL historically sat under ROOT and could be
> > loaded after the HCA driver.
> > So the HCA driver registered with the OS for notification of the
> > arrival of the IBAL low interface.
> > And this code still runs!
> > What happens in this case?
> > Two HCA devices get started and each makes the above registration.
> > When the first IBAL instance is started, both HCA devices get
> > notified about it and register themselves with this instance.
> > When we disable the first HCA, both HCAs get notified about the
> > removal of IBAL (in __pnp_notify_ifc) and deregister from IBAL,
> > which removes all IPoIB devices.
> >
> > So we now have two device-removal flows, which run
> > simultaneously.
> > 1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
> > ibbus!cl_pnp
> >   ibbus!__remove
> >     ibbus!cl_do_remove                  // Pass the IRP down
> >       mlx4_hca!cl_pnp                   // IRP_MN_REMOVE_DEVICE
> >         mlx4_hca!__remove
> >           mlx4_hca!cl_do_remove
> >             mlx4_hca!hca_release_resources
> >               mlx4_hca!__hca_release_resources
> >                 mlx4_hca!__hca_deregister
> >                   ibbus!ib_deregister_ca
> >
> > 2. A flow caused by the notification on IBAL interface removal:
> > mlx4_hca!__pnp_notify_ifc
> >   ibbus!ib_deregister_ca
> >
> > Just FYI: the flow of ib_deregister_ca:
> > ib_deregister_ca
> >         destroying_ci_ca                // from p_ci_ca->obj.pfn_destroy
> >                 sync_destroy_obj
> >                         destroy_obj
> >                                 destroying_ci_ca
> >                                         pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
> >                                                 cl_async_proc_queue
> > And in another thread:
> >         __pnp_process_remove_ca
> >                 __pnp_process_remove_port
> >                         __pnp_notify_user
> >                                 port_mgr_pnp_cb
> >                                         port_mgr_port_remove
> >
> >
> >> -----Original Message-----
> >> From: Smith, Stan [mailto:stan.smith at intel.com]
> >> Sent: Saturday, February 07, 2009 3:05 AM
> >> To: Leonid Keller
> >> Cc: ofw at lists.openfabrics.org
> >> Subject: ibbus disable on HCA0 erroneously removes all IPoIB 
> >> instances; including IPoIB ports on HCA1 ?
> >>
> >> Hello,
> >>   Recently I discovered some bad HCA disable behavior which used to
> >> work correctly?
> >>
> >> Has the disable behavior for HCA0 been changed recently such that all
> >> existing IPoIB instances for all HCAs are removed?
> >>
> >> Details:
> >>
> >> For an x86 system using svn.1932 mthca.sys & ibbus.sys with two Mx
> >> MT23108 HCAs (1 port active, one port disconnected per HCA), no WSD
> >> or WinOF install, just bare mthca, ibbus & IPoIB.
> >>
> >> When both HCAs are enabled there are 4 IPoIB instances.
> >>
> >> When the 1st HCA as seen by PNP (HCA0 for discussion purposes) is
> >> disabled, all 4 IPoIB instances are removed from the device manager
> >> view along with the expected HCA0 disabled.
> >> The 2nd HCA (HCA1) is still enabled with no IPoIB instances shown by
> >> the device manager.
> >>
> >> The expected behavior when disabling HCA0 is that the 1st two IPoIB
> >> instances [0 & 1] would be removed from the device manager view, with
> >> the 2nd two IPoIB instances [2 & 3] remaining.
> >> This is the case which exposes the ibbus bug where vstat no longer
> >> works because \Devices\ibal has been removed as it's bound to the 1st
> >> PNP seen HCA which is now disabled.
> >>
> >> If you reverse the disable order, such that HCA1 is disabled while
> >> HCA0 remains enabled, the expected IPoIB instances [2 & 3] are
> >> removed; while instances [0 & 1] remain.
> >>
> >> The problem occurs when cl_pnp() calls ibbus::port_mgr_pnp_cb() to
> >> remove the IPoIB instances for HCA1; the previous call to
> >> ibbus::port_mgr_pnp_cb() for HCA0 is correct.
> >>
> >> fdo_query_remove() [
> >> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0) 
> >> bfi-0 CA 8025000002c90200 fdo_query_remove() ]
> >> __query_remove() ]
> >> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420, ext 
> >> FAB160E8, status 0
> >> cl_pnp(): returned with status 0
> >> cl_pnp() ]
> >> port_mgr_pnp_cb() [
> >> port_mgr_pnp_cb() ]
> >> port_mgr_pnp_cb() [
> >> port_mgr_port_remove() [
> >> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
> >> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8, ext 
> >> 81DA3C90, present 0, missing 0
> >> port_mgr_port_remove() ]
> >> port_mgr_pnp_cb() ]
> >> port_mgr_pnp_cb() [
> >> port_mgr_port_remove() [
> >> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
> >> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350, ext 
> >> 81DA3408, present 0, missing 0
> >> port_mgr_port_remove() ]
> >> port_mgr_pnp_cb() ]
> >> iou_mgr_pnp_cb() [
> >> iou_mgr_iou_remove() [
> >> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
> >> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0, missing 1 .
> >> iou_mgr_iou_remove() ]
> >> iou_mgr_pnp_cb() ]
> >>
> >> XXX - this PNP call for HCA1 should not have occurred when disabling
> >> HCA0.
> >>
> >> port_mgr_port_remove() [
> >> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
> >> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8, ext 
> >> FED75E90, present 0, missing 0
> >> port_mgr_port_remove() ]
> >> port_mgr_pnp_cb() ]
> >> port_mgr_pnp_cb() [
> >> port_mgr_port_remove() [
> >> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
> >> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8, ext 
> >> FF881E90, present 0, missing 0
> >> port_mgr_port_remove() ]
> >> port_mgr_pnp_cb() ]
> >> iou_mgr_pnp_cb() [
> >> iou_mgr_iou_remove() [
> >> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
> >> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0, missing 1 .
> >> iou_mgr_iou_remove() ]
> >> iou_mgr_pnp_cb() ]
> >>
> >> XXX end of badness...
> >>
> >> Any ideas on the reasons why the 2nd port_mgr_port_remove() call was
> >> invoked?
> >> Is there some binding between HCA1 IPoIB ports and HCA0?
> >>
> >> Thanks,
> >>
> >> Stan.
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: remove_registration.patch
Type: application/octet-stream
Size: 71154 bytes
Desc: remove_registration.patch
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20090305/4a1a1d4c/attachment.obj>
-------------- next part --------------
An embedded message was scrubbed...
From: "Fab Tillier" <ftillier at windows.microsoft.com>
Subject: RE: [ofw] RE: winof 2.0.2: crash in ibbus.sys when running whql tests on mlx4 hca
Date: Mon, 23 Feb 2009 20:48:19 +0200
Size: 13407
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20090305/4a1a1d4c/attachment.mht>

