[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?

Leonid Keller leonid at mellanox.co.il
Tue Mar 24 05:31:15 PDT 2009


See inline 

> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com] 
> Sent: Thursday, March 12, 2009 3:33 AM
> To: Leonid Keller; Anatoly Greenblatt
> Cc: ofw at lists.openfabrics.org
> Subject: RE: ibbus disable on HCA0 erroneously removes all 
> IPoIB instances; including IPoIB ports on HCA1 ?
> 
> Hello,
>   In testing your patches, which I now discover have been 
> checked into svn, I discovered some good news and some not so 
> good news: (Winverbs & winmad filters enabled during testing)
> 
> 1) Your changes to bus_pnp.c in fdo_start() allowed me to fix 
> a long standing sore spot in the late binding of the HCA to a 
> BFI (Bus Filter Instance); please see attached patch files.
> BFI is now bound to HCA in fdo_start() along with 
> get_set_bfi_by_hca_guid() being replaced by get_bfi_by_hca_guid().
>  Bus_driver.h mods are formatting (ts=4).
> 
> 2) The mthca driver crashes during system shutdown in 
> mthca_query_device(); please see attached patch files.
>  hca_pnp.c mods are whitespace formatting (ts=4) along with 
> the correct driver name.
>  mthca_provider.c mods in mthca_query_device(), 
> mthca_is_livefish(mdev) returns TRUE when mdev == NULL, so 
> the following dereference exploded.
>         props->vendor_id = mdev->ext->hcaConfig.VendorID;
> 
> I'm not sure how mdev ends up null? Shutdown timing perhaps?

I'd suggest another explanation.
Before my patch, on PowerDown MTHCA would first deregister from IBAL
and then remove the device.
Now it just removes the device (see the call to mthca_remove_one() in
__DevicePowerDownWorkItem).
mthca_remove_one() sets mdev to 0 and then removes the device.
IBAL keeps working, and its CA polling thread performs query_ca, which
ultimately reaches mthca_query_device() with mdev = 0.
I guess you can check this theory by commenting out the call to
mthca_remove_one() in __DevicePowerDownWorkItem.
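
To illustrate the ordering I have in mind, here is a minimal sketch in
plain C. It is not the mthca code: the fake_* types and functions are
hypothetical stand-ins, and the vendor ID is just a sample value. It only
shows a query path being reached after the remove path has cleared the
device pointer, and how a NULL check turns the crash into an error return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real mthca structures. */
struct fake_mdev { unsigned vendor_id; };
struct fake_hca  { struct fake_mdev *mdev; };  /* NULL once the device is removed */

/* Models the power-down path: the device goes away, but the upper
 * layer (IBAL in the theory above) is not told about it. */
static void fake_remove_one(struct fake_hca *hca)
{
    assert(hca != NULL);
    hca->mdev = NULL;
}

/* Models the query path reached from a polling thread.
 * Returns 0 on success, -1 if the device has already been removed. */
static int fake_query_device(const struct fake_hca *hca, unsigned *vendor_id)
{
    assert(hca != NULL && vendor_id != NULL);
    if (hca->mdev == NULL)
        return -1;                      /* without this check the next line faults */
    *vendor_id = hca->mdev->vendor_id;
    return 0;
}
```

A guard like this only hides the symptom; if the theory is right, the
underlying issue is that the query can still arrive after the removal.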

> 
> I suspect similar mods to the mlx4 driver will need to be performed.
> 
> 3) I believe there are al_ifc reference counting problems 
> although they do not seem to cause observable problems.
> 
> If you approve of the ibbus.sys mods I will svn commit upon your OK.
> 
> Stan.
> 
> 
> Leonid Keller wrote:
> > Find attached a patch that removes the registration of the HCA with IBAL.
> >
> > It should have been done anyway independent of reported problems.
> > Wrt the problems:
> >
> > 1.  (Stan) "ibbus disable on HCA0 erroneously removes all IPoIB 
> > instances; including IPoIB ports on HCA1 ?"
> > This patch seems to solve this problem when working without 
> > WinVerbs&WinMad.
> > With the Win* drivers one can get a crash when playing disable/enable
> > with MLX4_HCA.
> > I believe it isn't related to the patch. I'll describe it in 
> > another thread.
> >
> > 2. (Anatoly) "winof 2.0.2: crash in ibbus.sys when running whql 
> > tests on mlx4 hca"
> > I don't think this patch will cause MLX4_HCA to pass pnpdtest.
> > But maybe the crash will go away. I don't know exactly which case of
> > pnpdtest caused the crash.
> > Anatoly, could you try it with the patch applied?
> >
> >
> >> -----Original Message-----
> >> From: Smith, Stan [mailto:stan.smith at intel.com]
> >> Sent: Saturday, February 21, 2009 1:26 AM
> >> To: Leonid Keller
> >> Cc: ofw at lists.openfabrics.org
> >> Subject: RE: ibbus disable on HCA0 erroneously removes all IPoIB 
> >> instances; including IPoIB ports on HCA1 ?
> >>
> >> Hello Leonid,
> >>   Thanks for taking the time to consider this curious ibbus/ipoib 
> >> problem.
> >>
> >> I understand what you are saying w.r.t. flows, although I'm confused
> >> in that disabling the 1st HCA used to only disable the 1st two IPoIB
> >> instances, not all 4. So I have to ask myself, what has changed to 
> >> induce this new behavior? WHQL patches perhaps?
> >>
> >> The previous disable HCA-0 behavior, granted it was incorrect for 
> >> other reasons, disabled IPoIB instance 0 & 1, correctly leaving IPoIB
> >> instances 2 & 3 alone. The problem was the user accessible IBAL 
> >> device was bound to HCA-0, hence IBAL access was disabled when HCA-0
> >> was disabled, even though HCA-1 was alive and well; vstat stopped 
> >> working.
> >> Upon discovery we talked about implementing a Control Device Object
> >> which the user accessible IBAL device would be bound to, thus 
> >> allowing HCAs to come and go (disable, enable) without breaking 
> >> user-mode access to IBAL (provided there was at least one HCA 
> >> enabled).
> >>
> >> With the current HCA-0 disable behavior, a Control Device Object for
> >> the user-accessible IBAL device is not required as disabling either
> >> HCA device requires a reboot.
> >>
> >> About your comment 'And the right solution is to remove this 
> >> mechanism at all!', are you suggesting a conversion to KMDF PNP 
> >> framework?
> >>
> >> Thanks,
> >>
> >> Stan.
> >>
> >>
> >> Leonid Keller wrote:
> >>>  > Any ideas on the reasons why the 2nd port_mgr_port_remove() call
> >>>  > was invoked?
> >>> My guess is that the problem is created by the HCA's 
> >>> mechanism of registration for IBAL arrival.
> >>> And the right solution is to remove this mechanism at all!
> >>>
> >>> Here is my theory:
> >>> As a reminder: IBAL historically sat under ROOT and could be 
> >>> loaded after the HCA driver. So the HCA driver registered with the OS
> >>> for the arrival of the IBAL low interface. And this code still works!
> >>> What happens in this case?
> >>> Two HCA devices get started and make the above registration.
> >>> When the first IBAL instance is started, both HCA devices get 
> >>> notification about it and register themselves with this instance.
> >>> When we disable the first HCA, both HCAs get notification about the
> >>> removal of IBAL (in __pnp_notify_ifc) and deregister the HCA from IBAL,
> >>> which removes all IPoIB devices.
> >>>
> >>> So we now have two device-removal flows, which run 
> >>> simultaneously.
> >>> 1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
> >>> ibbus!cl_pnp
> >>>   ibbus!__remove
> >>>     ibbus!cl_do_remove              // Pass the IRP down
> >>>       mlx4_hca!cl_pnp               // IRP_MN_REMOVE_DEVICE
> >>>         mlx4_hca!__remove
> >>>           mlx4_hca!cl_do_remove
> >>>             mlx4_hca!hca_release_resources
> >>>               mlx4_hca!__hca_release_resources
> >>>                 mlx4_hca!__hca_deregister
> >>>                   ibbus!ib_deregister_ca
> >>>
> >>> 2. A flow caused by the notification on IBAL interface removal:
> >>> mlx4_hca!__pnp_notify_ifc
> >>>   ibbus!ib_deregister_ca
> >>>
> >>> Just FYI: the flow of ib_deregister_ca:
> >>> ib_deregister_ca
> >>>         destroying_ci_ca                        // from
> >>>                 p_ci_ca->obj.pfn_destroy sync_destroy_obj
> >>>                         destroy_obj
> >>>                                 destroying_ci_ca
> >>>                                         pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
> >>>         cl_async_proc_queue
> >>> And in another thread:
> >>>                 __pnp_process_remove_ca __pnp_process_remove_port
> >>>                         __pnp_notify_user
> >>>                                 port_mgr_pnp_cb
> >>>                                         port_mgr_port_remove
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Smith, Stan [mailto:stan.smith at intel.com]
> >>>> Sent: Saturday, February 07, 2009 3:05 AM
> >>>> To: Leonid Keller
> >>>> Cc: ofw at lists.openfabrics.org
> >>>> Subject: ibbus disable on HCA0 erroneously removes all IPoIB 
> >>>> instances; including IPoIB ports on HCA1 ?
> >>>>
> >>>> Hello,
> >>>>   Recently I discovered some bad HCA disable behavior which used to
> >>>> work correctly?
> >>>>
> >>>> Has the disable behavior for HCA0 been changed recently such that
> >>>> all existing IPoIB instances for all HCAs are removed?
> >>>>
> >>>> Details:
> >>>>
> >>>> For an x86 system using svn.1932 mthca.sys & ibbus.sys with two Mx
> >>>> MT23108 HCAs (1 port active, one port disconnected per HCA), no WSD
> >>>> or WinOF install, just bare mthca, ibbus & IPoIB.
> >>>>
> >>>> When both HCAs are enabled there are 4 IPoIB instances.
> >>>>
> >>>> When the 1st HCA as seen by PNP (HCA0 for discussion
> >>>> purposes) is disabled, all 4 IPoIB instances are removed from the
> >>>> device manager view along with the expected HCA0 disabled.
> >>>> The 2nd HCA (HCA1) is still enabled with no IPoIB instances shown
> >>>> by the device manager.
> >>>>
> >>>> The expected behavior when disabling HCA0 should be the 1st two 
> >>>> IPoIB instances [0 & 1] would be removed from the device manager 
> >>>> view, with the 2nd two IPoIB instances [2 & 3] remaining.
> >>>> This is the case which exposes the ibbus bug where vstat no longer
> >>>> works because \Devices\ibal has been removed as it's bound to the
> >>>> 1st PNP seen HCA which is now disabled.
> >>>>
> >>>> If you reverse the disable order, such that HCA1 is disabled while
> >>>> HCA0 remains enabled, the expected IPoIB instances [2 & 3] are 
> >>>> removed; while instances [0 & 1] remain.
> >>>>
> >>>> The problem occurs when cl_pnp() calls
> >>>> ibbus::port_mgr_pnp_cb() to remove the IPoIB instances for HCA1; 
> >>>> the previous call to ibbus::port_mgr_pnp_cb() for HCA0 is correct.
> >>>>
> >>>> fdo_query_remove() [
> >>>> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0) 
> >>>> bfi-0 CA 8025000002c90200 fdo_query_remove() ]
> >>>> __query_remove() ]
> >>>> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420, ext 
> >>>> FAB160E8, status 0 cl_pnp(): returned with status 0
> >>>> cl_pnp() ]
> >>>> port_mgr_pnp_cb() [
> >>>> port_mgr_pnp_cb() ]
> >>>> port_mgr_pnp_cb() [
> >>>> port_mgr_port_remove() [
> >>>> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
> >>>> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8, ext 
> >>>> 81DA3C90, present 0, missing 0
> >>>> port_mgr_port_remove() ]
> >>>> port_mgr_pnp_cb() ]
> >>>> port_mgr_pnp_cb() [
> >>>> port_mgr_port_remove() [
> >>>> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
> >>>> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350, ext 
> >>>> 81DA3408, present 0, missing 0
> >>>> port_mgr_port_remove() ]
> >>>> port_mgr_pnp_cb() ]
> >>>> iou_mgr_pnp_cb() [
> >>>> iou_mgr_iou_remove() [
> >>>> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
> >>>> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0, 
> >>>> missing 1 . iou_mgr_iou_remove() ] iou_mgr_pnp_cb() ]
> >>>>
> >>>> XXX - this PNP call for HCA1 should not have occurred when disabling
> >>>> HCA0.
> >>>>
> >>>> port_mgr_port_remove() [
> >>>> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
> >>>> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8, ext 
> >>>> FED75E90, present 0, missing 0
> >>>> port_mgr_port_remove() ]
> >>>> port_mgr_pnp_cb() ]
> >>>> port_mgr_pnp_cb() [
> >>>> port_mgr_port_remove() [
> >>>> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
> >>>> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8, ext 
> >>>> FF881E90, present 0, missing 0
> >>>> port_mgr_port_remove() ]
> >>>> port_mgr_pnp_cb() ]
> >>>> iou_mgr_pnp_cb() [
> >>>> iou_mgr_iou_remove() [
> >>>> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
> >>>> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0, 
> >>>> missing 1 . iou_mgr_iou_remove() ] iou_mgr_pnp_cb() ]
> >>>>
> >>>> XXX end of badness...
> >>>>
> >>>> Any ideas on the reasons why the 2nd port_mgr_port_remove() call was
> >>>> invoked?
> >>>> Is there some binding between HCA1 IPoIB ports and HCA0?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Stan.
> 
> 
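
For reference, the double-removal theory quoted above (both HCAs receive
the IBAL interface-removal notification and both deregister) can be
modelled with a toy sketch in C. Everything here is hypothetical: the
model_* names are invented for illustration and do not correspond to
ibbus or HCA driver code. It only contrasts a notification that is
delivered to every subscriber with a removal that is confined to the HCA
being disabled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each HCA is registered with IBAL and has some
 * number of IPoIB instances hanging off it. */
struct model_hca {
    int registered;     /* 1 while registered with IBAL */
    int ipoib_ports;    /* IPoIB instances currently present */
};

/* Deregistering an HCA removes its IPoIB instances. */
static void model_deregister(struct model_hca *hca)
{
    assert(hca != NULL);
    hca->registered = 0;
    hca->ipoib_ports = 0;
}

/* Behaviour described in the theory: the interface-removal notification
 * reaches every HCA that subscribed, not only the one being disabled. */
static void model_notify_all(struct model_hca hcas[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        model_deregister(&hcas[i]);
}

/* Behaviour without the notification mechanism: only the normal PnP
 * remove path runs, and only for the HCA being disabled. */
static void model_pnp_remove(struct model_hca hcas[], size_t n, size_t target)
{
    assert(target < n);
    model_deregister(&hcas[target]);
}

/* Counts the IPoIB instances still present across all HCAs. */
static int model_total_ports(const struct model_hca hcas[], size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += hcas[i].ipoib_ports;
    return total;
}
```

With two HCAs of two ports each, model_notify_all() leaves no IPoIB
instances at all, while model_pnp_remove() on the first HCA leaves the
second HCA's two instances in place.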


