[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?

Leonid Keller leonid at mellanox.co.il
Tue Mar 17 07:29:33 PDT 2009


Hi Stan,
I somehow missed your e-mail.
I'm OK with the patches; you may commit them.

> -----Original Message-----
> From: Reuven Amitai 
> Sent: Tuesday, March 17, 2009 3:53 PM
> To: Leonid Keller
> Subject: FW: [ofw] RE: ibbus disable on HCA0 erroneously 
> removes all IPoIB instances; including IPoIB ports on HCA1 ?
> 
>  
> 
> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org 
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
> Sent: Thursday, March 12, 2009 3:33 AM
> To: Leonid Keller; Anatoly Greenblatt
> Cc: ofw at lists.openfabrics.org
> Subject: [ofw] RE: ibbus disable on HCA0 erroneously removes 
> all IPoIB instances; including IPoIB ports on HCA1 ?
> 
> Hello,
>   In testing your patches, which I now discover have been 
> checked into svn, I discovered some good news and some not so 
> good news: (Winverbs & winmad filters enabled during testing)
> 
> 1) Your changes to bus_pnp.c in fdo_start() allowed me to fix 
> a long standing sore spot in the late binding of the HCA to a 
> BFI (Bus Filter Instance); please see attached patch files.
> BFI is now bound to HCA in fdo_start() along with 
> get_set_bfi_by_hca_guid() being replaced by get_bfi_by_hca_guid().
>  Bus_driver.h mods are formatting (ts=4).
> 
> 2) The mthca driver crashes during system shutdown in 
> mthca_query_device(); please see attached patch files.
>  hca_pnp.c mods are whitespace formatting (ts=4) along with 
> the correct driver name.
>  mthca_provider.c mods are in mthca_query_device(): 
> mthca_is_livefish(mdev) returns TRUE when mdev == NULL, so 
> the following dereference exploded:
>         props->vendor_id = mdev->ext->hcaConfig.VendorID;
> 
> I'm not sure how mdev ends up null? Shutdown timing perhaps?
> 
> I suspect similar mods to the mlx4 driver will need to be performed.
> 
> 3) I believe there are al_ifc reference counting problems 
> although they do not seem to cause observable problems.
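
The crash described in item 2 above comes down to a check that reports "livefish" for a missing device, followed by a dereference of that same device. Below is a minimal sketch of the pattern and one possible guard. The types and the names sketch_dev, sketch_is_livefish() and sketch_query_device() are hypothetical stand-ins, not the mthca sources, and the assumption that the livefish path fills in only identification fields is mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver structures; not the real mthca types. */
struct sketch_ext   { uint32_t vendor_id; };
struct sketch_dev   { int livefish; uint32_t max_qp; struct sketch_ext *ext; };
struct sketch_props { uint32_t vendor_id; uint32_t max_qp; };

/* Mirrors the behaviour reported above: a NULL device counts as livefish. */
int sketch_is_livefish(const struct sketch_dev *dev)
{
    return dev == NULL || dev->livefish;
}

/* Returns 0 on success, -1 when the device is already gone. */
int sketch_query_device(const struct sketch_dev *dev, struct sketch_props *props)
{
    assert(props != NULL);      /* a NULL output buffer is a caller bug */
    props->vendor_id = 0;
    props->max_qp = 0;

    /* Guard first: without it, the livefish branch below would
     * dereference dev->ext even though dev is NULL. */
    if (dev == NULL || dev->ext == NULL)
        return -1;

    if (sketch_is_livefish(dev)) {
        /* Assumed livefish path: report identification fields only. */
        props->vendor_id = dev->ext->vendor_id;
        return 0;
    }

    props->vendor_id = dev->ext->vendor_id;
    props->max_qp = dev->max_qp;
    return 0;
}
```

With the guard in place, a query that arrives after the device has gone away fails cleanly instead of faulting. It does not explain why mdev is NULL at shutdown.
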
> 
> If you approve of the ibbus.sys mods I will svn commit upon your OK.
> 
> Stan.
> 
> 
> Leonid Keller wrote:
> > Find attached a patch that removes the HCA's registration with IBAL.
> >
> > It should have been done anyway, independently of the reported problems.
> > Wrt the problems:
> >
> > 1.  (Stan) "ibbus disable on HCA0 erroneously removes all IPoIB 
> > instances; including IPoIB ports on HCA1 ?"
> > This patch seems to solve this problem when working without 
> > WinVerbs & WinMad.
> > With the Win* drivers one can get a crash when playing disable/enable 
> > with MLX4_HCA.
> > I believe it isn't related to the patch. I'll describe it in 
> > another thread.
> >
> > 2. (Anatoly) "winof 2.0.2: crash in ibbus.sys when running whql 
> > tests on mlx4 hca"
> > I don't think this patch will cause MLX4_HCA to pass pnpdtest, 
> > but maybe the crash will go away. I don't know exactly which case of 
> > pnpdtest caused the crash.
> > Anatoly, could you try it with the patch applied?
> >
> >
> >> -----Original Message-----
> >> From: Smith, Stan [mailto:stan.smith at intel.com]
> >> Sent: Saturday, February 21, 2009 1:26 AM
> >> To: Leonid Keller
> >> Cc: ofw at lists.openfabrics.org
> >> Subject: RE: ibbus disable on HCA0 erroneously removes all IPoIB 
> >> instances; including IPoIB ports on HCA1 ?
> >>
> >> Hello Leonid,
> >>   Thanks for taking the time to consider this curious ibbus/ipoib 
> >> problem.
> >>
> >> I understand what you are saying w.r.t. flows, although I'm confused 
> >> in that disabling the 1st HCA used to only disable the 1st two IPoIB 
> >> instances, not all 4. So I have to ask myself, what has changed to 
> >> induce this new behavior? WHQL patches perhaps?
> >>
> >> The previous disable HCA-0 behavior, granted it was incorrect for 
> >> other reasons, disabled IPoIB instances 0 & 1, correctly leaving IPoIB 
> >> instances 2 & 3 alone. The problem was that the user-accessible IBAL 
> >> device was bound to HCA-0, hence IBAL access was disabled when HCA-0 
> >> was disabled, even though HCA-1 was alive and well; vstat stopped 
> >> working.
> >> Upon discovery we talked about implementing a Control Device Object 
> >> to which the user-accessible IBAL device would be bound, thus 
> >> allowing HCAs to come and go (disable, enable) without breaking 
> >> user-mode access to IBAL (provided there was at least one HCA 
> >> enabled).
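
The Control Device Object idea above amounts to decoupling the lifetime of the user-visible IBAL device from any single HCA. Below is a minimal sketch of that bookkeeping, assuming a plain count of started HCAs. The struct and function names are hypothetical and not taken from ibbus, and a real driver would need locking around the counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical control-device state; not taken from ibbus. */
struct sketch_ctl_dev {
    int enabled_hcas;   /* HCAs currently started */
    int user_visible;   /* nonzero while user mode can open the device */
};

/* Called when any HCA starts: the control device becomes reachable. */
void sketch_hca_started(struct sketch_ctl_dev *ctl)
{
    assert(ctl != NULL);
    ctl->enabled_hcas++;
    ctl->user_visible = 1;
}

/* Called when any HCA is disabled or removed. */
void sketch_hca_removed(struct sketch_ctl_dev *ctl)
{
    assert(ctl != NULL && ctl->enabled_hcas > 0);
    ctl->enabled_hcas--;
    if (ctl->enabled_hcas == 0)
        ctl->user_visible = 0;   /* last HCA gone: nothing left to talk to */
}
```

In this model, disabling HCA-0 while HCA-1 is still started leaves user_visible set, which is the behavior vstat would need.
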
> >>
> >> With the current HCA-0 disable behavior, a Control Device Object for 
> >> the user-accessible IBAL device is not required, as disabling either 
> >> HCA device requires a reboot.
> >>
> >> About your comment 'And the right solution is to remove this 
> >> mechanism at all!', are you suggesting a conversion to KMDF PNP 
> >> framework?
> >>
> >> Thanks,
> >>
> >> Stan.
> >>
> >>
> >> Leonid Keller wrote:
> >>>  > Any ideas on the reasons why the 2nd port_mgr_port_remove() call 
> >>>  > was invoked?
> >>> My guess is that the problem is created by the HCA's 
> >>> mechanism of registration for IBAL arrival.
> >>> And the right solution is to remove this mechanism at all!
> >>>
> >>> Here is my theory:
> >>> As a reminder: IBAL historically sat under ROOT and could be 
> >>> loaded after the HCA driver. So the HCA driver registered with the OS 
> >>> for the arrival of the IBAL low interface. And this code still works!
> >>> What happens in this case?
> >>> Two HCA devices get started and make the above registration.
> >>> When the first IBAL instance is started, both HCA devices get 
> >>> notified about it and register themselves with this instance.
> >>> When we disable the first HCA, both HCAs get notified about the 
> >>> removal of IBAL (in __pnp_notify_ifc) and deregister the HCA from IBAL, 
> >>> which removes all IPoIB devices.
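
The theory above can be illustrated with a toy model in plain C: every HCA subscribes to IBAL interface events, so a removal event triggered by disabling one HCA is delivered to all of them. All toy_* names are hypothetical; in the drivers the corresponding path goes through __pnp_notify_ifc and ib_deregister_ca.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HCAS 4

/* Hypothetical model of one HCA. */
struct toy_hca {
    int registered;    /* registered with the IBAL instance */
    int ipoib_ports;   /* IPoIB devices created for this HCA */
};

/* Hypothetical list of HCAs subscribed to interface events. */
struct toy_bus {
    struct toy_hca *listeners[TOY_MAX_HCAS];
    size_t count;
};

/* Each HCA subscribes at start and registers with IBAL. */
int toy_subscribe(struct toy_bus *bus, struct toy_hca *hca)
{
    assert(bus != NULL && hca != NULL);
    if (bus->count >= TOY_MAX_HCAS)
        return -1;
    bus->listeners[bus->count++] = hca;
    hca->registered = 1;
    return 0;
}

/* Deregistering an HCA removes its IPoIB devices. */
void toy_deregister(struct toy_hca *hca)
{
    if (!hca->registered)
        return;
    hca->registered = 0;
    hca->ipoib_ports = 0;
}

/* Notification flow: the event is broadcast, so every subscriber deregisters. */
void toy_notify_ifc_removed(struct toy_bus *bus)
{
    for (size_t i = 0; i < bus->count; i++)
        toy_deregister(bus->listeners[i]);
}

/* Normal PnP flow: only the HCA being removed deregisters. */
void toy_remove_device(struct toy_hca *hca)
{
    toy_deregister(hca);
}

/* Total IPoIB devices still present across all subscribed HCAs. */
int toy_total_ipoib(const struct toy_bus *bus)
{
    int total = 0;
    for (size_t i = 0; i < bus->count; i++)
        total += bus->listeners[i]->ipoib_ports;
    return total;
}
```

With two HCAs of two ports each, calling toy_notify_ifc_removed() leaves no IPoIB devices at all, while toy_remove_device() on one HCA leaves the other HCA's two in place.
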
> >>>
> >>> So we now have two device-removal flows, which run 
> >>> simultaneously.
> >>> 1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
> >>> ibbus!cl_pnp
> >>>     ibbus!__remove
> >>>         ibbus!cl_do_remove              // Pass the IRP down
> >>>             mlx4_hca!cl_pnp             // IRP_MN_REMOVE_DEVICE
> >>>                 mlx4_hca!__remove
> >>>                     mlx4_hca!cl_do_remove
> >>>                         mlx4_hca!hca_release_resources
> >>>                             mlx4_hca!__hca_release_resources
> >>>                                 mlx4_hca!__hca_deregister
> >>>                                     ibbus!ib_deregister_ca
> >>>
> >>> 2. A flow caused by the notification of IBAL interface removal:
> >>> mlx4_hca!__pnp_notify_ifc
> >>>     ibbus!ib_deregister_ca
> >>>
> >>> Just FYI: the flow of ib_deregister_ca:
> >>> ib_deregister_ca
> >>>         destroying_ci_ca                        // from
> >>>                 p_ci_ca->obj.pfn_destroy sync_destroy_obj
> >>>                         destroy_obj
> >>>                                 destroying_ci_ca
> >>>                                         pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
> >>>         cl_async_proc_queue
> >>> And in another thread:
> >>>                 __pnp_process_remove_ca
> >>>                     __pnp_process_remove_port
> >>>                         __pnp_notify_user
> >>>                                 port_mgr_pnp_cb
> >>>                                         port_mgr_port_remove
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Smith, Stan [mailto:stan.smith at intel.com]
> >>>> Sent: Saturday, February 07, 2009 3:05 AM
> >>>> To: Leonid Keller
> >>>> Cc: ofw at lists.openfabrics.org
> >>>> Subject: ibbus disable on HCA0 erroneously removes all IPoIB 
> >>>> instances; including IPoIB ports on HCA1 ?
> >>>>
> >>>> Hello,
> >>>>   Recently I discovered some bad HCA disable behavior in a case which 
> >>>> used to work correctly.
> >>>>
> >>>> Has the disable behavior for HCA0 been changed recently such that 
> >>>> all existing IPoIB instances for all HCAs are removed?
> >>>>
> >>>> Details:
> >>>>
> >>>> For an x86 system using svn.1932 mthca.sys & ibbus.sys with two Mx 
> >>>> MT23108 HCAs (1 port active, one port disconnected per HCA), no WSD 
> >>>> or WinOF install, just bare mthca, ibbus & IPoIB.
> >>>>
> >>>> When both HCAs are enabled there are 4 IPoIB instances.
> >>>>
> >>>> When the 1st HCA as seen by PNP (HCA0 for discussion 
> >>>> purposes) is disabled, all 4 IPoIB instances are removed from the 
> >>>> device manager view along with the expected HCA0 disabled.
> >>>> The 2nd HCA (HCA1) is still enabled with no IPoIB instances shown 
> >>>> by the device manager.
> >>>>
> >>>> The expected behavior when disabling HCA0 is that the 1st two 
> >>>> IPoIB instances [0 & 1] are removed from the device manager 
> >>>> view, with the 2nd two IPoIB instances [2 & 3] remaining.
> >>>> This is the case which exposes the ibbus bug where vstat no longer 
> >>>> works because \Devices\ibal has been removed, as it's bound to the 
> >>>> 1st PNP-seen HCA, which is now disabled.
> >>>>
> >>>> If you reverse the disable order, such that HCA1 is disabled while 
> >>>> HCA0 remains enabled, the expected IPoIB instances [2 & 3] are 
> >>>> removed, while instances [0 & 1] remain.
> >>>>
> >>>> The problem occurs when cl_pnp() calls 
> >>>> ibbus::port_mgr_pnp_cb() to remove the IPoIB instances for HCA1; 
> >>>> the previous call to ibbus::port_mgr_pnp_cb() for HCA0 is correct.
> >>>>
> >>>> fdo_query_remove() [
> >>>> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0) bfi-0 CA 8025000002c90200
> >>>> fdo_query_remove() ]
> >>>> __query_remove() ]
> >>>> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420, ext FAB160E8, status 0
> >>>> cl_pnp(): returned with status 0
> >>>> cl_pnp() ]
> >>>> port_mgr_pnp_cb() [
> >>>> port_mgr_pnp_cb() ]
> >>>> port_mgr_pnp_cb() [
> >>>> port_mgr_port_remove() [
> >>>> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
> >>>> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8, ext 81DA3C90, present 0, missing 0
> >>>> port_mgr_port_remove() ]
> >>>> port_mgr_pnp_cb() ]
> >>>> port_mgr_pnp_cb() [
> >>>> port_mgr_port_remove() [
> >>>> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
> >>>> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350, ext 81DA3408, present 0, missing 0
> >>>> port_mgr_port_remove() ]
> >>>> port_mgr_pnp_cb() ]
> >>>> iou_mgr_pnp_cb() [
> >>>> iou_mgr_iou_remove() [
> >>>> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
> >>>> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0, missing 1 .
> >>>> iou_mgr_iou_remove() ]
> >>>> iou_mgr_pnp_cb() ]
> >>>>
> >>>> XXX - this PNP call for HCA1 should not have occurred when disabling 
> >>>> HCA0.
> >>>>
> >>>> port_mgr_port_remove() [
> >>>> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
> >>>> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8, ext FED75E90, present 0, missing 0
> >>>> port_mgr_port_remove() ]
> >>>> port_mgr_pnp_cb() ]
> >>>> port_mgr_pnp_cb() [
> >>>> port_mgr_port_remove() [
> >>>> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
> >>>> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8, ext FF881E90, present 0, missing 0
> >>>> port_mgr_port_remove() ]
> >>>> port_mgr_pnp_cb() ]
> >>>> iou_mgr_pnp_cb() [
> >>>> iou_mgr_iou_remove() [
> >>>> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
> >>>> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0, missing 1 .
> >>>> iou_mgr_iou_remove() ]
> >>>> iou_mgr_pnp_cb() ]
> >>>>
> >>>> XXX end of badness...
> >>>>
> >>>> Any ideas on the reasons why the 2nd port_mgr_port_remove() call was 
> >>>> invoked?
> >>>> Is there some binding between HCA1 IPoIB ports and HCA0?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Stan.
> 
> 


