[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?
Smith, Stan
stan.smith at intel.com
Tue Mar 10 07:22:20 PDT 2009
Hello,
Initial building and functional DAPL/HPC tests are proceeding OK; to be expected.
Will be investigating HCA disable interactions Wednesday.
Stan.
Leonid Keller wrote:
> Applied in svn revision 2019.
>
>> -----Original Message-----
>> From: Leonid Keller
>> Sent: Thursday, March 05, 2009 4:16 PM
>> To: 'Smith, Stan'; 'Anatoly Greenblatt'
>> Cc: ofw at lists.openfabrics.org
>> Subject: RE: ibbus disable on HCA0 erroneously removes all
>> IPoIB instances; including IPoIB ports on HCA1 ?
>>
>> Find attached a patch that removes the HCA's registration with IBAL.
>>
>> It should have been done anyway, independent of the reported problems.
>> Wrt the problems:
>>
>> 1. (Stan) "ibbus disable on HCA0 erroneously removes all
>> IPoIB instances; including IPoIB ports on HCA1 ?"
>> This patch seems to solve this problem when working
>> without WinVerbs & WinMad.
>> With the Win* drivers one can get a crash when playing disable/enable
>> with MLX4_HCA. I believe it is not related to the patch. I'll describe
>> it in another thread.
>>
>> 2. (Anatoly) "winof 2.0.2: crash in ibbus.sys when running
>> whql tests on mlx4 hca"
>> I don't think this patch will cause MLX4_HCA to pass pnpdtest.
>> But maybe the crash will go away. I don't know exactly which
>> pnpdtest case caused the crash.
>> Anatoly, could you try it with the patch applied?
>>
>>
>>> -----Original Message-----
>>> From: Smith, Stan [mailto:stan.smith at intel.com]
>>> Sent: Saturday, February 21, 2009 1:26 AM
>>> To: Leonid Keller
>>> Cc: ofw at lists.openfabrics.org
>>> Subject: RE: ibbus disable on HCA0 erroneously removes all IPoIB
>>> instances; including IPoIB ports on HCA1 ?
>>>
>>> Hello Leonid,
>>> Thanks for taking the time to consider this curious ibbus/ipoib
>>> problem.
>>>
>>> I understand what you are saying w.r.t. flows, although I'm confused
>>> in that disabling the 1st HCA used to only disable the 1st two IPoIB
>>> instances, not all 4. So I have to ask myself, what has changed to
>>> induce this new behavior? WHQL patches perhaps?
>>>
>>> The previous disable HCA-0 behavior, granted it was incorrect for
>>> other reasons, disabled IPoIB instance 0 & 1, correctly leaving
>>> IPoIB instances 2 & 3 alone. The problem was the user accessible
>>> IBAL device was bound to HCA-0, hence IBAL access was disabled when
>>> HCA-0 was disabled, even though HCA-1 was alive and well; vstat
>>> stopped working. Upon discovery we talked about implementing a
>>> Control Device Object which the user accessible IBAL device would
>>> be bound to, thus allowing HCAs to come and go (disable, enable)
>>> without breaking user-mode access to IBAL (provided there was at
>>> least one HCA enabled).
>>>
>>> With the current HCA-0 disable behavior, a Control Device Object for
>>> the user-accessible IBAL device is not required as disabling either
>>> HCA device requires a reboot.
>>>
>>> About your comment 'And the right solution is to remove this
>>> mechanism at all!', are you suggesting a conversion to KMDF PNP
>>> framework?
>>>
>>> Thanks,
>>>
>>> Stan.
>>>
>>>
>>> Leonid Keller wrote:
>>>> > Any ideas on the reasons why the 2nd port_mgr_port_remove()
>>>> > call was invoked?
>>>> My guess is that the problem is created by the HCA's
>>>> mechanism of registration for IBAL arrival.
>>>> And the right solution is to remove this mechanism at all!
>>>>
>>>> Here is my theory:
>>>> As a reminder: historically IBAL sat under ROOT and could be
>>>> loaded after the HCA driver. So the HCA driver registered with the
>>>> OS for the arrival of the IBAL low interface. And this code still works!
>>>> What happens in this case?
>>>> Two HCA devices get started and make the above registration.
>>>> When the first IBAL instance is started, both HCA devices are
>>>> notified about it and register themselves with this instance.
>>>> When we disable the first HCA, both HCAs are notified about the
>>>> removal of IBAL (in __pnp_notify_ifc) and deregister from
>>>> IBAL, which removes all IPoIB devices.
>>>>
>>>> So we now have two device-removal flows, which run
>>>> simultaneously.
>>>> 1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
>>>>    ibbus!cl_pnp
>>>>    ibbus!__remove
>>>>    ibbus!cl_do_remove        // Pass the IRP down
>>>>    mlx4_hca!cl_pnp           // IRP_MN_REMOVE_DEVICE
>>>>    mlx4_hca!__remove
>>>>    mlx4_hca!cl_do_remove
>>>>    mlx4_hca!hca_release_resources
>>>>    mlx4_hca!__hca_release_resources
>>>>    mlx4_hca!__hca_deregister
>>>>    ibbus!ib_deregister_ca
>>>>
>>>> 2. A flow caused by the notification of IBAL interface removal:
>>>>    mlx4_hca!__pnp_notify_ifc
>>>>    ibbus!ib_deregister_ca
>>>>
>>>> Just FYI: the flow of ib_deregister_ca:
>>>>    ib_deregister_ca
>>>>    destroying_ci_ca          // from p_ci_ca->obj.pfn_destroy
>>>>    sync_destroy_obj
>>>>    destroy_obj
>>>>    destroying_ci_ca
>>>>    pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
>>>>    cl_async_proc_queue
>>>> And in another thread:
>>>>    __pnp_process_remove_ca
>>>>    __pnp_process_remove_port
>>>>    __pnp_notify_user
>>>>    port_mgr_pnp_cb
>>>>    port_mgr_port_remove
>>>>
>>>>
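[Editor's sketch] The theory quoted above can be illustrated with a small toy model. This is not driver code: ToyIbal, ToyHca, the "host" attribute and run() are hypothetical names invented for illustration, and the model assumes that the IBAL interface removal notification is broadcast only when the first-registered HCA is disabled. Only the roles of __pnp_notify_ifc and ib_deregister_ca are taken from the thread.

```python
class ToyIbal:
    """Toy stand-in for the IBAL instance (hypothetical, not the real API)."""

    def __init__(self):
        self.ipoib = {}       # CA name -> list of IPoIB instance numbers
        self.listeners = []   # callbacks registered for interface removal
        self.host = None      # assumption: interface is tied to the first CA

    def register_ca(self, ca, instances):
        if self.host is None:
            self.host = ca
        self.ipoib[ca] = list(instances)

    def deregister_ca(self, ca):
        # Plays the role of ib_deregister_ca: drops the CA's IPoIB ports.
        self.ipoib.pop(ca, None)

    def remaining(self):
        return sorted(n for insts in self.ipoib.values() for n in insts)


class ToyHca:
    def __init__(self, name, instances, ibal, listen_for_ifc):
        self.name = name
        self.ibal = ibal
        ibal.register_ca(name, instances)
        if listen_for_ifc:
            # The registration mechanism the patch is said to remove.
            ibal.listeners.append(self.on_ifc_removed)

    def on_ifc_removed(self):
        # Plays the role of __pnp_notify_ifc on interface removal.
        self.ibal.deregister_ca(self.name)

    def disable(self):
        # Flow 1: normal PnP remove path for this HCA only.
        self.ibal.deregister_ca(self.name)
        # Flow 2: if this HCA hosts the interface, every listener fires.
        if self.ibal.host == self.name:
            for callback in list(self.ibal.listeners):
                callback()


def run(disable_index, listen_for_ifc):
    """Build two HCAs, disable one, return the surviving IPoIB instances."""
    ibal = ToyIbal()
    hcas = [
        ToyHca("HCA0", [0, 1], ibal, listen_for_ifc),
        ToyHca("HCA1", [2, 3], ibal, listen_for_ifc),
    ]
    hcas[disable_index].disable()
    return ibal.remaining()


if __name__ == "__main__":
    print("with notification, disable HCA0:", run(0, True))
    print("with notification, disable HCA1:", run(1, True))
    print("without notification, disable HCA0:", run(0, False))
```

Under these assumptions the model reproduces both reported observations: with the notification in place, disabling HCA0 leaves no IPoIB instances while disabling HCA1 leaves instances 0 and 1; without it, disabling HCA0 leaves instances 2 and 3.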
>>>>> -----Original Message-----
>>>>> From: Smith, Stan [mailto:stan.smith at intel.com]
>>>>> Sent: Saturday, February 07, 2009 3:05 AM
>>>>> To: Leonid Keller
>>>>> Cc: ofw at lists.openfabrics.org
>>>>> Subject: ibbus disable on HCA0 erroneously removes all IPoIB
>>>>> instances; including IPoIB ports on HCA1 ?
>>>>>
>>>>> Hello,
>>>>> Recently I discovered some bad HCA disable behavior which used
>>>>> to work correctly?
>>>>>
>>>>> Has the disable behavior for HCA0 been changed recently such that
>>>>> all existing IPoIB instances for all HCAs are removed?
>>>>>
>>>>> Details:
>>>>>
>>>>> For an x86 system using svn.1932 mthca.sys & ibbus.sys with two Mx
>>>>> MT23108 HCAs (1 port active, one port disconnected per HCA), no
>>>>> WSD or WinOF install, just bare mthca, ibbus & IPoIB.
>>>>>
>>>>> When both HCAs are enabled there are 4 IPoIB instances.
>>>>>
>>>>> When the 1st HCA as seen by PNP (HCA0 for discussion
>>>>> purposes) is disabled, all 4 IPoIB instances are removed from the
>>>>> device manager view along with the expected HCA0 disabled.
>>>>> The 2nd HCA (HCA1) is still enabled with no IPoIB instances shown
>>>>> by the device manager.
>>>>>
>>>>> The expected behavior when disabling HCA0 is that the 1st two
>>>>> IPoIB instances [0 & 1] are removed from the device manager
>>>>> view, with the 2nd two IPoIB instances [2 & 3] remaining.
>>>>> This is the case which exposes the ibbus bug where vstat no longer
>>>>> works because \Device\ibal has been removed as it's bound to the
>>>>> 1st PNP seen HCA which is now disabled.
>>>>>
>>>>> If you reverse the disable order, such that HCA1 is disabled while
>>>>> HCA0 remains enabled, the expected IPoIB instances [2 & 3] are
>>>>> removed, while instances [0 & 1] remain.
>>>>>
>>>>> The problem occurs when cl_pnp() calls
>>>>> ibbus::port_mgr_pnp_cb() to remove the IPoIB instances for HCA1;
>>>>> the previous call to ibbus::port_mgr_pnp_cb() for HCA0 is correct.
>>>>>
>>>>> fdo_query_remove() [
>>>>> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0)
>>>>> bfi-0 CA 8025000002c90200
>>>>> fdo_query_remove() ]
>>>>> __query_remove() ]
>>>>> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420, ext
>>>>> FAB160E8, status 0
>>>>> cl_pnp(): returned with status 0
>>>>> cl_pnp() ]
>>>>> port_mgr_pnp_cb() [
>>>>> port_mgr_pnp_cb() ]
>>>>> port_mgr_pnp_cb() [
>>>>> port_mgr_port_remove() [
>>>>> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
>>>>> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8, ext
>>>>> 81DA3C90, present 0, missing 0
>>>>> port_mgr_port_remove() ]
>>>>> port_mgr_pnp_cb() ]
>>>>> port_mgr_pnp_cb() [
>>>>> port_mgr_port_remove() [
>>>>> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
>>>>> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350, ext
>>>>> 81DA3408, present 0, missing 0
>>>>> port_mgr_port_remove() ]
>>>>> port_mgr_pnp_cb() ]
>>>>> iou_mgr_pnp_cb() [
>>>>> iou_mgr_iou_remove() [
>>>>> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
>>>>> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0,
>>>>> missing 1 .
>>>>> iou_mgr_iou_remove() ]
>>>>> iou_mgr_pnp_cb() ]
>>>>>
>>>>> XXX - this PNP call for HCA1 should not have occurred when
>>>>> disabling HCA0.
>>>>>
>>>>> port_mgr_port_remove() [
>>>>> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
>>>>> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8, ext
>>>>> FED75E90, present 0, missing 0
>>>>> port_mgr_port_remove() ]
>>>>> port_mgr_pnp_cb() ]
>>>>> port_mgr_pnp_cb() [
>>>>> port_mgr_port_remove() [
>>>>> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
>>>>> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8, ext
>>>>> FF881E90, present 0, missing 0
>>>>> port_mgr_port_remove() ]
>>>>> port_mgr_pnp_cb() ]
>>>>> iou_mgr_pnp_cb() [
>>>>> iou_mgr_iou_remove() [
>>>>> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
>>>>> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0,
>>>>> missing 1 .
>>>>> iou_mgr_iou_remove() ]
>>>>> iou_mgr_pnp_cb() ]
>>>>>
>>>>> XXX end of badness...
>>>>>
>>>>> Any ideas on the reasons why the 2nd port_mgr_port_remove() call was
>>>>> invoked?
>>>>> Is there some binding between HCA1 IPoIB ports and HCA0?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Stan.
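[Editor's sketch] When reading a trace like the one quoted above, it can help to group the port removals by CA GUID so that removals on an HCA that was not being disabled stand out. The following is a minimal sketch, assuming the trace keeps the "bfi-N ca_guid 0x... port_num N" line shape seen in this thread; removed_ports() and ENTRY are hypothetical helper names, and the sample lines are copied from the trace above.

```python
import re
from collections import defaultdict

# Matches trace lines such as:
#   bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
# Lines without a port_num field (for example the iou_mgr ones) are skipped.
ENTRY = re.compile(r"(bfi-\d+) ca_guid (0x[0-9a-fA-F]+) port_num (\d+)")


def removed_ports(trace):
    """Return {ca_guid: [port numbers]} for every port removal in the trace."""
    removed = defaultdict(list)
    for line in trace.splitlines():
        match = ENTRY.search(line)  # search() tolerates ">>>>> " quote prefixes
        if match:
            removed[match.group(2)].append(int(match.group(3)))
    return dict(removed)


SAMPLE = """\
bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
"""

if __name__ == "__main__":
    for guid, ports in removed_ports(SAMPLE).items():
        print(guid, ports)
```

If only HCA0 was disabled, a second GUID appearing in the result is the symptom described in this message.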