[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?

Smith, Stan stan.smith at intel.com
Fri Feb 20 15:25:48 PST 2009


Hello Leonid,
  Thanks for taking the time to consider this curious ibbus/ipoib problem.

I understand what you are saying w.r.t. the two flows, but I'm still confused: disabling the 1st HCA used to disable only the first two IPoIB instances, not all 4. So I have to ask, what has changed to induce this new behavior? WHQL patches perhaps?

The previous HCA-0 disable behavior, although incorrect for other reasons, disabled IPoIB instances 0 & 1 and correctly left IPoIB instances 2 & 3 alone. The problem was that the user-accessible IBAL device was bound to HCA-0, so IBAL access was lost when HCA-0 was disabled, even though HCA-1 was alive and well; vstat stopped working.
When we discovered this, we talked about implementing a Control Device Object to which the user-accessible IBAL device would be bound, thus allowing HCAs to come and go (disable, enable) without breaking user-mode access to IBAL (provided at least one HCA was enabled).
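
The lifetime rule described above (user-mode IBAL access survives as long as at least one HCA is enabled) can be sketched as a small user-mode model in C. This is only an illustration of the rule, not driver code: the type ctl_dev_t and the ctl_* functions are hypothetical names invented for this sketch and do not exist in ibbus or IBAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HCAS 4

/* Hypothetical model of a control device shared by all HCAs.
 * User-mode access is tied to this object, not to any one HCA. */
typedef struct {
    bool hca_enabled[MAX_HCAS];
    int  enabled_count;
} ctl_dev_t;

void ctl_init(ctl_dev_t *ctl)
{
    for (size_t i = 0; i < MAX_HCAS; i++)
        ctl->hca_enabled[i] = false;
    ctl->enabled_count = 0;
}

/* An HCA starts: record it once, even if enable is reported twice. */
void ctl_hca_enable(ctl_dev_t *ctl, size_t hca)
{
    assert(hca < MAX_HCAS);
    if (!ctl->hca_enabled[hca]) {
        ctl->hca_enabled[hca] = true;
        ctl->enabled_count++;
    }
}

/* An HCA is disabled: only that HCA's slot is cleared. */
void ctl_hca_disable(ctl_dev_t *ctl, size_t hca)
{
    assert(hca < MAX_HCAS);
    if (ctl->hca_enabled[hca]) {
        ctl->hca_enabled[hca] = false;
        ctl->enabled_count--;
    }
}

/* User-mode access (e.g. vstat) works while any HCA remains enabled. */
bool ctl_user_access_ok(const ctl_dev_t *ctl)
{
    return ctl->enabled_count > 0;
}
```

In this model, enabling HCA 0 and HCA 1 and then disabling HCA 0 leaves ctl_user_access_ok() true; access is lost only once HCA 1 is disabled as well.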

With the current HCA-0 disable behavior, a Control Device Object for the user-accessible IBAL device is not required as disabling either HCA device requires a reboot.

About your comment 'And the right solution is to remove this mechanism at all!', are you suggesting a conversion to the KMDF PnP framework?

Thanks,

Stan.


Leonid Keller wrote:
>> Any ideas on the reasons why the 2nd port_mgr_port_remove() call
>> was invoked?
> My guess is that the problem is created by the HCA's mechanism of
> registration for IBAL arrival.
> And the right solution is to remove this mechanism at all!
>
> Here is my theory:
> As a reminder: historically, IBAL sat under ROOT and could be
> loaded after the HCA driver.
> So the HCA driver registered with the OS to be notified on the
> arrival of the IBAL low interface.
> And this code still works!
> What happens in this case?
> Two HCA devices get started and make the above registration.
> When the first IBAL instance is started, both HCA devices get
> notification about it and register themselves with this instance.
> When we disable the first HCA, both HCAs get notified about the
> removal of IBAL (in __pnp_notify_ifc) and deregister themselves from
> IBAL, which removes all IPoIB devices.
>
> So we now have two device-removal flows, which run
> simultaneously.
> 1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
> ibbus!cl_pnp
>         ibbus!__remove
>                 ibbus!cl_do_remove              // Pass the IRP down
>                         mlx4_hca!cl_pnp                 // IRP_MN_REMOVE_DEVICE
>                                 mlx4_hca!__remove
>                                         mlx4_hca!cl_do_remove
>                                                 mlx4_hca!hca_release_resources
>                                                         mlx4_hca!__hca_release_resources
>                                                                 mlx4_hca!__hca_deregister
>                                                                         ibbus!ib_deregister_ca
>
> 2. A flow caused by the notification on IBAL interface removal:
> mlx4_hca!__pnp_notify_ifc
>         ibbus!ib_deregister_ca
>
> Just FYI: the flow of ib_deregister_ca:
> ib_deregister_ca
>         destroying_ci_ca                        // from p_ci_ca->obj.pfn_destroy
>                 sync_destroy_obj
>                         destroy_obj
>                                 destroying_ci_ca
>                                         pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
>                                                 cl_async_proc_queue
> And in another thread:
>         __pnp_process_remove_ca
>                 __pnp_process_remove_port
>                         __pnp_notify_user
>                                 port_mgr_pnp_cb
>                                         port_mgr_port_remove
>
>
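
The theory above can be illustrated with a small user-mode model in C: if the IBAL interface-removal notification reaches every HCA that registered for it, then disabling one HCA deregisters all of them, and every IPoIB instance goes away. This is only a sketch under the assumption that the theory is correct. The names hca_model_t and model_* are hypothetical and are not functions from ibbus, mthca or mlx4_hca; model_deregister_ca merely stands in for the ib_deregister_ca path that ends in port_mgr_port_remove.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_HCAS      2
#define PORTS_PER_HCA 2

/* Hypothetical per-HCA state: CA registration plus one IPoIB
 * instance per port. */
typedef struct {
    bool registered_with_ibal;
    bool ipoib_present[PORTS_PER_HCA];
} hca_model_t;

/* Both HCAs started and registered; 4 IPoIB instances exist. */
void model_init(hca_model_t hcas[NUM_HCAS])
{
    for (size_t i = 0; i < NUM_HCAS; i++) {
        hcas[i].registered_with_ibal = true;
        for (size_t p = 0; p < PORTS_PER_HCA; p++)
            hcas[i].ipoib_present[p] = true;
    }
}

/* Stand-in for the deregister path: removing the CA removes the
 * IPoIB instance on each of its ports. */
void model_deregister_ca(hca_model_t *hca)
{
    hca->registered_with_ibal = false;
    for (size_t p = 0; p < PORTS_PER_HCA; p++)
        hca->ipoib_present[p] = false;
}

/* Flow 2 as theorized: the interface-removal notification is
 * delivered to every registered HCA, whichever one was disabled. */
void model_disable_broadcast(hca_model_t hcas[NUM_HCAS], size_t disabled)
{
    assert(disabled < NUM_HCAS);
    for (size_t i = 0; i < NUM_HCAS; i++)
        model_deregister_ca(&hcas[i]);
}

/* Expected behavior: only the disabled HCA deregisters (flow 1). */
void model_disable_scoped(hca_model_t hcas[NUM_HCAS], size_t disabled)
{
    assert(disabled < NUM_HCAS);
    model_deregister_ca(&hcas[disabled]);
}

/* Number of IPoIB instances still present across all HCAs. */
size_t model_count_ipoib(const hca_model_t hcas[NUM_HCAS])
{
    size_t n = 0;
    for (size_t i = 0; i < NUM_HCAS; i++)
        for (size_t p = 0; p < PORTS_PER_HCA; p++)
            if (hcas[i].ipoib_present[p])
                n++;
    return n;
}
```

Starting from model_init(), model_disable_broadcast(hcas, 0) leaves no IPoIB instances, which matches the reported symptom, while model_disable_scoped(hcas, 0) leaves the two instances on the second HCA.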
>> -----Original Message-----
>> From: Smith, Stan [mailto:stan.smith at intel.com]
>> Sent: Saturday, February 07, 2009 3:05 AM
>> To: Leonid Keller
>> Cc: ofw at lists.openfabrics.org
>> Subject: ibbus disable on HCA0 erroneously removes all IPoIB
>> instances; including IPoIB ports on HCA1 ?
>>
>> Hello,
>>   Recently I discovered some bad HCA disable behavior in a case
>> which used to work correctly.
>>
>> Has the disable behavior for HCA0 been changed recently such
>> that all existing IPoIB instances for all HCAs are removed?
>>
>> Details:
>>
>> For an x86 system using svn.1932 mthca.sys & ibbus.sys with
>> two Mx MT23108 HCAs (1 port active, one port disconnected per
>> HCA), no WSD or WinOF install, just bare mthca, ibbus & IPoIB.
>>
>> When both HCAs are enabled there are 4 IPoIB instances.
>>
>> When the 1st HCA as seen by PNP (HCA0 for discussion
>> purposes) is disabled, all 4 IPoIB instances are removed from
>> the device manager view along with the expected HCA0 disabled.
>> The 2nd HCA (HCA1) is still enabled with no IPoIB instances shown by
>> the device manager.
>>
>> The expected behavior when disabling HCA0 is that the 1st
>> two IPoIB instances [0 & 1] would be removed from the device
>> manager view, with the 2nd two IPoIB instances [3 & 4] remaining.
>> This is the case which exposes the ibbus bug where vstat no
>> longer works because \Devices\ibal has been removed as it's
>> bound to the 1st PNP seen HCA which is now disabled.
>>
>> If you reverse the disable order, such that HCA1 is disabled
>> while HCA0 remains enabled, the expected IPoIB instances [3 &
>> 4] are removed; while instances [0 & 1] remain.
>>
>> The problem occurs when cl_pnp() calls
>> ibbus::port_mgr_pnp_cb() to remove the IPoIB instances for
>> HCA1; the previous call to ibbus::port_mgr_pnp_cb() for HCA0
>> is correct.
>>
>> fdo_query_remove() [
>> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0)
>> bfi-0 CA 8025000002c90200 fdo_query_remove() ]
>> __query_remove() ]
>> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420,
>> ext FAB160E8, status 0
>> cl_pnp(): returned with status 0
>> cl_pnp() ]
>> port_mgr_pnp_cb() [
>> port_mgr_pnp_cb() ]
>> port_mgr_pnp_cb() [
>> port_mgr_port_remove() [
>> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
>> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8,
>> ext 81DA3C90, present 0, missing 0
>> port_mgr_port_remove() ]
>> port_mgr_pnp_cb() ]
>> port_mgr_pnp_cb() [
>> port_mgr_port_remove() [
>> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
>> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350,
>> ext 81DA3408, present 0, missing 0
>> port_mgr_port_remove() ]
>> port_mgr_pnp_cb() ]
>> iou_mgr_pnp_cb() [
>> iou_mgr_iou_remove() [
>> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
>> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0, missing 1 .
>> iou_mgr_iou_remove() ]
>> iou_mgr_pnp_cb() ]
>>
>> XXX - this PNP call for HCA1 should not have occurred when disabling
>> HCA0.
>>
>> port_mgr_port_remove() [
>> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
>> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8,
>> ext FED75E90, present 0, missing 0
>> port_mgr_port_remove() ]
>> port_mgr_pnp_cb() ]
>> port_mgr_pnp_cb() [
>> port_mgr_port_remove() [
>> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
>> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8,
>> ext FF881E90, present 0, missing 0
>> port_mgr_port_remove() ]
>> port_mgr_pnp_cb() ]
>> iou_mgr_pnp_cb() [
>> iou_mgr_iou_remove() [
>> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
>> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0, missing 1 .
>> iou_mgr_iou_remove() ]
>> iou_mgr_pnp_cb() ]
>>
>> XXX end of badness...
>>
>> Any ideas on the reasons why the 2nd port_mgr_port_remove()
>> call was invoked?
>> Is there some binding between HCA1 IPoIB ports and HCA0?
>>
>> Thanks,
>>
>> Stan.



