[ofw] RE: ibbus disable on HCA0 erroneously removes all IPoIB instances; including IPoIB ports on HCA1 ?

Leonid Keller leonid at mellanox.co.il
Tue Feb 17 10:31:08 PST 2009


> Any ideas on the reasons why the 2nd port_mgr_port_remove() call was
> invoked?
My guess is that the problem is caused by the HCA driver's mechanism of
registering for IBAL arrival notifications.
And the right solution is to remove this mechanism altogether!

Here is my theory.
As a reminder: historically, IBAL sat under ROOT and could be loaded
after the HCA driver.
So the HCA driver registered with the OS to be notified of the arrival
of IBAL's low-level interface.
That code is still there and still runs!
What happens in this case?
Two HCA devices get started and each makes the above registration.
When the first IBAL instance is started, both HCA devices are notified
and register themselves with this instance.
When we disable the first HCA, both HCAs are notified of the removal of
IBAL (in __pnp_notify_ifc) and each deregisters its HCA from IBAL,
which removes all IPoIB devices.
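
The theory above can be illustrated with a toy model. This is only a sketch of the described behavior, not driver code: the classes Ibal, InterfaceNotifier and Hca, the event names, and the assumption of two IPoIB ports per HCA are all hypothetical stand-ins chosen for illustration.

```python
class Ibal:
    """Stands in for the single IBAL instance; tracks IPoIB ports per CA."""

    def __init__(self):
        self.ipoib_ports = {}  # CA name -> list of IPoIB port names

    def register_ca(self, ca, num_ports=2):
        self.ipoib_ports[ca] = [f"{ca}-ipoib{p}" for p in range(1, num_ports + 1)]

    def deregister_ca(self, ca):
        # Deregistering a CA removes its IPoIB ports (no-op if already gone).
        return self.ipoib_ports.pop(ca, [])


class InterfaceNotifier:
    """Stands in for the OS notification on IBAL interface arrival/removal."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, hca):
        self.subscribers.append(hca)

    def broadcast(self, event, ibal):
        # The event is not filtered per HCA: every subscriber receives it.
        for hca in self.subscribers:
            hca.on_interface_event(event, ibal)


class Hca:
    def __init__(self, name):
        self.name = name

    def on_interface_event(self, event, ibal):
        if event == "arrival":
            ibal.register_ca(self.name)
        elif event == "removal":
            ibal.deregister_ca(self.name)


def disable_first_hca():
    """Return the IPoIB port count before and after disabling HCA0."""
    ibal, notifier = Ibal(), InterfaceNotifier()
    for hca in (Hca("HCA0"), Hca("HCA1")):
        notifier.subscribe(hca)
    notifier.broadcast("arrival", ibal)   # both HCAs register with IBAL
    before = sum(len(p) for p in ibal.ipoib_ports.values())
    notifier.broadcast("removal", ibal)   # triggered by disabling HCA0 only
    after = sum(len(p) for p in ibal.ipoib_ports.values())
    return before, after
```

In this model, disable_first_hca() goes from four IPoIB ports to none, because the removal event reaches HCA1 as well as HCA0. That matches the reported symptom under the stated assumptions.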

So we now have two device-removal flows running simultaneously.
1. The normal PnP flow, caused by IRP_MN_REMOVE_DEVICE:
ibbus!cl_pnp
	ibbus!__remove
		ibbus!cl_do_remove		// Pass the IRP down
			mlx4_hca!cl_pnp			// IRP_MN_REMOVE_DEVICE
				mlx4_hca!__remove
					mlx4_hca!cl_do_remove
						mlx4_hca!hca_release_resources
							mlx4_hca!__hca_release_resources
								mlx4_hca!__hca_deregister
									ibbus!ib_deregister_ca

2. A flow caused by the notification of the IBAL interface removal:

mlx4_hca!__pnp_notify_ifc
	ibbus!ib_deregister_ca

Just FYI: the flow of ib_deregister_ca:
ib_deregister_ca
	destroying_ci_ca			// from p_ci_ca->obj.pfn_destroy
		sync_destroy_obj
			destroy_obj
				destroying_ci_ca
					pnp_ca_event( p_ci_ca, IB_PNP_CA_REMOVE );
						cl_async_proc_queue
And in another thread:
	__pnp_process_remove_ca		
		__pnp_process_remove_port
			__pnp_notify_user
				port_mgr_pnp_cb
					port_mgr_port_remove
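
The handoff in this flow (the deregistering thread only queues the CA-remove event; a separate thread later fans it out into per-port removals) can be sketched roughly as follows. This is a simplified model, not the IBAL implementation: run_remove_flow, port_remove_cb and worker are hypothetical names, and a Python queue stands in for cl_async_proc_queue.

```python
import queue
import threading


def run_remove_flow(ca_guid, port_nums):
    """Queue a CA-remove event and let a worker thread fan it out per port."""
    events = queue.Queue()
    removed = []  # (ca_guid, port_num) pairs seen by the port callback

    def port_remove_cb(guid, port_num):
        # Stand-in for port_mgr_pnp_cb -> port_mgr_port_remove.
        removed.append((guid, port_num))

    def worker():
        # Stand-in for the asynchronous processing thread.
        while True:
            item = events.get()
            if item is None:  # sentinel: stop the worker
                break
            guid, ports = item
            for port_num in ports:  # one port-remove notification per port
                port_remove_cb(guid, port_num)

    thread = threading.Thread(target=worker)
    thread.start()
    # Stand-in for ib_deregister_ca: it only queues the event and returns.
    events.put((ca_guid, list(port_nums)))
    events.put(None)
    thread.join()
    return removed
```

Because the caller only queues the event, the port removals run later and on a different thread than the one that asked for the deregistration. That is why the port_mgr_port_remove() calls in the log are not visibly nested under the cl_pnp() call that triggered them.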


> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com] 
> Sent: Saturday, February 07, 2009 3:05 AM
> To: Leonid Keller
> Cc: ofw at lists.openfabrics.org
> Subject: ibbus disable on HCA0 erroneously removes all IPoIB 
> instances; including IPoIB ports on HCA1 ?
> 
> Hello,
>   Recently I discovered some bad HCA disable behavior which 
> used to work correctly?
> 
> Has the disable behavior for HCA0 been changed recently such 
> that all existing IPoIB instances for all HCAs are removed?
> 
> Details:
> 
> For an x86 system using svn.1932 mthca.sys & ibbus.sys with 
> two Mx MT23108 HCAs (1 port active, one port disconnected per 
> HCA), no WSD or WinOF install, just bare mthca, ibbus & IPoIB.
> 
> When both HCAs are enabled there are 4 IPoIB instances.
> 
> When the 1st HCA as seen by PNP (HCA0 for discussion 
> purposes) is disabled, all 4 IPoIB instances are removed from 
> the device manager view along with the expected HCA0 disabled.
> The 2nd HCA (HCA1) is still enabled with no IPoIB instances 
> shown by the device manager.
> 
> The expected behavior when disabling HCA0 should be the 1st 
> two IPoIB instances [0 & 2] would be removed from the device 
> manager view, with the 2nd two IPoIB instances [3 & 4] remaining.
> This is the case which exposes the ibbus bug where vstat no 
> longer works because \Devices\ibal has been removed as it's 
> bound to the 1st PNP seen HCA which is now disabled.
> 
> If you reverse the disable order, such that HCA1 is disabled 
> while HCA0 remains enabled, the expected IPoIB instances [3 & 
> 4] are removed; while instances [0 & 1] remain.
> 
> The problem occurs when cl_pnp() calls 
> ibbus::port_mgr_pnp_cb() to remove the IPoIB instances for 
> HCA1; the previous call to ibbus::port_mgr_pnp_cb() for HCA0 
> is correct.
> 
> fdo_query_remove() [
> IRP_MN_QUERY_REMOVE_DEVICE IB Bus @ FDO FAB160E8 refs(CI 0 AL 0)
>    bfi-0 CA 8025000002c90200
> fdo_query_remove() ]
> __query_remove() ]
> cl_pnp(): IrpSkip/IrpIgnore: skipping down to PDO 81DDD420, 
> ext FAB160E8, status 0
> cl_pnp(): returned with status 0
> cl_pnp() ]
> port_mgr_pnp_cb() [
> port_mgr_pnp_cb() ]
> port_mgr_pnp_cb() [
> port_mgr_port_remove() [
> bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 81D46008
> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3BD8, 
> ext 81DA3C90, present 0, missing 0
> port_mgr_port_remove() ]
> port_mgr_pnp_cb() ]
> port_mgr_pnp_cb() [
> port_mgr_port_remove() [
> bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 81D46008
> port_mgr_port_remove(): Mark removing IPoIB: PDO 81DA3350, 
> ext 81DA3408, present 0, missing 0
> port_mgr_port_remove() ]
> port_mgr_pnp_cb() ]
> iou_mgr_pnp_cb() [
> iou_mgr_iou_remove() [
> bfi-0 ca_guid 0x8025000002c90200 iou_mgr FED74310
> iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF58B7B8, present 0, 
> missing 1 .
> iou_mgr_iou_remove() ]
> iou_mgr_pnp_cb() ]
> 
> XXX - this PNP call for HCA1 should not have occurred when 
> disabling HCA0.
> 
> port_mgr_port_remove() [
> bfi-1 ca_guid 0xa425000002c90200 port_num 1 port_mgr 82030F40
> port_mgr_port_remove(): Mark removing IPoIB: PDO FED75DD8, 
> ext FED75E90, present 0, missing 0
> port_mgr_port_remove() ]
> port_mgr_pnp_cb() ]
> port_mgr_pnp_cb() [
> port_mgr_port_remove() [
> bfi-1 ca_guid 0xa425000002c90200 port_num 2 port_mgr 82030F40
> port_mgr_port_remove(): Mark removing IPoIB: PDO FF881DD8, 
> ext FF881E90, present 0, missing 0
> port_mgr_port_remove() ]
> port_mgr_pnp_cb() ]
> iou_mgr_pnp_cb() [
> iou_mgr_iou_remove() [
> bfi-1 ca_guid 0xa425000002c90200 iou_mgr 821A8A80
> iou_mgr_iou_remove(): bfi-1 IB IOU: ext FAC24620, present 0, 
> missing 1 .
> iou_mgr_iou_remove() ]
> iou_mgr_pnp_cb() ]
> 
> XXX end of badness...
> 
> Any ideas on the reasons why the 2nd port_mgr_port_remove() 
> call was invoked?
> Is there some binding between HCA1 IPoIB ports and HCA0?
> 
> Thanks,
> 
> Stan.
> 


