[ofw] issue with checkin# 3122
Leonid Keller
leonid at mellanox.co.il
Mon Jul 25 09:58:17 PDT 2011
SB
From: Smith, Stan [mailto:stan.smith at intel.com]
Sent: Monday, July 25, 2011 7:46 PM
To: Leonid Keller; Uri Habusha
Cc: ofw at lists.openfabrics.org; Tzachi Dar; Gilad Margalit; Benyahu Mizrahi
Subject: RE: issue with checkin# 3122
Thanks Leo; a quick clarification.
Uri claimed he disabled the mlx4_bus driver and you stated
[Leonid Keller] It was upon power-down IRP. The below code was called ultimately from __device_power_down_workItem():
__device_power_down_workItem
__deregister_ca
...
Is that to say the sequence of events was:
1) Disable mlx4_bus driver
2) Start system shutdown?
Or was the device power down workItem invoked due to the mlx4_bus disable?
Please clarify.
[Leonid Keller] There were no manual actions such as a driver disable. It happened during a regression run.
I guess the regression script installed the drivers and then tried to restart the computer.
The CL_ASSERT() has been in the source base for over two years.
My point is that I strongly suspect the CL_ASSERT() would have fired prior to commit 3122, given the circumstances (Svr2008 R2, checked ibbus driver, ioc_poll_interval=0).
We all understand there are existing race conditions in the IB stack w.r.t. power-down and/or device disable.
Nonetheless, I will look into the situation.
Stan.
From: Leonid Keller [mailto:leonid at mellanox.co.il]
Sent: Sunday, July 24, 2011 5:36 AM
To: Uri Habusha; Smith, Stan
Cc: ofw at lists.openfabrics.org; Tzachi Dar; Gilad Margalit; Benyahu Mizrahi
Subject: RE: issue with checkin# 3122
SB
From: Smith, Stan [mailto:stan.smith at intel.com]
Sent: Friday, July 22, 2011 1:23 AM
To: Uri Habusha
Cc: ofw at lists.openfabrics.org; Leonid Keller; Tzachi Dar; Gilad Margalit; Benyahu Mizrahi
Subject: RE: issue with checkin# 3122
Thanks for the note, I'll look into it ASAP.
Some questions:
When you mention the low-level driver, are you speaking of the HCA driver?
[Leonid Keller] It was upon power-down IRP. The below code was called ultimately from __device_power_down_workItem():
__device_power_down_workItem
__deregister_ca
...
Does the CL_ASSERT() fire if the registry entry ioc_poll_interval == 30000?
[Leonid Keller] Dunno. We have ioc_poll_interval = 0
Is the system which CL_ASSERT()'ed attached to a fabric which has an IB target device (IOC/IOU)?
[Leonid Keller] No
What user-level events occurred before the HCA was disabled? Anything interesting? Perhaps a 'devcon rescan'?
[Leonid Keller] Nothing. The driver was installed and then, I believe, the machine was restarted.
Maybe you can reproduce it by using pwrtest.exe from the WDK.
If p_ctx is NULL then the port object has been destroyed. The issue, I'm guessing at this point, might be a race between PNP and port_destroy()?
A missing lock call perhaps?
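A minimal sketch of the kind of locking guessed at above, not taken from the OFW source. All names here (port_slot_t, slot_destroy, slot_on_port_remove) are hypothetical, and a pthread mutex stands in for whatever lock the driver would actually use. The destroy path and the PnP callback path take the same lock, so the context cannot disappear between the NULL check and its use:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the actual OFW types or functions. */
typedef struct port_ctx { int active_ports; } port_ctx_t;

typedef struct port_slot {
    pthread_mutex_t lock;   /* stand-in for the driver's own lock */
    port_ctx_t     *p_ctx;  /* NULL once the port has been destroyed */
} port_slot_t;

/* Destroy path: detach the context while holding the lock. */
static void slot_destroy(port_slot_t *slot)
{
    pthread_mutex_lock(&slot->lock);
    slot->p_ctx = NULL;
    pthread_mutex_unlock(&slot->lock);
}

/* PnP callback path: check and use the context under the same lock, so
 * the destroy path cannot run between the NULL check and the dereference.
 * Returns 1 if the context was still present, 0 if already destroyed. */
static int slot_on_port_remove(port_slot_t *slot)
{
    int handled = 0;

    pthread_mutex_lock(&slot->lock);
    if (slot->p_ctx != NULL) {
        if (slot->p_ctx->active_ports > 0)
            slot->p_ctx->active_ports--;
        handled = 1;
    }
    pthread_mutex_unlock(&slot->lock);
    return handled;
}
```

With this arrangement a NULL context in the callback is an expected outcome of losing the race to the destroy path, rather than a condition to assert on.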
Will let you know.
Stan.
From: Uri Habusha [mailto:urih at mellanox.co.il]
Sent: Thursday, July 21, 2011 4:49 AM
To: Smith, Stan
Cc: ofw at lists.openfabrics.org; Leonid Keller; Tzachi Dar; Gilad Margalit; Benyahu Mizrahi
Subject: issue with checkin# 3122
Hi Stan,
I adopted your checkin# 3122 - IOC poll on demand.
When disabling the driver, an ASSERT pops up. The ASSERT is in the following code in the port_mgr_pnp_cb function:
    CL_ASSERT( p_ctx ); <== The problematic assert
    if (p_ctx)
    {
        p_bfi = p_ctx->p_bus_filter;
        CL_ASSERT( p_bfi );
        if (p_bfi->p_port_mgr->active_ports > 0)
            cl_atomic_dec( &p_bfi->p_port_mgr->active_ports );
    }
    port_mgr_port_remove( (ib_pnp_port_rec_t*)p_pnp_rec );
    break;
I noticed that port_mgr_port_remove checks whether the ctx is valid, so I guess this is a known issue that can happen. For now, I removed the assert in our code.
Please take a look at the code and see whether this is a valid fix (if so, please change the ofw code accordingly) or debug the issue. It happens when disabling/enabling the low-level driver.
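A minimal sketch of the workaround described above: the assert is dropped and a missing context is tolerated. The types below are simplified stand-ins, not the actual OFW structures, and on_port_remove is a hypothetical name. A plain decrement takes the place of cl_atomic_dec so the example is self-contained:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real structures; only the fields
 * used in the quoted fragment are modeled. */
typedef struct port_mgr   { int active_ports; } port_mgr_t;
typedef struct bus_filter { port_mgr_t *p_port_mgr; } bus_filter_t;
typedef struct port_ctx   { bus_filter_t *p_bus_filter; } port_ctx_t;

/* Handle a port-remove event without asserting on a missing context.
 * Returns 1 if the active-port count was decremented, 0 otherwise. */
static int on_port_remove(port_ctx_t *p_ctx)
{
    bus_filter_t *p_bfi;

    if (p_ctx == NULL)
        return 0;   /* port object already destroyed; nothing to count */

    p_bfi = p_ctx->p_bus_filter;
    if (p_bfi == NULL || p_bfi->p_port_mgr == NULL)
        return 0;

    if (p_bfi->p_port_mgr->active_ports > 0) {
        /* The driver uses an atomic decrement here. */
        p_bfi->p_port_mgr->active_ports--;
        return 1;
    }
    return 0;
}
```

This only silences the symptom. Whether a NULL context at this point is legitimate or the sign of a race is the open question in this thread.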
Uri
Uri Habusha
Windows SW Development Lead
Mellanox Technologies
P.O. Box 586, Yokneam 20692
Israel