<br><font size=2 face="sans-serif">I may have hit the same problem: the netdev reference
counting problem with ib_at that Roland pointed out a week ago.
The difference is that I tried to remove ib_mthca, not ib_ipoib. The process
hung in the kernel and couldn't be recovered. </font>
<br>
<br><font size=2 face="sans-serif">The counter would go to -1 if the interface
is brought down first.<br>
</font>
<br><font size=2 face="sans-serif">With both ib_at and ib_uat loaded,
when the ipoib module is removed without bringing the interface down, the reference
count is 2; with the interface brought down first, the reference count is -3. ib_devs_changed()
doesn't handle these events correctly.</font>
<br>
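<br><font size=2 face="sans-serif">To illustrate the kind of imbalance I mean, here is a minimal userspace sketch. It is only a model under my own assumptions, not the ib_at code: fake_netdev, dev_track, track_up() and track_release() are hypothetical names, and I have not confirmed that ib_devs_changed() fails in exactly this way. The idea is that a consumer records whether it currently owns a reference, so repeated "up" events take one reference and a "down" followed by an "unregister" drops it only once.</font>
<br>

```c
#include <stdbool.h>

/* Hypothetical stand-in for a net device; NOT the kernel's struct net_device. */
struct fake_netdev {
    int refcnt;                 /* models the device reference count */
};

/* Hypothetical per-device entry kept by a consumer module. */
struct dev_track {
    struct fake_netdev *dev;
    bool holding;               /* true only while we own exactly one reference */
};

/* Userspace models of taking and dropping a device reference. */
static void fake_dev_hold(struct fake_netdev *dev) { dev->refcnt++; }
static void fake_dev_put(struct fake_netdev *dev)  { dev->refcnt--; }

/* Take a reference once, no matter how many "up" events arrive. */
static void track_up(struct dev_track *t)
{
    if (!t->holding) {
        fake_dev_hold(t->dev);
        t->holding = true;
    }
}

/* Drop the reference once, for "down" or "unregister", whichever comes first. */
static void track_release(struct dev_track *t)
{
    if (t->holding) {
        fake_dev_put(t->dev);
        t->holding = false;
    }
}
```

<br><font size=2 face="sans-serif">In this model, calling track_up() twice and then track_release() twice returns the count to its starting value. Without the holding flag, the same sequence of events could leave the count too high or drive it negative, depending on the order.</font>
<br>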
<br><font size=2 face="sans-serif">The workaround for now is to always remove
ib_at/ib_uat before removing ib_ipoib/ib_mthca.</font>
<br>
<br><font size=2 face="sans-serif">Thanks<br>
Shirley Ma<br>
IBM Linux Technology Center<br>
15300 SW Koll Parkway<br>
Beaverton, OR 97006-6063<br>
Phone(Fax): (503) 578-7638<br>
<br>
</font>
<br>
<br>
<br>
<table width=100%>
<tr valign=top>
<td width=40%><font size=1 face="sans-serif"><b>Grant Grundler <iod00d@hp.com></b>
</font>
<p><font size=1 face="sans-serif">09/14/2005 04:17 PM</font>
<td width=59%>
<table width=100%>
<tr>
<td>
<div align=right><font size=1 face="sans-serif">To</font></div>
<td valign=top><font size=1 face="sans-serif">"Michael S. Tsirkin"
<mst@mellanox.co.il></font>
<tr>
<td>
<div align=right><font size=1 face="sans-serif">cc</font></div>
<td valign=top><font size=1 face="sans-serif">Grant Grundler <iod00d@hp.com>,
Shirley Ma/Beaverton/IBM@IBMUS, openib-general@openib.org</font>
<tr>
<td>
<div align=right><font size=1 face="sans-serif">Subject</font></div>
<td valign=top><font size=1 face="sans-serif">Re: Mellanox device in INIT
state</font></table>
<br>
<table>
<tr valign=top>
<td>
<td></table>
<br></table>
<br>
<br>
<br><font size=2><tt>On Wed, Sep 14, 2005 at 11:26:10AM +0300, Michael
S. Tsirkin wrote:<br>
> Seems to be a previous memory corruption that is biting us now.<br>
> Looks like prot->rsk_prot isn't NULL, and prot->name seems to<br>
> point to zeroed memory. Grant, is this reproducible?<br>
<br>
Yes - I think so. At least SDP is generating a segfault/stack<br>
trace to the console when it's loaded.<br>
<br>
Now that I'm recording the failures, I'm not certain the previous<br>
two failures were the same.<br>
<br>
> If so, could you please try running with the following patch,<br>
> and see what does it print?<br>
<br>
yup<br>
<br>
> MST<br>
> <br>
> Index: linux-2.6.13/drivers/infiniband/ulp/sdp/sdp_inet.c<br>
> ===================================================================<br>
> --- linux-2.6.13.orig/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-09-11 12:36:48.000000000 +0300<br>
> +++ linux-2.6.13/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-09-14 13:14:35.000000000 +0300<br>
> @@ -1321,6 +1321,11 @@ static int __init sdp_init(void)<br>
> <br>
> sdp_dbg_init("SDP module load.");<br>
> <br>
> + printk("sdp_sk_proto.name = %s\n", sdp_sk_proto.name);<br>
> + printk("sdp_sk_proto.obj_size = %lld\n", (long long)sdp_sk_proto.obj_size);<br>
> + printk("sdp_init in_interrupt = %d\n", in_interrupt());<br>
> + printk("sdp_init prot->rsk_prot = %p\n", prot->rsk_prot);<br>
<br>
The last printk failed to compile:<br>
vers/infiniband/ulp/sdp/sdp_inet.c:1327: error: 'proto' undeclared (first
use in this function)<br>
<br>
I assume that was intended to be "sdp_sk_proto.rsk_prot".<br>
<br>
Output follows - but with a different failure this time.<br>
Something weird is definitely going on.<br>
<br>
gsyprf3:/usr/src/linux-2.6.13# reload_ib <br>
+ IPoIB=51<br>
+ ifconfig ib0 down<br>
ib0: ERROR while getting interface flags: No such device<br>
+ ifconfig ib1 down<br>
ib1: ERROR while getting interface flags: No such device<br>
+ rmmod ib_ipoib ib_uverbs ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core<br>
ERROR: Module ib_ipoib does not exist in /proc/modules<br>
ERROR: Module ib_uverbs does not exist in /proc/modules<br>
ERROR: Module ib_sdp does not exist in /proc/modules<br>
ERROR: Module ib_cm does not exist in /proc/modules<br>
ERROR: Module ib_sa does not exist in /proc/modules<br>
ERROR: Module ib_mthca does not exist in /proc/modules<br>
ERROR: Module ib_mad does not exist in /proc/modules<br>
ERROR: Module ib_core does not exist in /proc/modules<br>
+ modprobe ib_mthca msi_x=1<br>
ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)<br>
ib_mthca: Initializing ((¥)<br>
GSI 60 (level, low) -> CPU 0 (0x0000) vector 69<br>
ACPI: PCI Interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 69<br>
(¥: Missing DCS, aborting.<br>
ACPI: PCI interrupt for device 0000:81:00.0 disabled<br>
GSI 60 (level, low) -> CPU 0 (0x0000) vector 69 unregistered<br>
+ modprobe ib_ipoib<br>
+ modprobe ib_sdp<br>
sdp_sk_proto.name = SDP<br>
sdp_sk_proto.obj_size = 1744<br>
sdp_init in_interrupt = 0<br>
sdp_init prot->rsk_prot = 0000000000000000<br>
Uninitialised timer!<br>
This is just a warning. Your computer is OK<br>
function=0xa0000001008ac990, data=0xa00000020021b600<br>
<br>
Call Trace:<br>
[<a000000100012840>] show_stack+0x80/0xa0<br>
sp=e000004041267c50 bsp=e000004041260fe0<br>
[<a0000001000131d0>] dump_stack+0x30/0x60<br>
sp=e000004041267e20 bsp=e000004041260fc8<br>
[<a0000001000b6f80>] check_timer_failed+0xe0/0x120<br>
sp=e000004041267e20 bsp=e000004041260fa8<br>
[<a0000001000b86a0>] __mod_timer+0x60/0x200<br>
sp=e000004041267e20 bsp=e000004041260f68<br>
[<a0000001000cbfb0>] queue_delayed_work+0x110/0x1c0<br>
sp=e000004041267e30 bsp=e000004041260f38<br>
[<a00000020020bd00>] sdp_link_addr_init+0x1a0/0x3e0 [ib_sdp]<br>
sp=e000004041267e30 bsp=e000004041260f10<br>
[<a000000200138160>] sdp_init+0x160/0x900 [ib_sdp]<br>
sp=e000004041267e30 bsp=e000004041260ee8<br>
[<a0000001000e7ca0>] sys_init_module+0x2e0/0x680<br>
sp=e000004041267e30 bsp=e000004041260e60<br>
[<a00000010000b700>] ia64_ret_from_syscall+0x0/0x20<br>
sp=e000004041267e30 bsp=e000004041260e60<br>
[<a000000000010620>] __kernel_syscall_via_break+0x0/0x20<br>
sp=e000004041268000 bsp=e000004041260e60<br>
<br>
[ console hangs ]<br>
<br>
I can't abort/interrupt the modprobe command and it's not segfaulting<br>
this time. "ps -ef" shows (among other things):<br>
grundler@gsyprf3:~$ ps -ef <br>
UID PID PPID C STIME TTY TIME CMD<br>
root 1 0 0 15:32 ? 00:00:04 init [2] <br>
...<br>
root 3972 2250 0 15:58 ttyS3 00:00:00 /bin/sh -x /usr/local/bin/reload<br>
root 3998 9 0 15:58 ? 00:00:00 [ipoib]<br>
root 3999 3972 99 15:58 ttyS3 00:08:30 modprobe ib_sdp<br>
root 4003 9 0 15:58 ? 00:00:00 [ib_cm/0]<br>
root 4004 9 0 15:58 ? 00:00:00 [ib_cm/1]<br>
root 4008 9 0 15:58 ? 00:00:00 [sdp_wq/0]<br>
root 4009 9 0 15:58 ? 00:00:00 [sdp_wq/1]<br>
...<br>
<br>
grundler@gsyprf3:~$ ps aux<br>
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND<br>
root 1 0.2 0.0 3440 1328 ? S 15:32 0:04 init [2]<br>
...<br>
root 3972 0.0 0.1 5584 2624 ttyS3 S+ 15:58 0:00 /bin/sh -x /usr<br>
root 3998 0.0 0.0 0 0 ? S< 15:58 0:00 [ipoib]<br>
root 3999 99.9 0.2 6624 4592 ttyS3 R+ 15:58 9:50 modprobe ib_sdp<br>
root 4003 0.0 0.0 0 0 ? S< 15:58 0:00 [ib_cm/0]<br>
root 4004 0.0 0.0 0 0 ? S< 15:58 0:00 [ib_cm/1]<br>
root 4008 0.0 0.0 0 0 ? S< 15:58 0:00 [sdp_wq/0]<br>
root 4009 0.0 0.0 0 0 ? S< 15:58 0:00 [sdp_wq/1]<br>
...<br>
<br>
"kill -9 3999" didn't have the intended effect either.<br>
<br>
<br>
I'll rebuild with SDP_DEBUG options and see if that changes<br>
it yet again.<br>
<br>
grant<br>
</tt></font>
<br>