[openib-general] Re: Mellanox device in INIT state
Shirley Ma
xma at us.ibm.com
Fri Sep 16 13:29:26 PDT 2005
I may have hit this problem: the netdev reference counting problem with
ib_at that Roland pointed out a week ago. The difference is that I tried to
remove ib_mthca, not ib_ipoib. The process hung in the kernel and could not
recover.
The counter goes to -1 if the interface is brought down first.
With both ib_at and ib_uat loaded, removing the ipoib module without
bringing the interface down leaves the reference count at 2; with the
interface down, it is -3. ib_devs_changed() doesn't handle these events
correctly.
The workaround for now is to always remove ib_at/ib_uat before removing
ib_ipoib/ib_mthca.
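The ordering in this workaround can be sketched as a small shell helper. This is only an illustration: the function names `run` and `unload_ib_stack` and the `DRY_RUN` variable are made up here, and removing ib_uat before ib_at is an assumption (that ib_uat depends on ib_at); the message itself only says that both go before ib_ipoib/ib_mthca.

```shell
# Sketch of the unload order described above. DRY_RUN is a hypothetical
# switch; it defaults to 1 so that nothing is actually removed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove the address-translation modules first, then IPoIB, then the HCA
# driver. ib_uat before ib_at is an assumption about module dependencies.
unload_ib_stack() {
    for mod in ib_uat ib_at ib_ipoib ib_mthca; do
        run rmmod "$mod" || return 1
    done
}
```

Calling `unload_ib_stack` with `DRY_RUN=1` only prints the four `rmmod` commands in order; with `DRY_RUN=0` (as root) it runs them and stops at the first failure.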
Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Grant Grundler <iod00d at hp.com>
09/14/2005 04:17 PM
To: "Michael S. Tsirkin" <mst at mellanox.co.il>
cc: Grant Grundler <iod00d at hp.com>, Shirley Ma/Beaverton/IBM at IBMUS, openib-general at openib.org
Subject: Re: Mellanox device in INIT state
On Wed, Sep 14, 2005 at 11:26:10AM +0300, Michael S. Tsirkin wrote:
> Seems to be a previous memory corruption that is biting us now.
> Looks like prot->rsk_prot isn't NULL, and prot->name seems to
> point to zeroed memory. Grant, is this reproducible?
Yes - I think so. At least SDP is generating a segfault/stack
trace to the console when it's loaded.
Now that I'm recording the failures, I'm not certain the previous
two failures were the same.
> If so, could you please try running with the following patch,
> and see what it prints?
yup
> MST
>
> Index: linux-2.6.13/drivers/infiniband/ulp/sdp/sdp_inet.c
> ===================================================================
> --- linux-2.6.13.orig/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-09-11 12:36:48.000000000 +0300
> +++ linux-2.6.13/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-09-14 13:14:35.000000000 +0300
> @@ -1321,6 +1321,11 @@ static int __init sdp_init(void)
>
> sdp_dbg_init("SDP module load.");
>
> + printk("sdp_sk_proto.name = %s\n", sdp_sk_proto.name);
> + printk("sdp_sk_proto.obj_size = %lld\n", (long long)sdp_sk_proto.obj_size);
> + printk("sdp_init in_interrupt = %d\n", in_interrupt());
> + printk("sdp_init prot->rsk_prot = %p\n", prot->rsk_prot);
The last printk failed to compile:
vers/infiniband/ulp/sdp/sdp_inet.c:1327: error: 'proto' undeclared (first use in this function)
I assume that was intended to be "sdp_sk_proto.rsk_prot".
Output follows - but with a different failure this time.
Something weird is definitely going on.
gsyprf3:/usr/src/linux-2.6.13# reload_ib
+ IPoIB=51
+ ifconfig ib0 down
ib0: ERROR while getting interface flags: No such device
+ ifconfig ib1 down
ib1: ERROR while getting interface flags: No such device
+ rmmod ib_ipoib ib_uverbs ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core
ERROR: Module ib_ipoib does not exist in /proc/modules
ERROR: Module ib_uverbs does not exist in /proc/modules
ERROR: Module ib_sdp does not exist in /proc/modules
ERROR: Module ib_cm does not exist in /proc/modules
ERROR: Module ib_sa does not exist in /proc/modules
ERROR: Module ib_mthca does not exist in /proc/modules
ERROR: Module ib_mad does not exist in /proc/modules
ERROR: Module ib_core does not exist in /proc/modules
+ modprobe ib_mthca msi_x=1
ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
ib_mthca: Initializing ((¥)
GSI 60 (level, low) -> CPU 0 (0x0000) vector 69
ACPI: PCI Interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 69
(¥: Missing DCS, aborting.
ACPI: PCI interrupt for device 0000:81:00.0 disabled
GSI 60 (level, low) -> CPU 0 (0x0000) vector 69 unregistered
+ modprobe ib_ipoib
+ modprobe ib_sdp
sdp_sk_proto.name = SDP
sdp_sk_proto.obj_size = 1744
sdp_init in_interrupt = 0
sdp_init prot->rsk_prot = 0000000000000000
Uninitialised timer!
This is just a warning. Your computer is OK
function=0xa0000001008ac990, data=0xa00000020021b600
Call Trace:
[<a000000100012840>] show_stack+0x80/0xa0
sp=e000004041267c50 bsp=e000004041260fe0
[<a0000001000131d0>] dump_stack+0x30/0x60
sp=e000004041267e20 bsp=e000004041260fc8
[<a0000001000b6f80>] check_timer_failed+0xe0/0x120
sp=e000004041267e20 bsp=e000004041260fa8
[<a0000001000b86a0>] __mod_timer+0x60/0x200
sp=e000004041267e20 bsp=e000004041260f68
[<a0000001000cbfb0>] queue_delayed_work+0x110/0x1c0
sp=e000004041267e30 bsp=e000004041260f38
[<a00000020020bd00>] sdp_link_addr_init+0x1a0/0x3e0 [ib_sdp]
sp=e000004041267e30 bsp=e000004041260f10
[<a000000200138160>] sdp_init+0x160/0x900 [ib_sdp]
sp=e000004041267e30 bsp=e000004041260ee8
[<a0000001000e7ca0>] sys_init_module+0x2e0/0x680
sp=e000004041267e30 bsp=e000004041260e60
[<a00000010000b700>] ia64_ret_from_syscall+0x0/0x20
sp=e000004041267e30 bsp=e000004041260e60
[<a000000000010620>] __kernel_syscall_via_break+0x0/0x20
sp=e000004041268000 bsp=e000004041260e60
[ console hangs ]
I can't abort/interrupt the modprobe command and it's not segfaulting
this time. "ps -ef" shows (among other things):
grundler at gsyprf3:~$ ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 15:32 ? 00:00:04 init [2]
...
root 3972 2250 0 15:58 ttyS3 00:00:00 /bin/sh -x /usr/local/bin/reload
root 3998 9 0 15:58 ? 00:00:00 [ipoib]
root 3999 3972 99 15:58 ttyS3 00:08:30 modprobe ib_sdp
root 4003 9 0 15:58 ? 00:00:00 [ib_cm/0]
root 4004 9 0 15:58 ? 00:00:00 [ib_cm/1]
root 4008 9 0 15:58 ? 00:00:00 [sdp_wq/0]
root 4009 9 0 15:58 ? 00:00:00 [sdp_wq/1]
...
grundler at gsyprf3:~$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.2 0.0 3440 1328 ? S 15:32 0:04 init [2]
...
root 3972 0.0 0.1 5584 2624 ttyS3 S+ 15:58 0:00 /bin/sh -x /usr
root 3998 0.0 0.0 0 0 ? S< 15:58 0:00 [ipoib]
root 3999 99.9 0.2 6624 4592 ttyS3 R+ 15:58 9:50 modprobe ib_sdp
root 4003 0.0 0.0 0 0 ? S< 15:58 0:00 [ib_cm/0]
root 4004 0.0 0.0 0 0 ? S< 15:58 0:00 [ib_cm/1]
root 4008 0.0 0.0 0 0 ? S< 15:58 0:00 [sdp_wq/0]
root 4009 0.0 0.0 0 0 ? S< 15:58 0:00 [sdp_wq/1]
...
"kill -9 3999" didn't have the intended effect either.
I'll rebuild with SDP_DEBUG options and see if that changes
it yet again.
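For a process that ignores "kill -9" like the modprobe above, the scheduler state is a useful first check. The helper below is a minimal sketch, not part of the original report: `proc_state` is a made-up name, and it assumes a Linux /proc filesystem. A task that stays in state R at about 100% CPU and cannot be killed is most likely spinning inside the kernel (here, somewhere under sys_init_module), since a fatal signal is only acted on once the task returns to user mode or wakes from a killable sleep.

```shell
# Print the one-letter scheduler state (R, S, D, ...) of a PID,
# read from /proc/<pid>/stat. Returns non-zero if the PID is not readable.
proc_state() {
    pid="$1"
    [ -r "/proc/$pid/stat" ] || return 1
    read -r stat < "/proc/$pid/stat" || return 1
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so drop everything up to the last ") " first.
    rest="${stat##*') '}"
    # The state letter is the first field after the command name.
    echo "${rest%% *}"
}
```

For the case above this would be called as `proc_state 3999`; an `R` result would agree with the `R+` shown in the "ps aux" output.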
grant