[openib-general] Re: Mellanox device in INIT state

Shirley Ma xma at us.ibm.com
Fri Sep 16 13:29:26 PDT 2005


I may have hit this problem: the netdev reference counting problem with ib_at 
that Roland pointed out a week ago. The difference is that I tried to 
remove ib_mthca, not ib_ipoib. The process hung in the kernel and could not 
recover. 

The counter goes to -1 if the interface is brought down first.

With both ib_at and ib_uat loaded, removing the ipoib module without 
bringing the interface down leaves the reference count at 2; with the 
interface down, the reference count is -3. ib_devs_changed() doesn't handle 
these events correctly.

The workaround for now is to always remove ib_at/ib_uat before removing 
ib_ipoib/ib_mthca.
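
A minimal sketch of that removal order as a shell helper. The function name
unload_ib_in_order and the DRY_RUN variable are hypothetical, and removing
ib_uat ahead of ib_at is an assumption (that ib_uat depends on ib_at); the
message above only says that both must go before ib_ipoib/ib_mthca. By
default the helper only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: print (or run) rmmod commands in the order the
# workaround calls for: ib_uat and ib_at first, then ib_ipoib, then ib_mthca.
# DRY_RUN defaults to 1, so nothing is unloaded unless DRY_RUN=0 is set.
unload_ib_in_order() {
    for mod in ib_uat ib_at ib_ipoib ib_mthca; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "rmmod $mod"
        elif grep -q "^$mod " /proc/modules; then
            # Only try to unload modules that are currently loaded.
            rmmod "$mod" || return 1
        fi
    done
}
```

Calling unload_ib_in_order with no settings prints the four rmmod commands in
order; running it as root with DRY_RUN=0 would actually unload the modules.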

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638





From: Grant Grundler <iod00d at hp.com>
Date: 09/14/2005 04:17 PM
To: "Michael S. Tsirkin" <mst at mellanox.co.il>
cc: Grant Grundler <iod00d at hp.com>, Shirley Ma/Beaverton/IBM at IBMUS, openib-general at openib.org
Subject: Re: Mellanox device in INIT state






On Wed, Sep 14, 2005 at 11:26:10AM +0300, Michael S. Tsirkin wrote:
> Seems to be a previous memory corruption that is biting us now.
> Looks like prot->rsk_prot isnt NULL, and prot->name seems to 
> point to zeroed memory. Grant, is this reproducible?

Yes - I think so. At least SDP is generating a segfault/stack
trace on the console when it's loaded.

Now that I'm recording the failures, I'm not certain the previous
two failures were the same.

> If so, could you please try running with the following patch,
> and see what does it print?

yup

> MST
> 
> Index: linux-2.6.13/drivers/infiniband/ulp/sdp/sdp_inet.c
> ===================================================================
> --- linux-2.6.13.orig/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-09-11 12:36:48.000000000 +0300
> +++ linux-2.6.13/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-09-14 13:14:35.000000000 +0300
> @@ -1321,6 +1321,11 @@ static int __init sdp_init(void)
> 
>                sdp_dbg_init("SDP module load.");
> 
> +              printk("sdp_sk_proto.name = %s\n", sdp_sk_proto.name);
> +              printk("sdp_sk_proto.obj_size = %lld\n", (long long)sdp_sk_proto.obj_size);
> +              printk("sdp_init in_interrupt = %d\n", in_interrupt());
> +              printk("sdp_init prot->rsk_prot = %p\n", prot->rsk_prot);

The last printk failed to compile:
vers/infiniband/ulp/sdp/sdp_inet.c:1327: error: 'proto' undeclared (first use in this function)

I assume that was intended to be "sdp_sk_proto.rsk_prot".

Output follows - but with a different failure this time.
Something weird is definitely going on.

gsyprf3:/usr/src/linux-2.6.13# reload_ib 
+ IPoIB=51
+ ifconfig ib0 down
ib0: ERROR while getting interface flags: No such device
+ ifconfig ib1 down
ib1: ERROR while getting interface flags: No such device
+ rmmod ib_ipoib ib_uverbs ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core
ERROR: Module ib_ipoib does not exist in /proc/modules
ERROR: Module ib_uverbs does not exist in /proc/modules
ERROR: Module ib_sdp does not exist in /proc/modules
ERROR: Module ib_cm does not exist in /proc/modules
ERROR: Module ib_sa does not exist in /proc/modules
ERROR: Module ib_mthca does not exist in /proc/modules
ERROR: Module ib_mad does not exist in /proc/modules
ERROR: Module ib_core does not exist in /proc/modules
+ modprobe ib_mthca msi_x=1
ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
ib_mthca: Initializing  ((¥)
GSI 60 (level, low) -> CPU 0 (0x0000) vector 69
ACPI: PCI Interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 69
 (¥: Missing DCS, aborting.
ACPI: PCI interrupt for device 0000:81:00.0 disabled
GSI 60 (level, low) -> CPU 0 (0x0000) vector 69 unregistered
+ modprobe ib_ipoib
+ modprobe ib_sdp
sdp_sk_proto.name = SDP
sdp_sk_proto.obj_size = 1744
sdp_init in_interrupt = 0
sdp_init prot->rsk_prot = 0000000000000000
Uninitialised timer!
This is just a warning.  Your computer is OK
function=0xa0000001008ac990, data=0xa00000020021b600

Call Trace:
 [<a000000100012840>] show_stack+0x80/0xa0
                                sp=e000004041267c50 bsp=e000004041260fe0
 [<a0000001000131d0>] dump_stack+0x30/0x60
                                sp=e000004041267e20 bsp=e000004041260fc8
 [<a0000001000b6f80>] check_timer_failed+0xe0/0x120
                                sp=e000004041267e20 bsp=e000004041260fa8
 [<a0000001000b86a0>] __mod_timer+0x60/0x200
                                sp=e000004041267e20 bsp=e000004041260f68
 [<a0000001000cbfb0>] queue_delayed_work+0x110/0x1c0
                                sp=e000004041267e30 bsp=e000004041260f38
 [<a00000020020bd00>] sdp_link_addr_init+0x1a0/0x3e0 [ib_sdp]
                                sp=e000004041267e30 bsp=e000004041260f10
 [<a000000200138160>] sdp_init+0x160/0x900 [ib_sdp]
                                sp=e000004041267e30 bsp=e000004041260ee8
 [<a0000001000e7ca0>] sys_init_module+0x2e0/0x680
                                sp=e000004041267e30 bsp=e000004041260e60
 [<a00000010000b700>] ia64_ret_from_syscall+0x0/0x20
                                sp=e000004041267e30 bsp=e000004041260e60
 [<a000000000010620>] __kernel_syscall_via_break+0x0/0x20
                                sp=e000004041268000 bsp=e000004041260e60

[ console hangs ]

I can't abort/interrupt the modprobe command and it's not segfaulting
this time. "ps -ef" shows (among other things):
grundler@gsyprf3:~$ ps -ef 
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:32 ?        00:00:04 init [2] 
...
root      3972  2250  0 15:58 ttyS3    00:00:00 /bin/sh -x /usr/local/bin/reload
root      3998     9  0 15:58 ?        00:00:00 [ipoib]
root      3999  3972 99 15:58 ttyS3    00:08:30 modprobe ib_sdp
root      4003     9  0 15:58 ?        00:00:00 [ib_cm/0]
root      4004     9  0 15:58 ?        00:00:00 [ib_cm/1]
root      4008     9  0 15:58 ?        00:00:00 [sdp_wq/0]
root      4009     9  0 15:58 ?        00:00:00 [sdp_wq/1]
...

grundler@gsyprf3:~$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.0   3440  1328 ?        S    15:32   0:04 init [2] 
...
root      3972  0.0  0.1   5584  2624 ttyS3    S+   15:58   0:00 /bin/sh -x /usr
root      3998  0.0  0.0      0     0 ?        S<   15:58   0:00 [ipoib]
root      3999 99.9  0.2   6624  4592 ttyS3    R+   15:58   9:50 modprobe ib_sdp
root      4003  0.0  0.0      0     0 ?        S<   15:58   0:00 [ib_cm/0]
root      4004  0.0  0.0      0     0 ?        S<   15:58   0:00 [ib_cm/1]
root      4008  0.0  0.0      0     0 ?        S<   15:58   0:00 [sdp_wq/0]
root      4009  0.0  0.0      0     0 ?        S<   15:58   0:00 [sdp_wq/1]
...

"kill -9 3999" didn't have the intended effect either.

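For reading the STAT column in the ps aux listing above, here is a small
sketch that maps the leading state letter to the names documented in ps(1).
The function name describe_stat is hypothetical. modprobe shows R+ at 99.9%
CPU, which would be consistent with it looping inside the kernel rather than
sleeping; a task that never returns to user space does not get to act on
SIGKILL, which may be why "kill -9" had no visible effect. That reading is an
inference from the listing, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: translate the first letter of a ps STAT field
# into the process state names documented in ps(1).
describe_stat() {
    case "$1" in
        R*) echo "running or runnable" ;;
        D*) echo "uninterruptible sleep" ;;
        S*) echo "interruptible sleep" ;;
        T*) echo "stopped" ;;
        Z*) echo "zombie" ;;
        *)  echo "unknown" ;;
    esac
}

# STAT value for PID 3999 (modprobe ib_sdp), copied from the ps aux listing.
describe_stat "R+"
```

The kernel worker threads in the same listing ([ipoib], [ib_cm/0], and so on)
show S<, which the helper reports as interruptible sleep.
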

I'll rebuild with SDP_DEBUG options and see if that changes
it yet again.

grant
