[ofa-general] Re: InfiniBand card (mthca) in Linux

Lukas Hejtmanek xhejtman at ics.muni.cz
Sat Jul 7 17:15:32 PDT 2007


On Sat, Jul 07, 2007 at 04:24:16PM -0700, Roland Dreier wrote:
> But I don't understand how we could have reached mthca_alloc_icm()
> without getting through mthca_QUERY_FW and printing the FW version
> first... are you sure you're getting all the trace messages?  How are
> you collecting them?  Can you make sure that your console level is set
> so that you see messages printed with KERN_DEBUG?

You are right, the console did not receive debug messages so I changed
mthca_dbg to spam with KERN_ERR priority instead. (This time, it looks like 
corruption gets triggered at another place and the driver complains to not 
receive IRQ).

Here is the result:
# insmod ib_mthca.ko debug_level=1
ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca: Initializing 0000:08:00.0
PCI: Enabling device 0000:08:00.0 (0000 -> 0002)
ib_mthca 0000:08:00.0: FW version 000100020000, max commands 16
ib_mthca 0000:08:00.0: Catastrophic error buffer at 0xb9382a50, size 0x10
ib_mthca 0000:08:00.0: FW supports commands through doorbells
ib_mthca 0000:08:00.0: Mapped doorbell page for posting FW commands
ib_mthca 0000:08:00.0: FW size 5136 KB
ib_mthca 0000:08:00.0: Clear int § b93f00d8, EQ arm § b9361748, EQ set CI § b9372000
ib_mthca 0000:08:00.0: No HCA-attached memory (running in MemFree mode)
ib_mthca 0000:08:00.0: Mapped 1284 chunks/5136 KB for FW.
ib_mthca 0000:08:00.0: Base MM extensions: no
ib_mthca 0000:08:00.0: Max ICM size 523264 MB
ib_mthca 0000:08:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256
ib_mthca 0000:08:00.0: Max SRQs: 1024, reserved SRQs: 64, entry size: 32
ib_mthca 0000:08:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64
ib_mthca 0000:08:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64
ib_mthca 0000:08:00.0: reserved MPTs: 16, reserved MTTs: 2
ib_mthca 0000:08:00.0: Max PDs: 8388608, reserved PDs: 4, reserved UARs: 1
ib_mthca 0000:08:00.0: Max QP/MCG: 8388608, reserved MGMs: 0
ib_mthca 0000:08:00.0: Max CQEs: 131072, max WQEs: 16384, max SRQ WQEs: 16384
ib_mthca 0000:08:00.0: Flags: 00370347
ib_mthca 0000:08:00.0: profile 0--13/11 § 0x               0 (size 0x20000000)
ib_mthca 0000:08:00.0: profile 1--10/20 § 0x        20000000 (size 0x 4000000)
ib_mthca 0000:08:00.0: profile 2-- 0/16 § 0x        24000000 (size 0x 1000000)
ib_mthca 0000:08:00.0: profile 3-- 7/18 § 0x        25000000 (size 0x  800000)
ib_mthca 0000:08:00.0: profile 4-- 9/17 § 0x        25800000 (size 0x  800000)
ib_mthca 0000:08:00.0: profile 5-- 3/16 § 0x        26000000 (size 0x  400000)
ib_mthca 0000:08:00.0: profile 6-- 4/16 § 0x        26400000 (size 0x  400000)
ib_mthca 0000:08:00.0: profile 7-- 8/13 § 0x        26800000 (size 0x   80000)
ib_mthca 0000:08:00.0: profile 8--11/11 § 0x        26880000 (size 0x   10000)
ib_mthca 0000:08:00.0: profile 9-- 2/10 § 0x        26890000 (size 0x    8000)
ib_mthca 0000:08:00.0: profile10-- 1/ 0 § 0x        26898000 (size 0x    1000)
ib_mthca 0000:08:00.0: profile11-- 5/ 0 § 0x        26899000 (size 0x    1000)
ib_mthca 0000:08:00.0: profile12-- 6/ 5 § 0x        2689a000 (size 0x    1000)
ib_mthca 0000:08:00.0: profile13--12/ 0 § 0x        2689b000 (size 0x    1000)
ib_mthca 0000:08:00.0: HCA context memory: reserving 631408 KB
ib_mthca 0000:08:00.0: 631408 KB of HCA context requires 1244 KB aux memory.
ib_mthca 0000:08:00.0: Mapped 311 chunks/1244 KB for ICM aux.
ib_mthca 0000:08:00.0: Mapped page at 24d8b000 to 2689a000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 20000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 1 chunks/256 KB at 25800000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 24000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26400000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26000000 for ICM.
ib_mthca 0000:08:00.0: Mapped 8 chunks/32 KB at 26890000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26800000 for ICM.
ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26840000 for ICM.
Unable to handle kernel paging request at 0000001100000019 RIP: 
 <ffffffff803654eb> datagram_poll+0xcc/0xd6
PGD 0 
Oops: 0002 1 SMP 
CPU 0 
Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Pid: 2170, comm: ntpd Not tainted 2.6.18-xen31-smp #6
RIP: e030:<ffffffff803654eb>  <ffffffff803654eb> datagram_poll+0xcc/0xd6
RSP: e02b:ffff880095e87a88  EFLAGS: 00010246
RAX: 0000001100000011 RBX: ffff8800971e2ac8 RCX: 000000000000000b
RDX: 0000000000000000 RSI: 0000000000000049 RDI: 0000000000000002
RBP: ffff88009c8ad390 R08: ffff880095e86000 R09: ffff880095e87760
R10: ffffffff803a492f R11: ffffffff803a492f R12: 0000000000000005
R13: 0000000000000020 R14: ffff880095e87ef8 R15: 0000000000000008
FS:  00002aaaab383ee0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process ntpd (pid: 2170, threadinfo ffff880095e86000, task ffff88009c0977e0)
Stack:  ffff8800973547b0 ffffffff803a4942 ffffffff803a492f 0000000000000300
 ffff88009c8ad390 0000000000000005 0000000000000020 ffffffff8028e1d3
 ffff880095e87f40 0000000000000000 ffff880095e87e10 ffff880095e87e18
Call Trace:
 <ffffffff803a4942> udp_poll+0x13/0xf3
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff8028e1d3> do_select+0x2aa/0x464
 <ffffffff8028de4c> __pollwait+0x0/0xdd
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8035ff8f> sock_common_recvmsg+0x2d/0x43
 <ffffffff8035f7da> sock_recvmsg+0x101/0x120
 <ffffffff80277073> poison_obj+0x24/0x2d
 <ffffffff80277275> cache_free_debugcheck+0x1f9/0x209
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff80277cea> kmem_cache_free+0xd0/0x140
 <ffffffff8028e600> sys_select+0x273/0x3e5
 <ffffffff8020fd0c> init_fpu+0x62/0x7f
 <ffffffff8020ab90> math_state_restore+0x21/0x4a
 <ffffffff8020a117> error_exit+0x0/0x71
 <ffffffff80208f99> sys_rt_sigreturn+0x251/0x301
 <ffffffff802099da> system_call+0x86/0x8b
 <ffffffff80209954> system_call+0x0/0x8b


Code: f0 0f ba 68 08 00 5b 89 f0 c3 41 57 41 89 f7 41 56 41 55 41 
RIP  <ffffffff803654eb> datagram_poll+0xcc/0xd6
 RSP <ffff880095e87a88>
CR2: 0000001100000019
 <1>Unable to handle kernel paging request at 0000000d0000001d RIP: 
 <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28
PGD 0 
Oops: 0000 2 SMP 
CPU 0 
Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Pid: 2170, comm: ntpd Not tainted 2.6.18-xen31-smp #6
RIP: e030:<ffffffff881484e3>  <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28
RSP: e02b:ffff880095e87828  EFLAGS: 00010246
RAX: ffff8800971e4078 RBX: ffff88009cbe0bd0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000d0000000d
RBP: ffff880000b1e998 R08: ffff880000c39280 R09: 0000000000000298
R10: ffff880097eb1860 R11: 0000000000000298 R12: ffff880000b1e9a8
R13: 0000000000000009 R14: 0000000000000000 R15: 0000000000000001
FS:  00002aaaab383ee0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process ntpd (pid: 2170, threadinfo ffff880095e86000, task ffff88009c0977e0)
Stack:  ffffffff881484c7 ffffffff8027aa6a ffff880000b1e998 0000000000000001
 ffff880000b1e9a8 ffffffff8022c9f1 000009c000000a00 ffff880000b1e998
 ffff88009c0977e0 0000000000000001 0000000000000009 ffff880095e879d8
Call Trace:
 <ffffffff881484c7> :xfs:xfs_file_close+0x0/0x28
 <ffffffff8027aa6a> filp_close+0x36/0x64
 <ffffffff8022c9f1> put_files_struct+0x6c/0xbf
 <ffffffff8022dcab> do_exit+0x2ae/0x929
 <ffffffff8020622a> hypercall_page+0x22a/0x1000
 <ffffffff803c6783> do_page_fault+0x119e/0x1253
 <ffffffff8020de8b> monotonic_clock+0x3e/0x86
 <ffffffff803c1abd> thread_return+0x0/0x13d
 <ffffffff8020a117> error_exit+0x0/0x71
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff803654eb> datagram_poll+0xcc/0xd6
 <ffffffff80365440> datagram_poll+0x21/0xd6
 <ffffffff803a4942> udp_poll+0x13/0xf3
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff8028e1d3> do_select+0x2aa/0x464
 <ffffffff8028de4c> __pollwait+0x0/0xdd
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8022536b> default_wake_function+0x0/0xe
 <ffffffff8035ff8f> sock_common_recvmsg+0x2d/0x43
 <ffffffff8035f7da> sock_recvmsg+0x101/0x120
 <ffffffff80277073> poison_obj+0x24/0x2d
 <ffffffff80277275> cache_free_debugcheck+0x1f9/0x209
 <ffffffff803a492f> udp_poll+0x0/0xf3
 <ffffffff80277cea> kmem_cache_free+0xd0/0x140
 <ffffffff8028e600> sys_select+0x273/0x3e5
 <ffffffff8020fd0c> init_fpu+0x62/0x7f
 <ffffffff8020ab90> math_state_restore+0x21/0x4a
 <ffffffff8020a117> error_exit+0x0/0x71
 <ffffffff80208f99> sys_rt_sigreturn+0x251/0x301
 <ffffffff802099da> system_call+0x86/0x8b
 <ffffffff80209954> system_call+0x0/0x8b


Code: 48 8b 47 10 ff 50 10 41 5b f7 d8 c3 31 c0 48 83 ff 28 51 74 
RIP  <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28
 RSP <ffff880095e87828>
CR2: 0000000d0000001d
 <1>Fixing recursive fault but reboot is needed!
syslog-ng1981: segfault at ffffffff80808080 rip 000055555555bc28 rsp 00007fffffffd868 error 6
Slab corruption: start=ffff880096c743b8, len=256
Redzone: 0x1600000016/0x1700000017.
Last user: <0000001800000018>(0x1800000018)
000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00
010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00
020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00
030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00
040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00
Prev obj: start=0000000396c7420b, len=256
Unable to handle kernel paging request at 0000000396c7430b RIP: 
 <ffffffff80277313> print_objinfo+0x22/0xde
PGD 0 
Oops: 0000 3 SMP 
CPU 0 
Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core
Pid: 1981, comm: syslog-ng Not tainted 2.6.18-xen31-smp #6
RIP: e030:<ffffffff80277313>  <ffffffff80277313> print_objinfo+0x22/0xde
RSP: e02b:ffff8800935cdb48  EFLAGS: 00010206
RAX: 0000000396c7430b RBX: 000000000089dac1 RCX: ffffffffff57c000
RDX: 0000000000000002 RSI: 0000000396c74203 RDI: ffff8800015f20c0
RBP: ffff8800015f20c0 R08: ffff880000cbc788 R09: 000000000000d64e
R10: ffff8800935cda98 R11: ffffffff802fd0b5 R12: 0000000396c74203
R13: 0000000000000002 R14: ffff880096c743b0 R15: ffff880096c74000
FS:  00002aaaab0186e0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process syslog-ng (pid: 1981, threadinfo ffff8800935cc000, task ffff88009c0e0860)
Stack:  000000000089dac1 ffff8800015f20c0 0000000396c74203 0000000000000100
 ffff880096c743b0 ffffffff80277521 ffff8800015f20c0 0000000000000000
 ffff8800015f20c0 ffff880096c743b0 ffffffff802ab3b1 00000000000000d0
Call Trace:
 <ffffffff80277521> check_poison_obj+0x152/0x1ae
 <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d
 <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d
 <ffffffff80278269> cache_alloc_debugcheck_after+0x34/0x1b0
 <ffffffff802784d7> kmem_cache_alloc+0xf2/0x102
 <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d
 <ffffffff8027a67c> do_truncate+0x60/0x69
 <ffffffff802858f4> do_coredump+0x5a0/0x601
 <ffffffff80277cea> kmem_cache_free+0xd0/0x140
 <ffffffff80236db4> __dequeue_signal+0x18b/0x19a
 <ffffffff80238129> get_signal_to_deliver+0x4ee/0x549
 <ffffffff80209115> do_signal+0x55/0x6d8
 <ffffffff803c67db> do_page_fault+0x11f6/0x1253
 <ffffffff88126d95> :xfs:xfs_iunlock+0x4f/0x7a
 <ffffffff8813eb19> :xfs:xfs_fsync+0x157/0x1a9
 <ffffffff80256c9f> __filemap_fdatawrite_range+0x51/0x5b
 <ffffffff8020a055> retint_signal+0x5d/0xb8


Code: 48 8b 18 48 89 ef e8 11 fd ff ff 48 8b 30 48 c7 c7 da c3 3e 
RIP  <ffffffff80277313> print_objinfo+0x22/0xde
 RSP <ffff8800935cdb48>
CR2: 0000000396c7430b
 <3>ib_mthca 0000:08:00.0: Memory key throughput optimization activated.
ib_mthca 0000:08:00.0: Allocated EQ 1 with 131072 entries
ib_mthca 0000:08:00.0: Allocated EQ 2 with 512 entries
ib_mthca 0000:08:00.0: Setting mask 00000000001f47fe for eqn 2
ib_mthca 0000:08:00.0: NOP command failed to generate interrupt (IRQ 16), aborting.
ib_mthca 0000:08:00.0: BIOS or ACPI interrupt routing problem?
ib_mthca 0000:08:00.0: Clearing mask 00000000001f47fe for eqn 2
ib_mthca 0000:08:00.0: HW2SW_EQ failed (-11)
ib_mthca 0000:08:00.0: HW2SW_EQ returned status 0xff
ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:08:00.0: HW2SW_EQ failed (-11)
ib_mthca 0000:08:00.0: HW2SW_EQ returned status 0xff
ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11)
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26800000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26840000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26890000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26000000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 26400000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 24000000 from ICM.
ib_mthca 0000:08:00.0: Unmapping 64 pages at 25800000 from ICM.


-- 
Lukáš Hejtmánek



More information about the general mailing list