[openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs

Christian Guggenberger christian.guggenberger at rzg.mpg.de
Thu Aug 25 02:01:51 PDT 2005


Hi,

On a small two-node setup, I'd like to try some simple MPI programs with
the help of mvapich-gen2 (1.0).
Both nodes are dual-Opteron based, each with a 23108 Tavor, directly
connected (no switch). OpenSM is running on one node. Things like IPoIB
seem to work reliably.

Using 2.6.12.5 (and an svn checkout from Aug 24th), all I get after starting a
simple 2-CPU MPI program is a hard crash of that node (no logs, no
oops, node not pingable, nothing on the console, no SysRq available).

I then tried plain 2.6.13-rc7 (which already contains
ib_uverbs). This is what I get:

test[12173] general protection rip:2aaaab219265 rsp:7fffffcc7c50 error:0
test[12174] general protection rip:2aaaab219265 rsp:7fffff980b90 error:0
general protection fault: 0000 [1] SMP
CPU 1
Modules linked in: ib_ipoib ib_sa ib_ucm ib_cm ib_uverbs ib_umad joydev
sg st sr_mod floppy ipv6 ib_mthca ib_mad ib_core hw_random af_packet
evdev tg3 xfs exportfs dm_snapshot dm_mod ext3 jbd
Pid: 12173, comm: test Not tainted 2.6.13-rc7
RIP: 0010:[<ffffffff881e0e73>]
<ffffffff881e0e73>{:ib_uverbs:__ib_umem_release+67}
RSP: 0018:ffff8100d9a1dc48  EFLAGS: 00010246
RAX: 6b6b6b6b6b6b6b6b RBX: ffff8100e2fffcf0 RCX: 0000000000000000
RDX: 000000000000007f RSI: ffff81007dccc018 RDI: 6b6b6b6b6b6b6b6b
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8100e2fffcf0
R10: ffff8100d9a1dc7f R11: 0000000000003a98 R12: ffff81007dccc000
R13: ffff8100e36c92f0 R14: 0000000000000001 R15: ffff81007fa8e000
FS:  00002aaaab2160a0(0000) GS:ffffffff80571880(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000004204000 CR3: 000000007e324000 CR4: 00000000000006e0
Process test (pid: 12173, threadinfo ffff8100d9a1c000, task
ffff8100e3c3c850)
Stack: ffff8100e2fffcf0 ffff8100e36c9318 6b6b6b6b6b6b6b6b
ffff8100e2fffcf0
       ffff8100e36c92f0 ffff8100e36c92d8 ffff81007fc4a528
ffff81007b2144a8
       ffff810037cfea28 ffffffff881e0eff
Call Trace:<ffffffff881e0eff>{:ib_uverbs:ib_umem_release_on_close+31}
       <ffffffff881de2d5>{:ib_uverbs:ib_uverbs_close+453}
<ffffffff80181912>{__fput+178}
       <ffffffff8017e74e>{filp_close+110}
<ffffffff80136f13>{put_files_struct+115}
       <ffffffff801387bf>{do_exit+511}
<ffffffff80140545>{__dequeue_signal+501}
       <ffffffff801392b0>{sys_exit_group+0}
<ffffffff80142ea7>{get_signal_to_deliver+1415}
       <ffffffff8010d11f>{do_signal+159}
<ffffffff80140b8e>{specific_send_sig_info+222}
       <ffffffff80140deb>{force_sig_info+187}
<ffffffff801106df>{do_general_protection+159}
       <ffffffff8010e34e>{retint_signal+61}

Code: 48 8b 38 e8 25 b0 f3 f7 41 3b 6c 24 10 7d 38 41 8b 45 20 48
RIP <ffffffff881e0e73>{:ib_uverbs:__ib_umem_release+67} RSP
<ffff8100d9a1dc48>
 <1>Fixing recursive fault but reboot is needed!

Any hints regarding a working combination of kernel and openib revision
for use with mvapich-gen2 would be much appreciated.


Thanks in advance,
 - Christian




