[openib-general] kernel crashes while using mvapich-gen2 over ib_uverbs

Weikuan Yu yuw at cse.ohio-state.edu
Thu Aug 25 07:13:43 PDT 2005


Hi, Christian,

It seems the ib_uverbs module fails to clean up leftover pinned memory 
when it closes a user process's context. In my opinion, a failure here 
should at most kill the user process (with or without a core dump), 
not cause a kernel oops like the one reported. Whoever developed the 
code path for umem registration and deregistration can probably help 
more here.
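
One detail in the oops quoted below may help whoever looks at this: RAX 
and RDI both hold 6b6b6b6b6b6b6b6b. That looks like the 0x6b byte the 
slab allocator writes over freed objects when slab debugging is enabled, 
which would be consistent with __ib_umem_release touching memory that 
was already freed. This is only a reading of the register dump, not a 
confirmed diagnosis. A minimal Python sketch for spotting such registers 
in an oops (the helper name poisoned_registers and the regular 
expression are my own, not part of any kernel tool):

```python
import re

# 0x6b is the byte the Linux slab allocator writes over freed objects
# when slab debugging is enabled (POISON_FREE).
POISON_FREE_BYTE = "6b"

# Matches register entries such as "RAX: 6b6b6b6b6b6b6b6b" in an x86-64 oops.
REGISTER_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b")


def poisoned_registers(oops_text):
    """Return the names of registers whose value is entirely 0x6b bytes."""
    poisoned = []
    for name, value in REGISTER_RE.findall(oops_text):
        if value == POISON_FREE_BYTE * 8:
            poisoned.append(name)
    return poisoned


# Two lines copied from the register dump in the quoted oops.
sample = """\
RAX: 6b6b6b6b6b6b6b6b RBX: ffff8100e2fffcf0 RCX: 0000000000000000
RDX: 000000000000007f RSI: ffff81007dccc018 RDI: 6b6b6b6b6b6b6b6b
"""
print(poisoned_registers(sample))  # prints ['RAX', 'RDI']
```

A register full of 0x6b bytes is only a hint; it says a freed object was 
read, not which code path freed it too early.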

> Any hints regarding a working combination of kernel + openib revision
> with respect to mvapich-gen2 would be much appreciated.

As for combinations on dual Opteron, mvapich-gen2 was tested with 
2.6.12.4.tar.gz + gen2-r2984 (userland + kernel) when we made the 
release. We are keen to keep up with the latest gen2 stack, but 
chances are we will lag behind a little.

Thanks,
Weikuan

On Aug 25, 2005, at 5:01 AM, Christian Guggenberger wrote:

> Hi,
>
> On a small, 2-node setup, I'd like to try some simple MPI programs with
> the help of mvapich-gen2 (1.0).
> Both nodes are Dual-Opteron based, with a 23108 tavor each, directly
> connected (no switch). Opensm is running on one node. Things like
> IPOIB seem to work reliably.
>
> Using 2.6.12.5 (and an svn checkout from Aug 24th), all I get after
> starting a simple 2-CPU MPI program is a hard crash of that node (no
> logs, no oops, node not pingable, nothing at the console, no SYSRQ
> available).
>
> I tried to go ahead with plain 2.6.13-rc7 (which already contains
> ib_uverbs). This is what I get then:
>
> test[12173] general protection rip:2aaaab219265 rsp:7fffffcc7c50 error:0
> test[12174] general protection rip:2aaaab219265 rsp:7fffff980b90 error:0
> general protection fault: 0000 [1] SMP
> CPU 1
> Modules linked in: ib_ipoib ib_sa ib_ucm ib_cm ib_uverbs ib_umad joydev
> sg st sr_mod floppy ipv6 ib_mthca ib_mad ib_core hw_random af_packet
> evdev tg3 xfs exportfs dm_snapshot dm_mod ext3 jbd
> Pid: 12173, comm: test Not tainted 2.6.13-rc7
> RIP: 0010:[<ffffffff881e0e73>]
> <ffffffff881e0e73>{:ib_uverbs:__ib_umem_release+67}
> RSP: 0018:ffff8100d9a1dc48  EFLAGS: 00010246
> RAX: 6b6b6b6b6b6b6b6b RBX: ffff8100e2fffcf0 RCX: 0000000000000000
> RDX: 000000000000007f RSI: ffff81007dccc018 RDI: 6b6b6b6b6b6b6b6b
> RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8100e2fffcf0
> R10: ffff8100d9a1dc7f R11: 0000000000003a98 R12: ffff81007dccc000
> R13: ffff8100e36c92f0 R14: 0000000000000001 R15: ffff81007fa8e000
> FS:  00002aaaab2160a0(0000) GS:ffffffff80571880(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000004204000 CR3: 000000007e324000 CR4: 00000000000006e0
> Process test (pid: 12173, threadinfo ffff8100d9a1c000, task
> ffff8100e3c3c850)
> Stack: ffff8100e2fffcf0 ffff8100e36c9318 6b6b6b6b6b6b6b6b
> ffff8100e2fffcf0
>        ffff8100e36c92f0 ffff8100e36c92d8 ffff81007fc4a528
> ffff81007b2144a8
>        ffff810037cfea28 ffffffff881e0eff
> Call Trace:<ffffffff881e0eff>{:ib_uverbs:ib_umem_release_on_close+31}
>        <ffffffff881de2d5>{:ib_uverbs:ib_uverbs_close+453}
> <ffffffff80181912>{__fput+178}
>        <ffffffff8017e74e>{filp_close+110}
> <ffffffff80136f13>{put_files_struct+115}
>        <ffffffff801387bf>{do_exit+511}
> <ffffffff80140545>{__dequeue_signal+501}
>        <ffffffff801392b0>{sys_exit_group+0}
> <ffffffff80142ea7>{get_signal_to_deliver+1415}
>        <ffffffff8010d11f>{do_signal+159}
> <ffffffff80140b8e>{specific_send_sig_info+222}
>        <ffffffff80140deb>{force_sig_info+187}
> <ffffffff801106df>{do_general_protection+159}
>        <ffffffff8010e34e>{retint_signal+61}
>
> Code: 48 8b 38 e8 25 b0 f3 f7 41 3b 6c 24 10 7d 38 41 8b 45 20 48
> RIP <ffffffff881e0e73>{:ib_uverbs:__ib_umem_release+67} RSP
> <ffff8100d9a1dc48>
>  <1>Fixing recursive fault but reboot is needed!
>
> Any hints regarding a working combination of kernel + openib revision
> with respect to mvapich-gen2 would be much appreciated.
>
>
> thanks in advance,
>  - Christian