[openib-general] problems registering memory as a regular user on SLES9 SP3

Christian Guggenberger christian.guggenberger at rzg.mpg.de
Sun Sep 3 10:53:46 PDT 2006


Hi,
On Tue, Aug 29, 2006 at 05:49:32PM +0300, Tziporet Koren wrote:
> Hi All,
> In testing today we found that on SLES9 SP3 memory locking as a regular 
> user fails.
Has any progress been made on this?

I'd like to ask whether the SLES9 port is really mature yet: I went a
step further and ran some trivial MPI code as root, but it failed and
hard-locked the node.
Testing was done on a single x86_64 SMP node (2 CPUs) with a Mellanox
PCI-X HCA (23108, FW-3.5.0). Software environment: SLES9 SP3-latest,
OFED-1.1-rc3 and mvapich2-0.9.5.
Attached is a simple MPI program that causes the hard lock. Also attached
are some kernel BUG messages captured via serial console; unfortunately,
parts of them are garbled.
Note that everything works fine if I use a recent vanilla kernel on that
SLES9 machine.

cheers.
 - Christian

-- 
-----------------------------------------------------------
Phone	+49-89-3299-1306
PGP 	http://www.rzg.mpg.de/~ccg/cg-public_key.asc
S/MIME 	http://ra.rzg.mpg.de
-----------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 1260 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20060903/f1a48542/attachment.c>
-------------- next part --------------
Kernel BUG at page_alloc:853
invalid operand: 0000 [1] SMP
CPU 0
Pid: 7092, comm: hanger Tainted: PF  U   (2.6.5-7.276-smp SLES9_SP3_BRANCH-20060724104531)
RIP: 0010:[<ffffffff8016ad9e>] <ffffffff8016ad9e>{__free_pages+30}
RSP: 0018:00000100e3fdbbf0  EFLAGS: 00010256
RAX: 0000000000000000 RBX: 00000100e72d1280 RCX: 000001000000d000
RDX: 0000010002a1c4d8 RSI: 0000000000000000 RDI: 0000010002a1c4d8
RBP: 00000100e3fdbcc8 R08: 00000100e3fda000 R09: 0000000000000002
R10: 0000000000000064 R11: 0000000000000001 R12: 0000000000000000
R13: 00000100e72d1280 R14: 000001007e644d90 R15: 00000000000493e0
FS:  0000002a95bb5b00(0000) GS:ffffffff8057dc00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000041b009 CR3: 0000000000101000 CR4: 00000000000006e0
Process hanger (pid: 7092, threadinfo 00000100e3fda000, task 000001007e644d90)
Stack: ffffffff8013bd3f 0000000000000000 ffffffff801395a0 ffffffff803d3400
       0000000000000246 00000000000339b3 0000000000000202 0000010002c1c600
       000000000000006a 0000010002c1d6e0
Call Trace:<ffffffff8013bd3f>{__mmdrop+63} <ffffffff801395a0>{thread_return+108}
       <ffffffff801467b0>{process_timeout+0} <ffffffff80147376>{schedule_timeout+246}
       <ffffffff801467b0>{process_timeout+0} <ffffffffa017f460>{:ib_mthca:mthca_cmd_wait+448}
       <ffffffff80135cd0>{default_wake_function+0} <ffffffff80135cd0>{default_wake_function+0}
       <ffffffffa017f622>{:ib_mthca:mthca_cmd_box+66} <ffffffffa017fd59>{:ib_mthca:mthca_HW2SW_MPT+57}
       <ffffffffa0189423>{:ib_mthca:mthca_free_mr+67} <ffffffffa019014f>{:ib_mthca:mthca_dereg_mr+15}
       <ffffffffa0149e3a>{:ib_core:ib_dereg_mr+26} <ffffffffa01e5543>{:ib_uverbs:ib_uverbs_close+611}
       <ffffffff8018e332>{__fput+98} <ffffffff80189ffe>{filp_close+126}
       <ffffffff8018a105>{sys_close+229} <ffffffff801106b4>{system_call+124}


Code: 0f 0b f4 8b 38 80 ff ff ff ff 55 03 66 66 90 66 66 90 f0 83
RIP <ffffffff8016ad9e>{__free_pages+30} RSP <00000100e3fdbbf0>
 ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at page_alloc:853
invalid operand: 0000 [2] SMP
CPU 1
Pid: 1, comm: init Tainted: PF  U   (2.6.5-7.276-smp SLES9_SP3_BRANCH-20060724104531)
RIP: 0010:[<ffffffff8016ad9e>] <ffffffff8016ad9e>{__free_pages+30}
RSP: 0018:000001007ff81c80  EFLAGS: 00010256
RAX: 0000000000000000 RBX: 000001007e1e4980 RCX: 0000010080000000
RDX: 00000100815b6068 RSI: 0000000000000000 RDI: 00000100815b6068
RBP: 000001007ff81d58 R08: 000001007ff80000 R09: 0000000000000013
R10: 00000000000493e0 R11: 0000000000002710 R12: 0000000000000001
R13: 000001007e1e4980 R14: 00000100e7f3f2c0 R15: 00000000000493e0
FS:  0000002a95bb5b00(0000) GS:ffffffff8057dc80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000041b009 CR3: 000000007ff82000 CR4: 00000000000006e0
Process init (pid: 1, threadinfo 000001007ff80000, task 00000100e7f3f2c0)
Stack: ffffffff8013bd3f 0000000000000040 ffffffff801395a0 00000100e7f3e9a0
       000000d07f8a1580 0000000000000246 0000000000000001 00000100816f5580
       000000010000007d 00000100816f6660
Call Trace:<ffffffff8013bd3f>{__mmdrop+63} <ffffffff801395a0>{thread_return+108}
       <ffffffff80147376>{schedule_timeout+246} <ffffffff801467b0>{process_timeout+0}
       <ffffffff801a3f61>{do_select+1105} <ffffffff801a35a0>{__pollwait+0}
       <ffffffff801a4366>{sys_select+902} <ffffffff801106b4>{system_call+124}


Code: 0f 0b f4 8b 38 80 ff ff ff ff 55 03 66 66 90 66 66 90 f0 83
RIP <ffffffff8016ad9e>{__free_pages+30} RSP <000001007ff81c80>
 b-<-0>-K--er--ne--l - p[ancuict : hAertte em] pt-e--d --to-- k-i- ll[p ileniatse!
  ite here B] ad-- p--a-ge-- s--ta
roK aert nferl eBe_UhG otat_c poaldge_p_aaglle oc(:in85 p3
0 ceinssv al'hidan ogeper'ra, ndpa: ge00 00000 [0301] 008SM1P5b 6
 68)CP
U f0 la<gs4>:0
x0P50id00:0 58025 m9,ap cpionmmg:: 00kl00og00d 00Ta00in00te0d00: 0 PFma  ppU ed  :(0 2.co6.un5-t:7.0 2p76ri-svampte S:0LxES009_00SP003_00BR
ANBCHac-2kt00r6ac07e:24
104
l3C1)al
t_RTrIPac: e:00<10ff:[ff<ffffffff8ff01ff6a806a168>ad{b9ead>]_p ag<ef+f1f2f0f}ff f80<f16ffadff9eff>{f8__0f16reaae_7fpa>{gefrs+ee30_h}o
  cRolSPd_: pa00ge18+1:0403}00<014>00 e4
e87  d4  0    E FL<fAGffS:ff 0f0ff0180021356bd
3fR>{AX__: mm0d0r0o00p+006300}00<04>00 0<0f RffBXff: ff00f800001310950ea072>{d1th28r0ea Rd_CXre: tu0r00n+0010108}000 00
  0  0
se  RD X:< 04>00<f00ff10ff00ff2af81c014da887 R41SI>{: dp00ut00+30300}00 00<00ff00ff0 ffRDfIf8: 01008900ff0e10>{00fi2alp1c_c4dlo8
10+1RB26P:} 0<400> 00
ff0e  4 e8  7e  18 <R0ff8:ff f00ff00f80101080ea14e0586>{00s0ys R_c09lo: se00+022009}000 00<0ff00ff01ff3
0080R1110:07 01e00>0{s00ys00re00t_04c9ar3eef0 ulR+1113: }0<004>00 0
R10 00  02  71 0
  2:Tr 0yi00ng00 0to00 f00ix00 i00t 00up
, Rbu13t : a0 r00eb00oo10t 0eis72 dne12ed80ed R
02:ha 0ng00er0[01700093e4]:1d sf4egb0f auR1lt5: a 0t 0000000000000200a904579381e03
0  FrSip:  0 0000000000202a9a9575889134b0200 r(s00p 0000) 00GS00:f7fffbfffffffff0f808 5e7drrc0or0( 1004
 0) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000041b009 CR3: 0000000000101000 CR4: 00000000000006e0
Process klogd (pid: 5259, threadinfo 00000100e4e86000, task 00000100e41df4b0)
Stack: ffffffff8013bd3f 0000000000008040 ffffffff801395a0 00000100e395f5b0
       0000000000000002 0000002a9556c010 000000000003ffff 0000000000040000
       000000009566b1c0 0000010002c1d6e0
Call Trace:<ffffffff8013bd3f>{__mmdrop+63} <ffffffff801395a0>{thread_return+108}
       <ffffffff8018d5bd>{do_sync_write+173} <ffffffff8013ea60>{do_syslog+384}
       <ffffffff8013d430>{autoremove_wake_function+0} <ffffffff8013d430>{autoremove_wake_function+0}
       <ffffffff801c9022>{kmsg_read+66} <ffffffff8018db84>{vfs_read+244}
       <ffffffff8018dddd>{sys_read+157} <ffffffff801106b4>{system_call+124}


Code: 0f 0b f4 8b 38 80 ff ff ff ff 55 03 66 66 90 66 66 90 f0 83
RIP <ffffffff8016ad9e>{__free_pages+30} RSP <00000100e4e87d40>
 ----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at page_alloc:853
 <0>invalid operand: 0000 [4] SMP
CPU 1
Pid: 7091, comm: python2.3 Tainted: PF  U B (2.6.5-7.276-smp SLES9_SP3_BRANCH-20060724104531)
RIP: 0010:[<ffffffff8016ad9e>] <ffffffff8016ad9e>{__free_pages+30}
RSP: 0000:00000100e32c3c80  EFLAGS: 00010256
RAX: 0000000000000000 RBX: 000001007e1e4980 RCX: 0000010080000000
RDX: 00000100815b6068 RSI: 0000000000000000 RDI: 00000100815b6068
RBP: 00000100e32c3d58 R08: 00000100e32c2000 R09: 0000000000000013
R10: 00000000000493e0 R11: 0000000000002710 R12: 0000000000000001
R13: 000001007e1e4980 R14: 000001007ec47590 R15: 00000000000493e0
FS:  0000002a96202320(0000) GS:ffffffff8057dc80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a95781302 CR3: 000000007ff82000 CR4: 00000000000006e0
Process python2.3 (pid: 7091, threadinfo 00000100e32c2000, task 000001007ec47590)
Stack: ffffffff8013bd3f 0000000000000504 ffffffff801395a0 000001007e7cedb0
       0000010077509c80 0000000000000256 0000000080004380 00000100816f5580
       000000010000007d 00000100816f6660
Call Trace:<ffffffff8013bd3f>{__mmdrop+63} <ffffffff801395a0>{thread_return+108}
       <ffffffff80147376>{schedule_timeout+246} <ffffffff801467b0>{process_timeout+0}
       <ffffffff801a3f61>{do_select+1105} <ffffffff802e4dc6>{sys_sendto+246}
       <ffffffff801a35a0>{__pollwait+0} <ffffffff801a4366>{sys_select+902}
       <ffffffff801106b4>{system_call+124}

Code: 0f 0b f4 8b 38 80 ff ff ff ff 55 03 66 66 90 66 66 90 f0 83
RIP <ffffffff8016ad9e>{__free_pages+30} RSP <00000100e32c3c80>
 <1>Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
<ffffffff80137703>{find_busiest_group+659}
PML4 e3371067 PGD e3374067 PMD 0
Oops: 0000 [5] SMP
CPU 1
Pid: 7091, comm: python2.3 Tainted: PF  U B (2.6.5-7.276-smp SLES9_SP3_BRANCH-20060724104531)
RIP: 0010:[<ffffffff80137703>] <ffffffff80137703>{find_busiest_group+659}
RSP: 0000:00000100e7e07df0  EFLAGS: 00010006
RAX: 00000100e7e07eb8 RBX: 0000000000000000 RCX: 0000000000000080
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000000040
RBP: 00000100e7e07e90 R08: 0000000000000040 R09: ffffffff805c3200
R10: 0000000000000064 R11: 00000000000002ff R12: 00000000000002ff
R13: ffffffff804aa7a0 R14: 0000000000000001 R15: 0000000000000000
FS:  0000002a96202320(0000) GS:ffffffff8057dc80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000018 CR3: 000000007ff82000 CR4: 00000000000006e0
Process python2.3 (pid: 7091, threadinfo 00000100e32c2000, task 000001007ec47590)
Stack: 00000100e7e07e50 0000000000000000 0000000000000001 0000000000000080
       0000000000000000 0000000000017f80 0000000000000000 ffffffff804aa780
       0000000102ebfb80 00000100e7e07eb8
Call Trace:<IRQ> <ffffffff8013a63c>{rebalance_tick+460} <ffffffff8011d674>{smp_apic_timer_interrupt+52}
       <ffffffff80110e27>{apic_timer_interrupt+99} <ffffffff8011c93f>{smp_stop_cpu+31}
       <ffffffff8011c949>{smp_really_stop_cpu+9} <ffffffff8011c8b0>{smp_call_function_interrupt+64}
       <ffffffff80110dbf>{call_function_interrupt+99}  <EOI> <ffffffff80111bf3>{oops_end+35}
       <ffffffff80111be5>{oops_end+21} <ffffffff801124fb>{die+59}
       <ffffffff80112d21>{do_invalid_op+145} <ffffffff8016ad9e>{__free_pages+30}
       <ffffffff8031d197>{tcp_transmit_skb+1479} <ffffffff80110f79>{error_exit+0}
       <ffffffff8016ad9e>{__free_pages+30} <ffffffff8013bd3f>{__mmdrop+63}
       <ffffffff801395a0>{thread_return+108} <ffffffff80147376>{schedule_timeout+246}
       <ffffffff801467b0>{process_timeout+0} <ffffffff801a3f61>{do_select+1105}
       <ffffffff802e4dc6>{sys_sendto+246} <ffffffff801a35a0>{__pollwait+0}
       <ffffffff801a4366>{sys_select+902} <ffffffff801106b4>{system_call+124}


Code: 48 8b 43 18 48 39 c8 48 0f 47 c1 48 0f af d0 48 c1 ea 07 48
RIP <ffffffff80137703>{find_busiest_group+659} RSP <00000100e7e07df0>
CR2: 0000000000000018