[ofa-general] OOM problem with ib_ipoib?
John Marshall
John.Marshall at ec.gc.ca
Wed Oct 22 11:16:26 PDT 2008
Hi,
Summary: I believe the ib_ipoib module is causing an OOM problem; I do
not see the problem until the module is loaded. It shows up when
the kernel page cache holding file data (grep Cached /proc/meminfo)
has grown to its maximum. Normally, the cached data would be
written out and released by pdflush. In this case, it is not.
Notes:
1) It is NOT necessary for the ib interfaces to be up or
in use!
2) I am using OFED 1.3.2, which I built on my own
machine.
3) I see similarly odd behavior with OFED 1.4-rc3
and a 2.6.26 kernel.
----------
System info:
root# lsmod | grep ib
ib_ipoib 77512 0
ib_cm 33260 1 ib_ipoib
ib_sa 36628 2 ib_ipoib,ib_cm
ib_mthca 124832 0
ib_umad 16232 0
ib_uverbs 38792 0
ib_mad 35188 4 ib_cm,ib_sa,ib_mthca,ib_umad
ib_core 54304 7 ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_umad,ib_uverbs,ib_mad
ipv6 242980 29 ib_ipoib
libata 145584 1 ata_generic
scsi_mod 142316 6 sg,sr_mod,usb_storage,sd_mod,megaraid_sas,libata
root# uname -r
2.6.24-etchnhalf.1-686-bigmem
root# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 8220
stepping : 3
cpu MHz : 2793.163
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy ts fid vid ttp tm stc
bogomips : 5589.70
clflush size : 64
***** 7 more similar entries (4 CPUs, 2 cores each) *****
root# cat /proc/meminfo
MemTotal: 33274492 kB
MemFree: 147716 kB
Buffers: 840 kB
Cached: 32532792 kB
SwapCached: 0 kB
Active: 19956 kB
Inactive: 32524692 kB
HighTotal: 32635808 kB
HighFree: 77008 kB
LowTotal: 638684 kB
LowFree: 70708 kB
SwapTotal: 16386260 kB
SwapFree: 16386168 kB
Dirty: 88 kB
Writeback: 0 kB
AnonPages: 11032 kB
Mapped: 7940 kB
Slab: 537012 kB
SReclaimable: 487100 kB
SUnreclaim: 49912 kB
PageTables: 656 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 33023504 kB
Committed_AS: 61360 kB
VmallocTotal: 118776 kB
VmallocUsed: 96800 kB
VmallocChunk: 13112 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
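For anyone trying to reproduce this, the few /proc/meminfo fields that matter here (Cached, Dirty, LowFree) can be pulled out with a few lines of Python. This is only a sketch: the names parse_meminfo and summarize are made up for illustration, and the sample text is copied from the dump above. LowTotal and LowFree may be absent on kernels built without highmem support, so they are treated as optional.

```python
import re

# Sample copied from the /proc/meminfo dump above (only the fields used here).
SAMPLE = """\
MemTotal: 33274492 kB
MemFree: 147716 kB
Cached: 32532792 kB
LowTotal: 638684 kB
LowFree: 70708 kB
Dirty: 88 kB
Writeback: 0 kB
HugePages_Total: 0
"""

def parse_meminfo(text):
    """Map each 'Name: value [kB]' line to an integer value."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields

def summarize(fields):
    """Return the figures of interest; None where a field is missing."""
    def pct(part, whole):
        if part not in fields or not fields.get(whole):
            return None
        return 100.0 * fields[part] / fields[whole]
    return {
        "cached_pct": pct("Cached", "MemTotal"),
        "lowfree_pct": pct("LowFree", "LowTotal"),
        "dirty_kb": fields.get("Dirty"),
    }

if __name__ == "__main__":
    # Swap SAMPLE for open("/proc/meminfo").read() to sample a live system.
    for name, value in summarize(parse_meminfo(SAMPLE)).items():
        print(name, value)
```

On the figures above this gives about 97.8% of RAM in the page cache and roughly 11% of low memory free, with almost nothing dirty.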
root# dpkg -l | grep ofed
ii libibcm 1.0.2-1 ofed-1.3.2: libibcm
ii libibcommon 1.0.8-1 ofed-1.3.2: libibcommon
ii libibmad 1.1.6-1 ofed-1.3.2: libibmad
ii libibumad 1.1.7-1 ofed-1.3.2: libibumad
ii libibverbs 1.1.1-1 ofed-1.3.2: libibverbs
ii libipathverbs 1.1-1 ofed-1.3.2: libipathverbs
ii libmlx4 1.0-1 ofed-1.3.2: libmlx
ii libmthca 1.0.4-1 ofed-1.3.2: libmthca
ii librdmacm 1.0.7-1 ofed-1.3.2: librdmacm
ii libsdp 1.1.99-1 ofed-1.3.2: libsdp
ii ofa-kernel 1.3.2-2.6.24-etchnhalf.1-686-bigmem-1 ofed-1.3.2: ofa_kernel
----------
How to provoke #1 (prior to loading ib_ipoib):
non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
root# modprobe ib_ipoib
Output from dmesg:
modprobe: page allocation failure. order:1, mode:0x20
Pid: 6839, comm: modprobe Not tainted 2.6.24-etchnhalf.1-686-bigmem #1
[<c0161904>] __alloc_pages+0x2c4/0x2d5
[<c017a05c>] cache_alloc_refill+0x299/0x4b1
[<c017a2e9>] __kmalloc+0x75/0xbc
[<c025eafb>] __alloc_skb+0x49/0xf5
[<f8d4677f>] ipoib_cm_alloc_rx_skb+0x31/0x218 [ib_ipoib]
[<f8d48c09>] ipoib_cm_dev_init+0x50c/0x552 [ib_ipoib]
[<c0249944>] dma_pool_free+0xb0/0x18c
[<f8d45bed>] ipoib_transport_dev_init+0xd2/0x3d1 [ib_ipoib]
[<f8d42c6d>] ipoib_ib_dev_init+0x2c/0x6e [ib_ipoib]
[<f8d3f7b3>] ipoib_dev_init+0xab/0xd0 [ib_ipoib]
[<f8d3f9f8>] ipoib_add_one+0x220/0x3cf [ib_ipoib]
[<c011fef8>] resched_task+0x52/0x54
[<f89b13e2>] ib_register_client+0x48/0x6c [ib_core]
[<f89890d2>] ipoib_init_module+0xd2/0xf8 [ib_ipoib]
[<c0145a27>] sys_init_module+0x15e3/0x16fb
[<c0166432>] vma_prio_tree_insert+0x17/0x2a
[<c017a274>] __kmalloc+0x0/0xbc
[<c0103ede>] syscall_call+0x7/0xb
=======================
Mem-info:
DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 4: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 5: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 6: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 7: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 121 Cold: hi: 62, btch: 15 usd: 58
CPU 1: Hot: hi: 186, btch: 31 usd: 42 Cold: hi: 62, btch: 15 usd: 26
CPU 2: Hot: hi: 186, btch: 31 usd: 152 Cold: hi: 62, btch: 15 usd: 57
CPU 3: Hot: hi: 186, btch: 31 usd: 63 Cold: hi: 62, btch: 15 usd: 59
CPU 4: Hot: hi: 186, btch: 31 usd: 72 Cold: hi: 62, btch: 15 usd: 55
CPU 5: Hot: hi: 186, btch: 31 usd: 174 Cold: hi: 62, btch: 15 usd: 61
CPU 6: Hot: hi: 186, btch: 31 usd: 66 Cold: hi: 62, btch: 15 usd: 48
CPU 7: Hot: hi: 186, btch: 31 usd: 35 Cold: hi: 62, btch: 15 usd: 54
HighMem per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 9
CPU 1: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 5
CPU 2: Hot: hi: 186, btch: 31 usd: 93 Cold: hi: 62, btch: 15 usd: 8
CPU 3: Hot: hi: 186, btch: 31 usd: 3 Cold: hi: 62, btch: 15 usd: 14
CPU 4: Hot: hi: 186, btch: 31 usd: 37 Cold: hi: 62, btch: 15 usd: 53
CPU 5: Hot: hi: 186, btch: 31 usd: 67 Cold: hi: 62, btch: 15 usd: 49
CPU 6: Hot: hi: 186, btch: 31 usd: 15 Cold: hi: 62, btch: 15 usd: 30
CPU 7: Hot: hi: 186, btch: 31 usd: 138 Cold: hi: 62, btch: 15 usd: 61
Active:5136 inactive:8135705 dirty:12 writeback:0 unstable:0
free:15715 slab:136280 mapped:2348 pagetables:164 bounce:0
DMA free:3524kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16256kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 873 34020 34020
Normal free:1368kB min:3744kB low:4680kB high:5616kB active:288kB inactive:252kB present:894080kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 0 265176 265176
HighMem free:59588kB min:512kB low:36080kB high:71652kB active:20256kB inactive:32541032kB present:33942528kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3496kB
Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1232kB
HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB 2*1024kB 5*2048kB 11*4096kB = 61120kB
Swap cache: add 27, delete 27, find 1/2, race 0+0
Free swap = 16386168kB
Total swap = 16386260kB
Free swap: 16386168kB
8781824 pages of RAM
8552448 pages of HIGHMEM
463201 reserved pages
8140201 pages shared
0 pages swap cached
12 pages dirty
0 pages writeback
2382 pages mapped
136255 pages slab
167 pages pagetables
ib%d: failed to allocate receive buffer 144
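The per-zone free-list lines in the dump above can be sanity-checked mechanically. The Python sketch below (parse_buddy_line and free_blocks_for_order are hypothetical helper names) sums the blocks on a line such as the "Normal:" one and counts how many free blocks are large enough for a given allocation order. It assumes 4 kB pages, so that order 1 means 8 kB of contiguous memory.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as on this i686 kernel

def parse_buddy_line(line):
    """Return (zone, [(count, size_kb), ...], reported_total_kb)."""
    zone, rest = line.split(":", 1)
    blocks = [(int(n), int(kb)) for n, kb in re.findall(r"(\d+)\*(\d+)kB", rest)]
    total = re.search(r"=\s*(\d+)kB", rest)
    return zone.strip(), blocks, int(total.group(1)) if total else None

def free_blocks_for_order(blocks, order):
    """Count free blocks of at least 2**order contiguous pages."""
    need_kb = PAGE_KB << order
    return sum(n for n, kb in blocks if kb >= need_kb)

# Copied from the "Normal:" free-list line in the dump above.
NORMAL = ("Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB "
          "0*512kB 1*1024kB 0*2048kB 0*4096kB = 1232kB")

if __name__ == "__main__":
    zone, blocks, reported = parse_buddy_line(NORMAL)
    computed = sum(n * kb for n, kb in blocks)
    print(zone, "computed", computed, "kB, reported", reported, "kB")
    print("blocks usable for order 1:", free_blocks_for_order(blocks, 1))
```

For the Normal line this reproduces the reported 1232kB and finds four free blocks of 8 kB or more, while the same dump shows the Normal zone's free memory (1368kB) below its min watermark (3744kB). That may point at the watermark rather than fragmentation as the reason the order:1 request failed, but it is only a reading of the dump, not a confirmed diagnosis.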
----------
How to provoke #2 (with ib_ipoib loaded):
non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
This results in an OOM condition, which triggers the OOM killer, which
then starts killing processes.
----------
Any help would be appreciated, as would confirmation from anyone
seeing the same sort of behavior.
Thanks,
John