[ofa-general] OOM problem with ib_ipoib?
John Marshall
John.Marshall at ec.gc.ca
Wed Oct 22 15:16:13 PDT 2008
John Marshall wrote:
> Hi,
>
> Summary: I believe the ib_ipoib module is causing an OOM problem; I do
> not see the problem until the module is loaded. It manifests itself
> when the kernel cache (grep Cached /proc/meminfo) holding file data
> is maxed out. Normally pdflush should write out the cached data and
> release it. In this case, it does not.
>
> Notes:
> 1) it is NOT necessary for the ib interfaces to actually
> be used or up!
> 2) I am using OFED 1.3.2, which I built on my own
> machine.
> 3) I see similar odd behavior with OFED 1.4-rc3
> and a 2.6.26 kernel.
An additional item: when OFED is rebuilt for the same 2.6.24 kernel
mentioned below, but without BIGMEM, I do not encounter the problem.
>
> ----------
>
> System info:
>
> root# lsmod | grep ib
> ib_ipoib 77512 0
> ib_cm 33260 1 ib_ipoib
> ib_sa 36628 2 ib_ipoib,ib_cm
> ib_mthca 124832 0
> ib_umad 16232 0
> ib_uverbs 38792 0
> ib_mad 35188 4 ib_cm,ib_sa,ib_mthca,ib_umad
> ib_core 54304 7 ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_umad,ib_uverbs,ib_mad
> ipv6 242980 29 ib_ipoib
> libata 145584 1 ata_generic
> scsi_mod 142316 6 sg,sr_mod,usb_storage,sd_mod,megaraid_sas,libata
>
> root# uname -r
> 2.6.24-etchnhalf.1-686-bigmem
>
> root# cat /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 65
> model name : Dual-Core AMD Opteron(tm) Processor 8220
> stepping : 3
> cpu MHz : 2793.163
> cache size : 1024 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 2
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 1
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
> fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm
> extapic cr8_legacy ts fid vid ttp tm stc
> bogomips : 5589.70
> clflush size : 64
>
> ***** 7 more similar entries (2 cpu, 4-core each) ****
>
> root# cat /proc/meminfo
> MemTotal: 33274492 kB
> MemFree: 147716 kB
> Buffers: 840 kB
> Cached: 32532792 kB
> SwapCached: 0 kB
> Active: 19956 kB
> Inactive: 32524692 kB
> HighTotal: 32635808 kB
> HighFree: 77008 kB
> LowTotal: 638684 kB
> LowFree: 70708 kB
> SwapTotal: 16386260 kB
> SwapFree: 16386168 kB
> Dirty: 88 kB
> Writeback: 0 kB
> AnonPages: 11032 kB
> Mapped: 7940 kB
> Slab: 537012 kB
> SReclaimable: 487100 kB
> SUnreclaim: 49912 kB
> PageTables: 656 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> CommitLimit: 33023504 kB
> Committed_AS: 61360 kB
> VmallocTotal: 118776 kB
> VmallocUsed: 96800 kB
> VmallocChunk: 13112 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
>
> # dpkg -l |grep ofed
> ii  libibcm        1.0.2-1   ofed-1.3.2: libibcm
> ii  libibcommon    1.0.8-1   ofed-1.3.2: libibcommon
> ii  libibmad       1.1.6-1   ofed-1.3.2: libibmad
> ii  libibumad      1.1.7-1   ofed-1.3.2: libibumad
> ii  libibverbs     1.1.1-1   ofed-1.3.2: libibverbs
> ii  libipathverbs  1.1-1     ofed-1.3.2: libipathverbs
> ii  libmlx4        1.0-1     ofed-1.3.2: libmlx
> ii  libmthca       1.0.4-1   ofed-1.3.2: libmthca
> ii  librdmacm      1.0.7-1   ofed-1.3.2: librdmacm
> ii  libsdp         1.1.99-1  ofed-1.3.2: libsdp
> ii  ofa-kernel     1.3.2-2.6.24-etchnhalf.1-686-bigmem-1  ofed-1.3.2: ofa_kernel
>
> ----------
>
> How to provoke #1 (prior to loading ib_ipoib):
>
> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
>
> root# modprobe ib_ipoib
>
> Output from dmesg:
>
> modprobe: page allocation failure. order:1, mode:0x20
> Pid: 6839, comm: modprobe Not tainted 2.6.24-etchnhalf.1-686-bigmem #1
> [<c0161904>] __alloc_pages+0x2c4/0x2d5
> [<c017a05c>] cache_alloc_refill+0x299/0x4b1
> [<c017a2e9>] __kmalloc+0x75/0xbc
> [<c025eafb>] __alloc_skb+0x49/0xf5
> [<f8d4677f>] ipoib_cm_alloc_rx_skb+0x31/0x218 [ib_ipoib]
> [<f8d48c09>] ipoib_cm_dev_init+0x50c/0x552 [ib_ipoib]
> [<c0249944>] dma_pool_free+0xb0/0x18c
> [<f8d45bed>] ipoib_transport_dev_init+0xd2/0x3d1 [ib_ipoib]
> [<f8d42c6d>] ipoib_ib_dev_init+0x2c/0x6e [ib_ipoib]
> [<f8d3f7b3>] ipoib_dev_init+0xab/0xd0 [ib_ipoib]
> [<f8d3f9f8>] ipoib_add_one+0x220/0x3cf [ib_ipoib]
> [<c011fef8>] resched_task+0x52/0x54
> [<f89b13e2>] ib_register_client+0x48/0x6c [ib_core]
> [<f89890d2>] ipoib_init_module+0xd2/0xf8 [ib_ipoib]
> [<c0145a27>] sys_init_module+0x15e3/0x16fb
> [<c0166432>] vma_prio_tree_insert+0x17/0x2a
> [<c017a274>] __kmalloc+0x0/0xbc
> [<c0103ede>] syscall_call+0x7/0xb
> =======================
> Mem-info:
> DMA per-cpu:
> CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 4: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 5: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 6: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 7: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> Normal per-cpu:
> CPU 0: Hot: hi: 186, btch: 31 usd: 121 Cold: hi: 62, btch: 15 usd: 58
> CPU 1: Hot: hi: 186, btch: 31 usd: 42 Cold: hi: 62, btch: 15 usd: 26
> CPU 2: Hot: hi: 186, btch: 31 usd: 152 Cold: hi: 62, btch: 15 usd: 57
> CPU 3: Hot: hi: 186, btch: 31 usd: 63 Cold: hi: 62, btch: 15 usd: 59
> CPU 4: Hot: hi: 186, btch: 31 usd: 72 Cold: hi: 62, btch: 15 usd: 55
> CPU 5: Hot: hi: 186, btch: 31 usd: 174 Cold: hi: 62, btch: 15 usd: 61
> CPU 6: Hot: hi: 186, btch: 31 usd: 66 Cold: hi: 62, btch: 15 usd: 48
> CPU 7: Hot: hi: 186, btch: 31 usd: 35 Cold: hi: 62, btch: 15 usd: 54
> HighMem per-cpu:
> CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 9
> CPU 1: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 5
> CPU 2: Hot: hi: 186, btch: 31 usd: 93 Cold: hi: 62, btch: 15 usd: 8
> CPU 3: Hot: hi: 186, btch: 31 usd: 3 Cold: hi: 62, btch: 15 usd: 14
> CPU 4: Hot: hi: 186, btch: 31 usd: 37 Cold: hi: 62, btch: 15 usd: 53
> CPU 5: Hot: hi: 186, btch: 31 usd: 67 Cold: hi: 62, btch: 15 usd: 49
> CPU 6: Hot: hi: 186, btch: 31 usd: 15 Cold: hi: 62, btch: 15 usd: 30
> CPU 7: Hot: hi: 186, btch: 31 usd: 138 Cold: hi: 62, btch: 15 usd: 61
> Active:5136 inactive:8135705 dirty:12 writeback:0 unstable:0
> free:15715 slab:136280 mapped:2348 pagetables:164 bounce:0
> DMA free:3524kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16256kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 873 34020 34020
> Normal free:1368kB min:3744kB low:4680kB high:5616kB active:288kB inactive:252kB present:894080kB pages_scanned:32 all_unreclaimable? no
> lowmem_reserve[]: 0 0 265176 265176
> HighMem free:59588kB min:512kB low:36080kB high:71652kB active:20256kB inactive:32541032kB present:33942528kB pages_scanned:32 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3496kB
> Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1232kB
> HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB 2*1024kB 5*2048kB 11*4096kB = 61120kB
> Swap cache: add 27, delete 27, find 1/2, race 0+0
> Free swap = 16386168kB
> Total swap = 16386260kB
> Free swap: 16386168kB
> 8781824 pages of RAM
> 8552448 pages of HIGHMEM
> 463201 reserved pages
> 8140201 pages shared
> 0 pages swap cached
> 12 pages dirty
> 0 pages writeback
> 2382 pages mapped
> 136255 pages slab
> 167 pages pagetables
> ib%d: failed to allocate receive buffer 144
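
A note on reading the per-zone free lists in the dump above: each
"N*SkB" term is N free blocks of S kB, and the failed request was
order:1, i.e. two contiguous 4 kB pages (8 kB). Below is a minimal
sketch for totalling what a zone holds in blocks of at least a given
size. The helper name buddy_free_kb is hypothetical, and it assumes the
line format shown in this dmesg output.

```shell
#!/bin/sh
# buddy_free_kb MIN_KB
# Hypothetical helper: reads one buddy-list line from dmesg on stdin
# (wrapped or "> "-quoted input is fine) and prints the total kB held
# in free blocks of at least MIN_KB each.
buddy_free_kb() {
    # One token per line, then keep only tokens shaped like "3*256kB".
    tr ' ' '\n' | awk -F'[*k]' -v min="$1" '
        /^[0-9]+\*[0-9]+kB$/ && $2 + 0 >= min + 0 { sum += $1 * $2 }
        END { print sum + 0 }
    '
}
```

Piping the "Normal:" line into `buddy_free_kb 8` sums only the blocks
large enough for an order:1 request.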
>
> ----------
>
> How to provoke #2 (with ib_ipoib loaded):
>
> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
>
> This results in an OOM condition: the OOM killer is triggered and
> starts killing processes.
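
To watch low memory while the dd runs, something like the following can
be used from another terminal. It is only a sketch: the function name
lowmem_summary is made up, and it assumes the LowTotal/LowFree fields
that a HIGHMEM kernel exposes in /proc/meminfo (as in the output above).

```shell
#!/bin/sh
# lowmem_summary [FILE]
# Hypothetical helper: prints low-memory, page-cache and dirty figures
# from a meminfo-style file (defaults to /proc/meminfo).
lowmem_summary() {
    awk '
        $1 == "LowTotal:" { total  = $2 }
        $1 == "LowFree:"  { free   = $2 }
        $1 == "Cached:"   { cached = $2 }
        $1 == "Dirty:"    { dirty  = $2 }
        END {
            if (total == "") {
                # No LowTotal line, e.g. a kernel built without HIGHMEM.
                print "no LowTotal/LowFree fields found"
                exit 0
            }
            printf "LowFree %d/%d kB (%.1f%%) Cached %d kB Dirty %d kB\n",
                   free, total, 100 * free / total, cached, dirty
        }
    ' "${1:-/proc/meminfo}"
}
```

Run it in a loop during the test, e.g.
"while sleep 5; do lowmem_summary; done", and stop it with Ctrl-C.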
>
> ----------
>
> Any help would be appreciated, as well as confirmation of the same
> sort of behavior.
>
> Thanks,
> John