[ofa-general] OOM problem with ib_ipoib?
John Marshall
John.Marshall at ec.gc.ca
Thu Oct 23 07:01:52 PDT 2008
Is this the right list to be reporting this sort of issue?
Thanks,
John
John Marshall wrote:
> John Marshall wrote:
>> Hi,
>>
>> Summary: I believe the ib_ipoib module is causing an OOM problem. I
>> do not see the problem until the module is loaded. It shows up
>> once the kernel page cache holding file data
>> (grep Cached /proc/meminfo) is maxed out. Normally, pdflush
>> would write the cached data out and the pages would be
>> released. In this case, that does not happen.
>>
>> Notes:
>> 1) The ib interfaces do NOT need to be up, or even in
>> use, for the problem to occur!
>> 2) I am using OFED 1.3.2, which I built on my own
>> machine.
>> 3) I see similar odd behavior with OFED 1.4-rc3
>> and a 2.6.26 kernel.
> An additional item: when rebuilt for the same 2.6.24 kernel
> as mentioned below, but without BIGMEM, I do not encounter
> the same problem.
>>
>> ----------
>>
>> System info:
>>
>> root# lsmod | grep ib
>> ib_ipoib 77512 0
>> ib_cm 33260 1 ib_ipoib
>> ib_sa 36628 2 ib_ipoib,ib_cm
>> ib_mthca 124832 0
>> ib_umad 16232 0
>> ib_uverbs 38792 0
>> ib_mad 35188 4 ib_cm,ib_sa,ib_mthca,ib_umad
>> ib_core 54304 7 ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_umad,ib_uverbs,ib_mad
>> ipv6 242980 29 ib_ipoib
>> libata 145584 1 ata_generic
>> scsi_mod 142316 6 sg,sr_mod,usb_storage,sd_mod,megaraid_sas,libata
>>
>> root# uname -r
>> 2.6.24-etchnhalf.1-686-bigmem
>>
>> root# cat /proc/cpuinfo
>> processor : 0
>> vendor_id : AuthenticAMD
>> cpu family : 15
>> model : 65
>> model name : Dual-Core AMD Opteron(tm) Processor 8220
>> stepping : 3
>> cpu MHz : 2793.163
>> cache size : 1024 KB
>> physical id : 0
>> siblings : 2
>> core id : 0
>> cpu cores : 2
>> fdiv_bug : no
>> hlt_bug : no
>> f00f_bug : no
>> coma_bug : no
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 1
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
>> fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm
>> extapic cr8_legacy ts fid vid ttp tm stc
>> bogomips : 5589.70
>> clflush size : 64
>>
>> ***** 7 more similar entries (2 cpu, 4-core each) ****
>>
>> root# cat /proc/meminfo
>> MemTotal: 33274492 kB
>> MemFree: 147716 kB
>> Buffers: 840 kB
>> Cached: 32532792 kB
>> SwapCached: 0 kB
>> Active: 19956 kB
>> Inactive: 32524692 kB
>> HighTotal: 32635808 kB
>> HighFree: 77008 kB
>> LowTotal: 638684 kB
>> LowFree: 70708 kB
>> SwapTotal: 16386260 kB
>> SwapFree: 16386168 kB
>> Dirty: 88 kB
>> Writeback: 0 kB
>> AnonPages: 11032 kB
>> Mapped: 7940 kB
>> Slab: 537012 kB
>> SReclaimable: 487100 kB
>> SUnreclaim: 49912 kB
>> PageTables: 656 kB
>> NFS_Unstable: 0 kB
>> Bounce: 0 kB
>> CommitLimit: 33023504 kB
>> Committed_AS: 61360 kB
>> VmallocTotal: 118776 kB
>> VmallocUsed: 96800 kB
>> VmallocChunk: 13112 kB
>> HugePages_Total: 0
>> HugePages_Free: 0
>> HugePages_Rsvd: 0
>> HugePages_Surp: 0
>> Hugepagesize: 2048 kB
>>
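To put the meminfo numbers above in perspective, here is a minimal Python sketch (not part of the original report) that parses meminfo-style text and reports how small the low-memory zone is relative to total memory on this bigmem kernel. The names parse_meminfo and lowmem_summary are made up for illustration; the sample values are copied from the output above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip lines that are not "Name: value" pairs
        fields[name.strip()] = int(rest.split()[0])
    return fields


def lowmem_summary(fields):
    """Express low memory and page cache as percentages (hypothetical helper)."""
    mem_total = fields["MemTotal"]
    low_total = fields["LowTotal"]
    return {
        "lowmem_share_pct": round(100.0 * low_total / mem_total, 2),
        "lowmem_free_pct": round(100.0 * fields["LowFree"] / low_total, 2),
        "cached_share_pct": round(100.0 * fields["Cached"] / mem_total, 2),
    }


# Values taken from the /proc/meminfo output quoted above (all in kB).
SAMPLE = """\
MemTotal: 33274492 kB
Cached: 32532792 kB
LowTotal: 638684 kB
LowFree: 70708 kB
"""

if __name__ == "__main__":
    print(lowmem_summary(parse_meminfo(SAMPLE)))
```

With these numbers, low memory is only about 2% of total RAM while nearly all of RAM is page cache, which is at least consistent with the problem going away on a non-BIGMEM kernel.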
>> # dpkg -l |grep ofed
>> ii  libibcm        1.0.2-1   ofed-1.3.2: libibcm
>> ii  libibcommon    1.0.8-1   ofed-1.3.2: libibcommon
>> ii  libibmad       1.1.6-1   ofed-1.3.2: libibmad
>> ii  libibumad      1.1.7-1   ofed-1.3.2: libibumad
>> ii  libibverbs     1.1.1-1   ofed-1.3.2: libibverbs
>> ii  libipathverbs  1.1-1     ofed-1.3.2: libipathverbs
>> ii  libmlx4        1.0-1     ofed-1.3.2: libmlx
>> ii  libmthca       1.0.4-1   ofed-1.3.2: libmthca
>> ii  librdmacm      1.0.7-1   ofed-1.3.2: librdmacm
>> ii  libsdp         1.1.99-1  ofed-1.3.2: libsdp
>> ii  ofa-kernel     1.3.2-2.6.24-etchnhalf.1-686-bigmem-1  ofed-1.3.2: ofa_kernel
>>
>> ----------
>>
>> How to provoke #1 (prior to loading ib_ipoib):
>>
>> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
>>
>> root# modprobe ib_ipoib
>>
>> Output from dmesg:
>>
>> modprobe: page allocation failure. order:1, mode:0x20
>> Pid: 6839, comm: modprobe Not tainted 2.6.24-etchnhalf.1-686-bigmem #1
>> [<c0161904>] __alloc_pages+0x2c4/0x2d5
>> [<c017a05c>] cache_alloc_refill+0x299/0x4b1
>> [<c017a2e9>] __kmalloc+0x75/0xbc
>> [<c025eafb>] __alloc_skb+0x49/0xf5
>> [<f8d4677f>] ipoib_cm_alloc_rx_skb+0x31/0x218 [ib_ipoib]
>> [<f8d48c09>] ipoib_cm_dev_init+0x50c/0x552 [ib_ipoib]
>> [<c0249944>] dma_pool_free+0xb0/0x18c
>> [<f8d45bed>] ipoib_transport_dev_init+0xd2/0x3d1 [ib_ipoib]
>> [<f8d42c6d>] ipoib_ib_dev_init+0x2c/0x6e [ib_ipoib]
>> [<f8d3f7b3>] ipoib_dev_init+0xab/0xd0 [ib_ipoib]
>> [<f8d3f9f8>] ipoib_add_one+0x220/0x3cf [ib_ipoib]
>> [<c011fef8>] resched_task+0x52/0x54
>> [<f89b13e2>] ib_register_client+0x48/0x6c [ib_core]
>> [<f89890d2>] ipoib_init_module+0xd2/0xf8 [ib_ipoib]
>> [<c0145a27>] sys_init_module+0x15e3/0x16fb
>> [<c0166432>] vma_prio_tree_insert+0x17/0x2a
>> [<c017a274>] __kmalloc+0x0/0xbc
>> [<c0103ede>] syscall_call+0x7/0xb
>> =======================
>> Mem-info:
>> DMA per-cpu:
>> CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
>> CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
>> CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
>> CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
>> CPU 4: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
>> CPU 5: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
>> CPU 6: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
>> CPU 7: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
>> Normal per-cpu:
>> CPU 0: Hot: hi: 186, btch: 31 usd: 121 Cold: hi: 62, btch: 15 usd: 58
>> CPU 1: Hot: hi: 186, btch: 31 usd: 42 Cold: hi: 62, btch: 15 usd: 26
>> CPU 2: Hot: hi: 186, btch: 31 usd: 152 Cold: hi: 62, btch: 15 usd: 57
>> CPU 3: Hot: hi: 186, btch: 31 usd: 63 Cold: hi: 62, btch: 15 usd: 59
>> CPU 4: Hot: hi: 186, btch: 31 usd: 72 Cold: hi: 62, btch: 15 usd: 55
>> CPU 5: Hot: hi: 186, btch: 31 usd: 174 Cold: hi: 62, btch: 15 usd: 61
>> CPU 6: Hot: hi: 186, btch: 31 usd: 66 Cold: hi: 62, btch: 15 usd: 48
>> CPU 7: Hot: hi: 186, btch: 31 usd: 35 Cold: hi: 62, btch: 15 usd: 54
>> HighMem per-cpu:
>> CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 9
>> CPU 1: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 5
>> CPU 2: Hot: hi: 186, btch: 31 usd: 93 Cold: hi: 62, btch: 15 usd: 8
>> CPU 3: Hot: hi: 186, btch: 31 usd: 3 Cold: hi: 62, btch: 15 usd: 14
>> CPU 4: Hot: hi: 186, btch: 31 usd: 37 Cold: hi: 62, btch: 15 usd: 53
>> CPU 5: Hot: hi: 186, btch: 31 usd: 67 Cold: hi: 62, btch: 15 usd: 49
>> CPU 6: Hot: hi: 186, btch: 31 usd: 15 Cold: hi: 62, btch: 15 usd: 30
>> CPU 7: Hot: hi: 186, btch: 31 usd: 138 Cold: hi: 62, btch: 15 usd: 61
>> Active:5136 inactive:8135705 dirty:12 writeback:0 unstable:0
>> free:15715 slab:136280 mapped:2348 pagetables:164 bounce:0
>> DMA free:3524kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16256kB pages_scanned:0 all_unreclaimable? yes
>> lowmem_reserve[]: 0 873 34020 34020
>> Normal free:1368kB min:3744kB low:4680kB high:5616kB active:288kB inactive:252kB present:894080kB pages_scanned:32 all_unreclaimable? no
>> lowmem_reserve[]: 0 0 265176 265176
>> HighMem free:59588kB min:512kB low:36080kB high:71652kB active:20256kB inactive:32541032kB present:33942528kB pages_scanned:32 all_unreclaimable? no
>> lowmem_reserve[]: 0 0 0 0
>> DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3496kB
>> Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1232kB
>> HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB 2*1024kB 5*2048kB 11*4096kB = 61120kB
>> Swap cache: add 27, delete 27, find 1/2, race 0+0
>> Free swap = 16386168kB
>> Total swap = 16386260kB
>> Free swap: 16386168kB
>> 8781824 pages of RAM
>> 8552448 pages of HIGHMEM
>> 463201 reserved pages
>> 8140201 pages shared
>> 0 pages swap cached
>> 12 pages dirty
>> 0 pages writeback
>> 2382 pages mapped
>> 136255 pages slab
>> 167 pages pagetables
>> ib%d: failed to allocate receive buffer 144
>>
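The per-zone free lists in the dump above can be checked mechanically. What follows is a rough Python sketch, not something from the original report: the function names are hypothetical, and the 4 kB page size is an assumption based on this being a 32-bit x86 kernel. It totals a zone's free list and counts the free blocks large enough for an order:1 request, i.e. two contiguous pages.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages on this 32-bit x86 kernel


def parse_buddy_line(line):
    """Return {block_size_kb: free_block_count} from a 'N*SIZEkB ...' line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def total_free_kb(blocks):
    """Sum the free memory described by the per-order block counts."""
    return sum(size * count for size, count in blocks.items())


def blocks_for_order(blocks, order):
    """Count free blocks at least as large as an order-`order` request."""
    needed_kb = PAGE_KB << order
    return sum(count for size, count in blocks.items() if size >= needed_kb)


# Normal-zone line copied from the dmesg output above.
NORMAL = ("Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB "
          "1*1024kB 0*2048kB 0*4096kB = 1232kB")

if __name__ == "__main__":
    blocks = parse_buddy_line(NORMAL)
    print("free kB:", total_free_kb(blocks))
    print("blocks usable for order 1:", blocks_for_order(blocks, 1))
```

For the Normal zone, blocks of sufficient size still appear in the free list, while the zone's free memory (1368kB) is already below its min watermark (3744kB). That would point to a watermark problem rather than fragmentation, but this is only my reading of the dump and has not been checked against the allocator code.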
>> ----------
>>
>> How to provoke #2 (with ib_ipoib loaded):
>>
>> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
>>
>> This results in an OOM condition: the OOM killer is triggered and
>> starts killing processes.
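
As a quick sanity check on why this dd run fills the page cache, the sketch below (plain arithmetic in Python, with made-up names, not part of the original report) compares the amount written with MemTotal from the meminfo output. It treats dd's bs=1M as 1 MiB, which is how GNU dd documents the M suffix.

```python
BS_BYTES = 1024 * 1024    # bs=1M, read as 1 MiB (GNU dd convention)
COUNT = 50000             # count=50000 from the dd command above
MEMTOTAL_KB = 33274492    # MemTotal from /proc/meminfo above


def written_kb(bs_bytes, count):
    """Total data written by dd, in kB (1 kB = 1024 bytes, as in meminfo)."""
    return bs_bytes * count // 1024


def ratio_to_ram(bs_bytes, count, memtotal_kb):
    """How many times larger the written file is than total RAM."""
    return written_kb(bs_bytes, count) / memtotal_kb


if __name__ == "__main__":
    print("written kB:", written_kb(BS_BYTES, COUNT))
    print("ratio to RAM: %.2f" % ratio_to_ram(BS_BYTES, COUNT, MEMTOTAL_KB))
```

The file is roughly one and a half times the size of RAM, so the cache cannot hold it all unless dirty pages are written back and reclaimed along the way.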
>>
>> ----------
>>
>> Any help would be appreciated, as would confirmation from anyone
>> who sees the same sort of behavior.
>>
>> Thanks,
>> John
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general