[ofa-general] OOM problem with ib_ipoib?

John Marshall John.Marshall at ec.gc.ca
Wed Oct 22 15:16:13 PDT 2008


John Marshall wrote:
> Hi,
>
> Summary: I believe I have been having an OOM problem caused by the
>     ib_ipoib module. I do not see the problem until it is
>     loaded. The problem manifests itself when the kernel cache
>     (grep Cached /proc/meminfo) containing file data is maxed
>     out. Normally, the cached data should be written out and
>     released by pdflush. In this case, it is not.
>
>     Notes:
>     1) it is NOT necessary for the ib interfaces to actually
>     be used or up!
>     2) I am using OFED 1.3.2, which I built on my own
>     machine.
>     3) I see similar odd behavior with OFED 1.4-rc3
>     and a 2.6.26 kernel.
An additional item: when rebuilt for the same 2.6.24 kernel
    as mentioned below, but without BIGMEM, I do not encounter
    the same problem.
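The symptom in the summary can be watched directly instead of with a one-off "grep Cached". Below is a minimal Python sketch, assuming the 2.6-era /proc/meminfo "Field:  value kB" format. It polls Cached, Dirty and LowFree so you can see whether the page cache keeps growing while low memory shrinks. LowFree only exists on HIGHMEM kernels such as this 686-bigmem one. The function names are hypothetical, not part of any existing tool.

```python
import time

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int} (most values are in kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])  # drop the trailing "kB" unit
    return info

def watch(fields=("Cached", "Dirty", "LowFree"), interval=5.0, samples=10):
    """Print the chosen /proc/meminfo fields `samples` times, `interval` seconds apart."""
    for i in range(samples):
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print("  ".join(f"{name}={info.get(name, 'n/a')}" for name in fields))
        if i + 1 < samples:
            time.sleep(interval)
```

For example, call watch(samples=60) in one shell while the dd command from the reproduction steps runs in another.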
>
> ----------
>
> System info:
>
> root# lsmod | grep ib
> ib_ipoib               77512  0
> ib_cm                  33260  1 ib_ipoib
> ib_sa                  36628  2 ib_ipoib,ib_cm
> ib_mthca              124832  0
> ib_umad                16232  0
> ib_uverbs              38792  0
> ib_mad                 35188  4 ib_cm,ib_sa,ib_mthca,ib_umad
> ib_core                54304  7 ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_umad,ib_uverbs,ib_mad
> ipv6                  242980  29 ib_ipoib
> libata                145584  1 ata_generic
> scsi_mod              142316  6 sg,sr_mod,usb_storage,sd_mod,megaraid_sas,libata
>
> root# uname -r
> 2.6.24-etchnhalf.1-686-bigmem
>
> root# cat /proc/cpuinfo
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 15
> model           : 65
> model name      : Dual-Core AMD Opteron(tm) Processor 8220
> stepping        : 3
> cpu MHz         : 2793.163
> cache size      : 1024 KB
> physical id     : 0
> siblings        : 2
> core id         : 0
> cpu cores       : 2
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 1
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext 
> fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm 
> extapic cr8_legacy ts fid vid ttp tm stc
> bogomips        : 5589.70
> clflush size    : 64
>
> ***** 7 more similar entries (2 cpu, 4-core each) ****
>
> root# cat /proc/meminfo
> MemTotal:     33274492 kB
> MemFree:        147716 kB
> Buffers:           840 kB
> Cached:       32532792 kB
> SwapCached:          0 kB
> Active:          19956 kB
> Inactive:     32524692 kB
> HighTotal:    32635808 kB
> HighFree:        77008 kB
> LowTotal:       638684 kB
> LowFree:         70708 kB
> SwapTotal:    16386260 kB
> SwapFree:     16386168 kB
> Dirty:              88 kB
> Writeback:           0 kB
> AnonPages:       11032 kB
> Mapped:           7940 kB
> Slab:           537012 kB
> SReclaimable:   487100 kB
> SUnreclaim:      49912 kB
> PageTables:        656 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:  33023504 kB
> Committed_AS:    61360 kB
> VmallocTotal:   118776 kB
> VmallocUsed:     96800 kB
> VmallocChunk:    13112 kB
> HugePages_Total:     0
> HugePages_Free:      0
> HugePages_Rsvd:      0
> HugePages_Surp:      0
> Hugepagesize:     2048 kB
>
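The snapshot above shows that on this 32-bit bigmem kernel almost all RAM is highmem: HighTotal plus LowTotal equals MemTotal, and LowTotal is only about 624 MiB, under 2% of RAM. The minimal Python sketch below checks that arithmetic using the values copied from the output above; the helper name lowmem_summary is hypothetical.

```python
def lowmem_summary(info):
    """Summarize the highmem/lowmem split from meminfo values (all in kB)."""
    total = info["MemTotal"]
    low = info["LowTotal"]
    high = info["HighTotal"]
    return {
        "split_matches": low + high == total,                 # every page is either low or high
        "low_share_pct": 100.0 * low / total,                 # share of RAM in the low zones
        "low_free_pct": 100.0 * info["LowFree"] / low,        # how much of low memory is free
        "cached_pct": 100.0 * info["Cached"] / total,         # page cache as a share of RAM
    }

# Values copied from the /proc/meminfo output above (kB).
snapshot = {
    "MemTotal": 33274492, "Cached": 32532792,
    "HighTotal": 32635808, "LowTotal": 638684, "LowFree": 70708,
}

for key, value in lowmem_summary(snapshot).items():
    print(key, round(value, 1) if isinstance(value, float) else value)
```

This prints split_matches True, low_share_pct 1.9, low_free_pct 11.1 and cached_pct 97.8.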
> # dpkg -l |grep ofed
> ii  libibcm        1.0.2-1                                ofed-1.3.2: libibcm
> ii  libibcommon    1.0.8-1                                ofed-1.3.2: libibcommon
> ii  libibmad       1.1.6-1                                ofed-1.3.2: libibmad
> ii  libibumad      1.1.7-1                                ofed-1.3.2: libibumad
> ii  libibverbs     1.1.1-1                                ofed-1.3.2: libibverbs
> ii  libipathverbs  1.1-1                                  ofed-1.3.2: libipathverbs
> ii  libmlx4        1.0-1                                  ofed-1.3.2: libmlx
> ii  libmthca       1.0.4-1                                ofed-1.3.2: libmthca
> ii  librdmacm      1.0.7-1                                ofed-1.3.2: librdmacm
> ii  libsdp         1.1.99-1                               ofed-1.3.2: libsdp
> ii  ofa-kernel     1.3.2-2.6.24-etchnhalf.1-686-bigmem-1  ofed-1.3.2: ofa_kernel
>
> ----------
>
> How to provoke #1 (prior to loading ib_ipoib):
>
> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
>
> root# modprobe ib_ipoib
>
> Output from dmesg:
>
> modprobe: page allocation failure. order:1, mode:0x20
> Pid: 6839, comm: modprobe Not tainted 2.6.24-etchnhalf.1-686-bigmem #1
> [<c0161904>] __alloc_pages+0x2c4/0x2d5
> [<c017a05c>] cache_alloc_refill+0x299/0x4b1
> [<c017a2e9>] __kmalloc+0x75/0xbc
> [<c025eafb>] __alloc_skb+0x49/0xf5
> [<f8d4677f>] ipoib_cm_alloc_rx_skb+0x31/0x218 [ib_ipoib]
> [<f8d48c09>] ipoib_cm_dev_init+0x50c/0x552 [ib_ipoib]
> [<c0249944>] dma_pool_free+0xb0/0x18c
> [<f8d45bed>] ipoib_transport_dev_init+0xd2/0x3d1 [ib_ipoib]
> [<f8d42c6d>] ipoib_ib_dev_init+0x2c/0x6e [ib_ipoib]
> [<f8d3f7b3>] ipoib_dev_init+0xab/0xd0 [ib_ipoib]
> [<f8d3f9f8>] ipoib_add_one+0x220/0x3cf [ib_ipoib]
> [<c011fef8>] resched_task+0x52/0x54
> [<f89b13e2>] ib_register_client+0x48/0x6c [ib_core]
> [<f89890d2>] ipoib_init_module+0xd2/0xf8 [ib_ipoib]
> [<c0145a27>] sys_init_module+0x15e3/0x16fb
> [<c0166432>] vma_prio_tree_insert+0x17/0x2a
> [<c017a274>] __kmalloc+0x0/0xbc
> [<c0103ede>] syscall_call+0x7/0xb
> =======================
> Mem-info:
> DMA per-cpu:
> CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
> CPU    1: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
> CPU    2: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
> CPU    3: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
> CPU    4: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
> CPU    5: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
> CPU    6: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
> CPU    7: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
> Normal per-cpu:
> CPU    0: Hot: hi:  186, btch:  31 usd: 121   Cold: hi:   62, btch:  15 usd:  58
> CPU    1: Hot: hi:  186, btch:  31 usd:  42   Cold: hi:   62, btch:  15 usd:  26
> CPU    2: Hot: hi:  186, btch:  31 usd: 152   Cold: hi:   62, btch:  15 usd:  57
> CPU    3: Hot: hi:  186, btch:  31 usd:  63   Cold: hi:   62, btch:  15 usd:  59
> CPU    4: Hot: hi:  186, btch:  31 usd:  72   Cold: hi:   62, btch:  15 usd:  55
> CPU    5: Hot: hi:  186, btch:  31 usd: 174   Cold: hi:   62, btch:  15 usd:  61
> CPU    6: Hot: hi:  186, btch:  31 usd:  66   Cold: hi:   62, btch:  15 usd:  48
> CPU    7: Hot: hi:  186, btch:  31 usd:  35   Cold: hi:   62, btch:  15 usd:  54
> HighMem per-cpu:
> CPU    0: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:   9
> CPU    1: Hot: hi:  186, btch:  31 usd:  30   Cold: hi:   62, btch:  15 usd:   5
> CPU    2: Hot: hi:  186, btch:  31 usd:  93   Cold: hi:   62, btch:  15 usd:   8
> CPU    3: Hot: hi:  186, btch:  31 usd:   3   Cold: hi:   62, btch:  15 usd:  14
> CPU    4: Hot: hi:  186, btch:  31 usd:  37   Cold: hi:   62, btch:  15 usd:  53
> CPU    5: Hot: hi:  186, btch:  31 usd:  67   Cold: hi:   62, btch:  15 usd:  49
> CPU    6: Hot: hi:  186, btch:  31 usd:  15   Cold: hi:   62, btch:  15 usd:  30
> CPU    7: Hot: hi:  186, btch:  31 usd: 138   Cold: hi:   62, btch:  15 usd:  61
> Active:5136 inactive:8135705 dirty:12 writeback:0 unstable:0
> free:15715 slab:136280 mapped:2348 pagetables:164 bounce:0
> DMA free:3524kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16256kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 873 34020 34020
> Normal free:1368kB min:3744kB low:4680kB high:5616kB active:288kB inactive:252kB present:894080kB pages_scanned:32 all_unreclaimable? no
> lowmem_reserve[]: 0 0 265176 265176
> HighMem free:59588kB min:512kB low:36080kB high:71652kB active:20256kB inactive:32541032kB present:33942528kB pages_scanned:32 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3496kB
> Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1232kB
> HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB 2*1024kB 5*2048kB 11*4096kB = 61120kB
> Swap cache: add 27, delete 27, find 1/2, race 0+0
> Free swap  = 16386168kB
> Total swap = 16386260kB
> Free swap:       16386168kB
> 8781824 pages of RAM
> 8552448 pages of HIGHMEM
> 463201 reserved pages
> 8140201 pages shared
> 0 pages swap cached
> 12 pages dirty
> 0 pages writeback
> 2382 pages mapped
> 136255 pages slab
> 167 pages pagetables
> ib%d: failed to allocate receive buffer 144
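The failure line reads "order:1, mode:0x20". An order-n request asks the buddy allocator for 2^n physically contiguous pages, so order 1 is 8 kB with 4 kB pages. The minimal Python sketch below decodes both fields. The flag values are an assumption based on 2.6.24-era include/linux/gfp.h, where GFP_ATOMIC is just __GFP_HIGH (0x20); check them against the actual kernel tree. If that holds, neither __GFP_WAIT nor __GFP_HIGHMEM is set, so this skb allocation can neither sleep to reclaim memory nor be served from the HighMem zone.

```python
PAGE_SIZE_KB = 4  # x86 base page size

# Low-order gfp flag bits. Values ASSUMED from 2.6.24-era include/linux/gfp.h;
# verify against your kernel source before relying on them.
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}

def decode_failure(order, mode):
    """Return (request size in kB, names of the set flag bits) for an allocation failure."""
    size_kb = (1 << order) * PAGE_SIZE_KB  # 2**order contiguous pages
    flags = [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]
    return size_kb, flags

size_kb, flags = decode_failure(order=1, mode=0x20)
print(size_kb, flags)                                # 8 ['__GFP_HIGH']
print("may sleep/reclaim:", "__GFP_WAIT" in flags)   # False
print("may use HighMem:", "__GFP_HIGHMEM" in flags)  # False
```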
>
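The zone report suggests why that 8 kB request failed even though blocks of that size exist. The Normal zone's buddy lists still hold 16 kB, 64 kB, 128 kB and 1024 kB blocks, so fragmentation is not the obvious cause. However, the zone's free total (1368 kB) is below its min watermark (3744 kB), the level at which the allocator starts refusing requests. The Python sketch below parses the buddy summary and zone lines (joined onto single lines) and makes that comparison. It is a simplified check: the kernel's real watermark test also relaxes min for atomic and high-priority requests and adds the lowmem_reserve values. The function names are hypothetical.

```python
import re

BUDDY_RE = re.compile(r"(\d+)\*(\d+)kB")  # matches "count*sizekB" pairs

def buddy_total_kb(line):
    """Sum a buddy summary such as 'Normal: 0*4kB 1*16kB ... = 1232kB'."""
    return sum(int(count) * int(size) for count, size in BUDDY_RE.findall(line))

def largest_free_block_kb(line):
    """Largest block size that still has at least one free block (0 if none)."""
    sizes = [int(size) for count, size in BUDDY_RE.findall(line) if int(count) > 0]
    return max(sizes, default=0)

def below_min(zone_line):
    """Simplified check: is the zone's free: value under its min: watermark?"""
    free = int(re.search(r"free:(\d+)kB", zone_line).group(1))
    min_mark = int(re.search(r"min:(\d+)kB", zone_line).group(1))
    return free < min_mark

normal_buddy = ("Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB "
                "0*512kB 1*1024kB 0*2048kB 0*4096kB = 1232kB")
normal_zone = "Normal free:1368kB min:3744kB low:4680kB high:5616kB"
print(buddy_total_kb(normal_buddy))         # 1232
print(largest_free_block_kb(normal_buddy))  # 1024
print(below_min(normal_zone))               # True
```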
> ----------
>
> How to provoke #2 (with ib_ipoib loaded):
>
> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
>
> This results in an OOM condition: the OOM killer is triggered and starts
> killing processes.
>
> ----------
>
> Any help would be appreciated, as would confirmation from anyone seeing
> the same sort of behavior.
>
> Thanks,
> John
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
