[ofa-general] OOM problem with ib_ipoib?

John Marshall John.Marshall at ec.gc.ca
Thu Oct 23 07:01:52 PDT 2008


Is this the right list to be reporting this sort of issue?

Thanks,
John


John Marshall wrote:
> John Marshall wrote:
>> Hi,
>>
>> Summary: I believe I have been hitting an OOM problem caused by the
>>     ib_ipoib module. I do not see the problem until the module is
>>     loaded. The problem manifests itself when the kernel page cache
>>     (grep Cached /proc/meminfo), which holds file data, is maxed
>>     out. Normally, the cached data should be written out and
>>     released by pdflush. In this case, it is not.
>>
>>     Notes:
>>     1) it is NOT necessary for the ib interfaces to actually
>>     be used or up!
>>     2) I am using ofed 1.3.2 which I have built on my own
>>     machine.
>>     3) I see similarly odd behavior when using ofed 1.4-rc3
>>     and a 2.6.26 kernel.
> An additional item: when rebuilt for the same 2.6.24 kernel
>    as mentioned below, but without BIGMEM, I do not encounter
>    the same problem.
>>
>> ----------
>>
>> System info:
>>
>> root# lsmod | grep ib
>> ib_ipoib               77512  0
>> ib_cm                  33260  1 ib_ipoib
>> ib_sa                  36628  2 ib_ipoib,ib_cm
>> ib_mthca              124832  0
>> ib_umad                16232  0
>> ib_uverbs              38792  0
>> ib_mad                 35188  4 ib_cm,ib_sa,ib_mthca,ib_umad
>> ib_core                54304  7 ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_umad,ib_uverbs,ib_mad
>> ipv6                  242980  29 ib_ipoib
>> libata                145584  1 ata_generic
>> scsi_mod              142316  6 sg,sr_mod,usb_storage,sd_mod,megaraid_sas,libata
>>
>> root# uname -r
>> 2.6.24-etchnhalf.1-686-bigmem
>>
>> root# cat /proc/cpuinfo
>> processor       : 0
>> vendor_id       : AuthenticAMD
>> cpu family      : 15
>> model           : 65
>> model name      : Dual-Core AMD Opteron(tm) Processor 8220
>> stepping        : 3
>> cpu MHz         : 2793.163
>> cache size      : 1024 KB
>> physical id     : 0
>> siblings        : 2
>> core id         : 0
>> cpu cores       : 2
>> fdiv_bug        : no
>> hlt_bug         : no
>> f00f_bug        : no
>> coma_bug        : no
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 1
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy ts fid vid ttp tm stc
>> bogomips        : 5589.70
>> clflush size    : 64
>>
>> ***** 7 more similar entries (2 cpu, 4-core each) ****
>>
>> root# cat /proc/meminfo
>> MemTotal:     33274492 kB
>> MemFree:        147716 kB
>> Buffers:           840 kB
>> Cached:       32532792 kB
>> SwapCached:          0 kB
>> Active:          19956 kB
>> Inactive:     32524692 kB
>> HighTotal:    32635808 kB
>> HighFree:        77008 kB
>> LowTotal:       638684 kB
>> LowFree:         70708 kB
>> SwapTotal:    16386260 kB
>> SwapFree:     16386168 kB
>> Dirty:              88 kB
>> Writeback:           0 kB
>> AnonPages:       11032 kB
>> Mapped:           7940 kB
>> Slab:           537012 kB
>> SReclaimable:   487100 kB
>> SUnreclaim:      49912 kB
>> PageTables:        656 kB
>> NFS_Unstable:        0 kB
>> Bounce:              0 kB
>> CommitLimit:  33023504 kB
>> Committed_AS:    61360 kB
>> VmallocTotal:   118776 kB
>> VmallocUsed:     96800 kB
>> VmallocChunk:    13112 kB
>> HugePages_Total:     0
>> HugePages_Free:      0
>> HugePages_Rsvd:      0
>> HugePages_Surp:      0
>> Hugepagesize:     2048 kB
>>
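For anyone comparing against their own machine, the following is a minimal sketch, assuming Python 3, of how the relevant /proc/meminfo counters above can be turned into ratios. The names parse_meminfo, summarize and SAMPLE are hypothetical, and SAMPLE only copies a few of the fields quoted above; on a kernel without highmem support the LowTotal/LowFree fields may not be present at all.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            fields[name.strip()] = int(rest.split()[0])
    return fields


# A few fields copied from the report above (assumption: values are in kB).
SAMPLE = """\
MemTotal:     33274492 kB
Cached:       32532792 kB
LowTotal:       638684 kB
LowFree:         70708 kB
Dirty:              88 kB
"""


def summarize(fields):
    """Return (share of RAM in page cache, share of low memory still free)."""
    cached_share = fields["Cached"] / fields["MemTotal"]
    low_free_share = fields["LowFree"] / fields["LowTotal"]
    return cached_share, low_free_share


if __name__ == "__main__":
    cached, low_free = summarize(parse_meminfo(SAMPLE))
    print(f"Cached / MemTotal : {cached:.1%}")
    print(f"LowFree / LowTotal: {low_free:.1%}")
```

To run it against a live system, pass the contents of /proc/meminfo to parse_meminfo instead of SAMPLE.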
>> # dpkg -l |grep ofed
>> ii  libibcm        1.0.2-1                                ofed-1.3.2: libibcm
>> ii  libibcommon    1.0.8-1                                ofed-1.3.2: libibcommon
>> ii  libibmad       1.1.6-1                                ofed-1.3.2: libibmad
>> ii  libibumad      1.1.7-1                                ofed-1.3.2: libibumad
>> ii  libibverbs     1.1.1-1                                ofed-1.3.2: libibverbs
>> ii  libipathverbs  1.1-1                                  ofed-1.3.2: libipathverbs
>> ii  libmlx4        1.0-1                                  ofed-1.3.2: libmlx
>> ii  libmthca       1.0.4-1                                ofed-1.3.2: libmthca
>> ii  librdmacm      1.0.7-1                                ofed-1.3.2: librdmacm
>> ii  libsdp         1.1.99-1                               ofed-1.3.2: libsdp
>> ii  ofa-kernel     1.3.2-2.6.24-etchnhalf.1-686-bigmem-1  ofed-1.3.2: ofa_kernel
>>
>> ----------
>>
>> How to provoke #1 (prior to loading ib_ipoib):
>>
>> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
>>
>> root# modprobe ib_ipoib
>>
>> Output from dmesg:
>>
>> modprobe: page allocation failure. order:1, mode:0x20
>> Pid: 6839, comm: modprobe Not tainted 2.6.24-etchnhalf.1-686-bigmem #1
>> [<c0161904>] __alloc_pages+0x2c4/0x2d5
>> [<c017a05c>] cache_alloc_refill+0x299/0x4b1
>> [<c017a2e9>] __kmalloc+0x75/0xbc
>> [<c025eafb>] __alloc_skb+0x49/0xf5
>> [<f8d4677f>] ipoib_cm_alloc_rx_skb+0x31/0x218 [ib_ipoib]
>> [<f8d48c09>] ipoib_cm_dev_init+0x50c/0x552 [ib_ipoib]
>> [<c0249944>] dma_pool_free+0xb0/0x18c
>> [<f8d45bed>] ipoib_transport_dev_init+0xd2/0x3d1 [ib_ipoib]
>> [<f8d42c6d>] ipoib_ib_dev_init+0x2c/0x6e [ib_ipoib]
>> [<f8d3f7b3>] ipoib_dev_init+0xab/0xd0 [ib_ipoib]
>> [<f8d3f9f8>] ipoib_add_one+0x220/0x3cf [ib_ipoib]
>> [<c011fef8>] resched_task+0x52/0x54
>> [<f89b13e2>] ib_register_client+0x48/0x6c [ib_core]
>> [<f89890d2>] ipoib_init_module+0xd2/0xf8 [ib_ipoib]
>> [<c0145a27>] sys_init_module+0x15e3/0x16fb
>> [<c0166432>] vma_prio_tree_insert+0x17/0x2a
>> [<c017a274>] __kmalloc+0x0/0xbc
>> [<c0103ede>] syscall_call+0x7/0xb
>> =======================
>> Mem-info:
>> DMA per-cpu:
>> CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
>> CPU    1: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
>> CPU    2: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
>> CPU    3: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
>> CPU    4: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
>> CPU    5: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
>> CPU    6: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
>> CPU    7: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
>> Normal per-cpu:
>> CPU    0: Hot: hi:  186, btch:  31 usd: 121   Cold: hi:   62, btch:  15 usd:  58
>> CPU    1: Hot: hi:  186, btch:  31 usd:  42   Cold: hi:   62, btch:  15 usd:  26
>> CPU    2: Hot: hi:  186, btch:  31 usd: 152   Cold: hi:   62, btch:  15 usd:  57
>> CPU    3: Hot: hi:  186, btch:  31 usd:  63   Cold: hi:   62, btch:  15 usd:  59
>> CPU    4: Hot: hi:  186, btch:  31 usd:  72   Cold: hi:   62, btch:  15 usd:  55
>> CPU    5: Hot: hi:  186, btch:  31 usd: 174   Cold: hi:   62, btch:  15 usd:  61
>> CPU    6: Hot: hi:  186, btch:  31 usd:  66   Cold: hi:   62, btch:  15 usd:  48
>> CPU    7: Hot: hi:  186, btch:  31 usd:  35   Cold: hi:   62, btch:  15 usd:  54
>> HighMem per-cpu:
>> CPU    0: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:   9
>> CPU    1: Hot: hi:  186, btch:  31 usd:  30   Cold: hi:   62, btch:  15 usd:   5
>> CPU    2: Hot: hi:  186, btch:  31 usd:  93   Cold: hi:   62, btch:  15 usd:   8
>> CPU    3: Hot: hi:  186, btch:  31 usd:   3   Cold: hi:   62, btch:  15 usd:  14
>> CPU    4: Hot: hi:  186, btch:  31 usd:  37   Cold: hi:   62, btch:  15 usd:  53
>> CPU    5: Hot: hi:  186, btch:  31 usd:  67   Cold: hi:   62, btch:  15 usd:  49
>> CPU    6: Hot: hi:  186, btch:  31 usd:  15   Cold: hi:   62, btch:  15 usd:  30
>> CPU    7: Hot: hi:  186, btch:  31 usd: 138   Cold: hi:   62, btch:  15 usd:  61
>> Active:5136 inactive:8135705 dirty:12 writeback:0 unstable:0
>> free:15715 slab:136280 mapped:2348 pagetables:164 bounce:0
>> DMA free:3524kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16256kB pages_scanned:0 all_unreclaimable? yes
>> lowmem_reserve[]: 0 873 34020 34020
>> Normal free:1368kB min:3744kB low:4680kB high:5616kB active:288kB inactive:252kB present:894080kB pages_scanned:32 all_unreclaimable? no
>> lowmem_reserve[]: 0 0 265176 265176
>> HighMem free:59588kB min:512kB low:36080kB high:71652kB active:20256kB inactive:32541032kB present:33942528kB pages_scanned:32 all_unreclaimable? no
>> lowmem_reserve[]: 0 0 0 0
>> DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3496kB
>> Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1232kB
>> HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB 2*1024kB 5*2048kB 11*4096kB = 61120kB
>> Swap cache: add 27, delete 27, find 1/2, race 0+0
>> Free swap  = 16386168kB
>> Total swap = 16386260kB
>> Free swap:       16386168kB
>> 8781824 pages of RAM
>> 8552448 pages of HIGHMEM
>> 463201 reserved pages
>> 8140201 pages shared
>> 0 pages swap cached
>> 12 pages dirty
>> 0 pages writeback
>> 2382 pages mapped
>> 136255 pages slab
>> 167 pages pagetables
>> ib%d: failed to allocate receive buffer 144
>>
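The per-zone free-block lines in the dmesg output above ("Normal: 0*4kB 0*8kB ... = 1232kB") can be cross-checked with a few lines of script. This is only a sketch, assuming Python 3; buddy_total_kb and blocks_at_least are hypothetical helper names. It also assumes a 4 kB page size, so that the failing order:1 request corresponds to one contiguous 8 kB block.

```python
import re

# Matches terms such as "5*64kB"; the trailing "= 1232kB" has no "*" and is skipped.
TERM = re.compile(r"(\d+)\*(\d+)kB")


def buddy_total_kb(line):
    """Sum count * block size over every term of one free-block line."""
    return sum(int(count) * int(size) for count, size in TERM.findall(line))


def blocks_at_least(line, min_kb):
    """Count free blocks whose size is at least min_kb."""
    return sum(int(count) for count, size in TERM.findall(line) if int(size) >= min_kb)


# Lines copied from the dmesg output quoted above.
DMA = ("DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB "
       "0*1024kB 1*2048kB 0*4096kB = 3496kB")
NORMAL = ("Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB "
          "1*1024kB 0*2048kB 0*4096kB = 1232kB")
HIGHMEM = ("HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB "
           "2*1024kB 5*2048kB 11*4096kB = 61120kB")

if __name__ == "__main__":
    for line in (DMA, NORMAL, HIGHMEM):
        zone = line.split(":")[0]
        # 8 kB = order 1 under the 4 kB page size assumption.
        print(zone, buddy_total_kb(line), "kB free,",
              blocks_at_least(line, 8), "blocks of 8 kB or larger")
```

The computed sums agree with the totals printed by the kernel for all three zones. Comparing the Normal figure with the min:3744kB watermark on the "Normal free:" line shows how little low memory was left when the allocation failed.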
>> ----------
>>
>> How to provoke #2 (with ib_ipoib loaded):
>>
>> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000
>>
>> This results in an OOM condition that triggers the OOM killer, which
>> then starts killing processes.
>>
>> ----------
>>
>> Any help would be appreciated, as well as confirmation of the same
>> sort of behavior.
>>
>> Thanks,
>> John
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit 
>> http://openib.org/mailman/listinfo/openib-general
