[openib-general] ibv_reg_mr fails: bad address [was: Re: Problem with mca_mpool_openib_register - Cannot allocate memory]

Bernhard Fischer rep.nop at aon.at
Thu Jul 13 11:51:16 PDT 2006


On Mon, Jun 19, 2006 at 11:11:12AM -0400, Bill Wichser wrote:
>Running the openib stack from Redhat on a 2.6.9-34.ELsmp kernel, dual 
>Xeon.  Running with openmpi v1.0.2 compiled w/gcc.
>
>While we still have the problem with btl_openib_endpoint.c returning  0 
>byte(s) for max inline data, and realize that another IB stack addresses 
>this, another problem when running across more than a single host pops 
>up generating huge amounts of error messages.
>
>The errors go something like this:
>
>mca_mpool_openib_register: ibv_reg_mr(0x2ac2622000,1052672) failed with 
>error: Cannot allocate memory
>[0,1,1][btl_openib.c:496:mca_btl_openib_prepare_dst] 
>mpool_register(0x2ac2622040,1048576) failed: base 0x2ac2222040 lb 0 
>offset 4194304

While a locked-memory limit of 8MB (as Bill states below) is likely to fail, I'm getting:

[x86-64n001:07622] mca_mpool_openib_register:
ibv_reg_mr(0x717f2000,2113536) failed with error: Bad address
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x717f1ff8
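
For reference, the address/length pairs in Bill's log are consistent with the buffer being widened to whole pages before ibv_reg_mr is called. A minimal Python sketch of that arithmetic, assuming 4096-byte pages (the helper name page_align is made up for illustration and is not part of Open MPI or libibverbs):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on typical x86-64 Linux


def page_align(addr, length, page_size=PAGE_SIZE):
    """Widen [addr, addr+length) to page boundaries; return (start, size)."""
    start = addr - (addr % page_size)               # round the base down
    end = addr + length
    end_aligned = -(-end // page_size) * page_size  # round the end up
    return start, end_aligned - start


# Values from Bill's log: mpool_register(0x2ac2622040, 1048576)
start, size = page_align(0x2ac2622040, 1048576)
print(hex(start), size)  # 0x2ac2622000 1052672
```

So a 1MB buffer that is not page-aligned costs 1MB plus one page of locked memory. The 2113536-byte request in the "Bad address" log above is likewise a whole number of 4096-byte pages (516).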

Two boxes are involved, each with 2GB of memory, and my ulimits are OK:
$ ulimit -l;rsh 10.100.0.44 "ulimit -l"
unlimited
unlimited
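
Note that ulimit -l only shows the limit of the shell it is run in. A process started by a daemon or launcher may see a different value. As a rough cross-check, a small sketch using Python's standard resource module can be run the same way the MPI processes are started (the helper name format_memlock is hypothetical):

```python
import resource


def format_memlock(limit_bytes):
    """Render an RLIMIT_MEMLOCK value in the units ulimit -l uses (kB)."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes // 1024} kB"


# getrlimit returns (soft, hard) in bytes for the calling process
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("soft:", format_memlock(soft), "hard:", format_memlock(hard))
```

If this prints something other than "unlimited" when launched through rsh or the MPI launcher, the limit seen by the job differs from the interactive one.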

Any hint on this one?
TIA,
Bernhard

$ cat /sys/class/infiniband/mthca0/fw_ver 
4.7.400


# lspci -vvxxx -s 01:00.0
01:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex HCA (Tavor compatibility mode) (rev a0)
        Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex HCA (Tavor compatibility mode)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size 10
        Interrupt: pin A routed to IRQ 177
        Region 0: Memory at 0000000090200000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at 0000000088000000 (64-bit, prefetchable) [size=8M]
        Region 4: Memory at 0000000080000000 (64-bit, prefetchable) [size=128M]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
        Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [84] MSI-X: Enable+ Mask- TabSize=32
                Vector table: BAR=0 offset=00082000
                PBA: BAR=0 offset=00082200
        Capabilities: [60] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s <64ns, L1 unlimited
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
                Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
                Link: Latency L0s unlimited, L1 unlimited
                Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
                Link: Speed 2.5Gb/s, Width x8
00: b3 15 78 62 06 04 10 00 a0 00 06 0c 10 00 00 00
10: 04 00 20 90 00 00 00 00 0c 00 00 88 00 00 00 00
20: 0c 00 00 80 00 00 00 00 00 00 00 00 b3 15 78 62
30: 00 00 00 00 40 00 00 00 00 00 00 00 09 01 00 00
40: 01 48 02 00 00 00 00 00 03 90 ff 7f 11 11 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 01 00 00 0e 64 00 00 20 00 00 81 f4 03 08
70: 00 00 41 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 11 60 1f 80 00 20 08 00 00 22 08 00
90: 05 84 8a 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


>
>We fixed the /etc/security/limits.conf problem but I don't know what to 
>do about this one.  The job seems to complete without error on 2 nodes 
>(4 processors) but to scale any larger just generates megabyte files of 
>these types of error messages.
>
>Any insights for this problem?  All searches lead me to the limits.conf 
>which we have set to 8192.  These are 8G machines if that makes any 

Bill, the memlock value in limits.conf is in KB, so 8192 is just 8MB.
See http://www.open-mpi.org/faq/?category=infiniband#ib-locked-pages
and also make sure to have
session  required        pam_limits.so
in the rsh, rlogin and rexec files under /etc/pam.d
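
To put the 8MB figure in perspective, here is a small sketch of the arithmetic, assuming the limits.conf value is in KB and using the registration size from the error message above. It ignores any other memory the driver or MPI library locks, so the real headroom is smaller:

```python
MEMLOCK_KB = 8192    # value Bill reports for limits.conf
REG_BYTES = 1052672  # size of one failed ibv_reg_mr request in the log

limit_bytes = MEMLOCK_KB * 1024
print(limit_bytes // (1024 * 1024), "MB limit")            # 8 MB limit
print(limit_bytes // REG_BYTES, "such registrations fit")  # 7 such registrations fit
```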

>difference.
>
>Thanks,
>Bill



