[openib-general] ibv_reg_mr fails: bad address [was: Re: Problem with mca_mpool_openib_register - Cannot allocate memory]
Bernhard Fischer
rep.nop at aon.at
Thu Jul 13 11:51:16 PDT 2006
On Mon, Jun 19, 2006 at 11:11:12AM -0400, Bill Wichser wrote:
>Running the openib stack from Redhat on a 2.6.9-34.ELsmp kernel, dual
>Xeon. Running with openmpi v1.0.2 compiled w/gcc.
>
>While we still have the problem with btl_openib_endpoint.c returning 0
>byte(s) for max inline data, and realize that another IB stack addresses
>this, another problem when running across more than a single host pops
>up generating huge amounts of error messages.
>
>The errors go something like this:
>
>mca_mpool_openib_register: ibv_reg_mr(0x2ac2622000,1052672) failed with
>error: Cannot allocate memory
>[0,1,1][btl_openib.c:496:mca_btl_openib_prepare_dst]
>mpool_register(0x2ac2622040,1048576) failed: base 0x2ac2222040 lb 0
>offset 4194304
While 8MB (as Bill stated below) is likely to fail, I'm getting:
[x86-64n001:07622] mca_mpool_openib_register:
ibv_reg_mr(0x717f2000,2113536) failed with error: Bad address
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x717f1ff8
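For reference, the numbers in both logs can be cross-checked with a little address arithmetic. The sketch below is only an inference from the logged values, not taken from the Open MPI source: it assumes a 4096-byte page size and that the registration is widened to whole pages. The helper name page_align is made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: x86-64 base page size


def page_align(addr, length, page_size=PAGE_SIZE):
    """Widen [addr, addr + length) to whole pages; return (start, length)."""
    start = addr & ~(page_size - 1)                            # round down
    end = (addr + length + page_size - 1) & ~(page_size - 1)   # round up
    return start, end - start


# Values from Bill's log: base, offset, and requested size.
base, offset, size = 0x2AC2222040, 4194304, 1048576
buf = base + offset
start, length = page_align(buf, size)
print(hex(buf), hex(start), length)
# prints: 0x2ac2622040 0x2ac2622000 1052672

# Values from my log: distance of the faulting address below the region start.
fault_gap = 0x717F2000 - 0x717F1FF8
print(fault_gap)
# prints: 8
```

So base + offset gives the address passed to mpool_register, and rounding to pages gives exactly the address and length reported for ibv_reg_mr in Bill's log. In my case the fault is 8 bytes below the start of the region being registered.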
Two boxes are involved, each with 2GB of memory; my ulimits are OK:
$ ulimit -l;rsh 10.100.0.44 "ulimit -l"
unlimited
unlimited
Any hint on this one?
TIA,
Bernhard
$ cat /sys/class/infiniband/mthca0/fw_ver
4.7.400
# lspci -vvxxx -s 01:00.0
01:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex HCA (Tavor compatibility mode) (rev a0)
Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex HCA (Tavor compatibility mode)
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size 10
Interrupt: pin A routed to IRQ 177
Region 0: Memory at 0000000090200000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at 0000000088000000 (64-bit, prefetchable) [size=8M]
Region 4: Memory at 0000000080000000 (64-bit, prefetchable) [size=128M]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [84] MSI-X: Enable+ Mask- TabSize=32
Vector table: BAR=0 offset=00082000
PBA: BAR=0 offset=00082200
Capabilities: [60] Express Endpoint IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <64ns, L1 unlimited
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
Link: Latency L0s unlimited, L1 unlimited
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x8
00: b3 15 78 62 06 04 10 00 a0 00 06 0c 10 00 00 00
10: 04 00 20 90 00 00 00 00 0c 00 00 88 00 00 00 00
20: 0c 00 00 80 00 00 00 00 00 00 00 00 b3 15 78 62
30: 00 00 00 00 40 00 00 00 00 00 00 00 09 01 00 00
40: 01 48 02 00 00 00 00 00 03 90 ff 7f 11 11 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 01 00 00 0e 64 00 00 20 00 00 81 f4 03 08
70: 00 00 41 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 11 60 1f 80 00 20 08 00 00 22 08 00
90: 05 84 8a 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
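The first row of the config-space dump can be decoded by hand to confirm it agrees with the lspci text above. This is a small sketch using the standard PCI header layout (little-endian fields); the variable names are chosen only for illustration.

```python
import struct

# First 16 bytes of the config space, copied from the dump above.
row0 = bytes.fromhex("b3 15 78 62 06 04 10 00 a0 00 06 0c 10 00 00 00")

# Offsets 0x00..0x07: vendor ID, device ID, command, status (16-bit LE each).
vendor, device, command, status = struct.unpack_from("<HHHH", row0, 0)
revision = row0[0x08]                            # revision ID
class_code = (row0[0x0B] << 8) | row0[0x0A]      # base class, subclass

print(f"vendor={vendor:#06x} device={device:#06x} rev={revision:#04x}")
print(f"class={class_code:#06x} command={command:#06x}")

# Command register bits: 1 = memory space enable, 2 = bus master enable.
mem_enabled = bool(command & 0x2)
bus_master = bool(command & 0x4)
print(mem_enabled, bus_master)
```

The revision byte matches the "rev a0" in the lspci header line, and the two command bits match the "Mem+ BusMaster+" flags in the Control line.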
>
>We fixed the /etc/security/limits.conf problem but I don't know what to
>do about this one. The job seems to complete without error on 2 nodes
>(4 processors) but to scale any larger just generates megabyte files of
>these types of error messages.
>
>Any insights for this problem? All searches lead me to the limits.conf
>which we have set to 8192. These are 8G machines if that makes any
Bill, 8192 is just 8MB (the memlock value in limits.conf is given in KB).
See http://www.open-mpi.org/faq/?category=infiniband#ib-locked-pages
and also make sure to have
session required pam_limits.so
in the rsh, rlogin and rexec files in /etc/pam.d
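To make the unit conversion concrete and to see which limit a process really gets, something like the following can be run on each node, ideally started the same way the MPI processes are started, since a login shell and an rsh-launched process may not see the same limits. This is a minimal sketch using Python's standard resource module; the function names are made up for illustration.

```python
import resource


def limits_conf_kb_to_bytes(kb):
    """limits.conf memlock values are in KB; convert to bytes."""
    return kb * 1024


def describe_memlock():
    """Return the (soft, hard) locked-memory limits of this process as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value} bytes"

    return fmt(soft), fmt(hard)


print(limits_conf_kb_to_bytes(8192))
# prints: 8388608
print(describe_memlock())
```

With a memlock setting of 8192, a process can lock 8388608 bytes in total, which is not much room for registered buffers.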
>difference.
>
>Thanks,
>Bill