[openib-general] ibv_reg_mr failure with pvfs on ehca?
Troy Benjegerdes
troy at scl.ameslab.gov
Mon Oct 23 17:08:16 PDT 2006
On Oct 23, 2006, at 8:42 AM, Hoang-Nam Nguyen wrote:
> Hello Troy!
>> The netpipe code is available with mercurial by:
>> hg clone http://source.scl.ameslab.gov/hg/netpipe3-pvfs-dev
>> Once you have pvfs2-1.5.1 installed, you should be able to do 'make
>> pvfs' in the netpipe3-pvfs-dev directory and build NPpvfs.
>> The command line arguments I used to reproduce this were:
>> ./NPpvfs -d $PVFS_FILE_PATH -l 32768 -u 268435456 -n 100 -o $NETPIPE_OUTPUT_FILE
> Did you compile pvfs and NPpvfs as 32-bit or 64-bit libs/execs?
> I did compile pvfs and NPpvfs as is and realized that pvfs is built
> by default as 32-bit and NPpvfs as 64-bit. Hence NPpvfs complained
> to find incompatible pvfs libs.
> Regards
> Nam
>
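One quick way to confirm which word size a given executable or shared object was built for is to read the EI_CLASS byte of its ELF header. The following is only a minimal sketch; the function name elf_class is made up for illustration and is not part of pvfs or NetPIPE.

```python
ELF_MAGIC = b"\x7fELF"
# EI_CLASS values from the ELF specification: ELFCLASS32 = 1, ELFCLASS64 = 2
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class(path):
    """Return '32-bit' or '64-bit' for an ELF file, or None if the file
    is not ELF or carries an unknown class byte."""
    with open(path, "rb") as f:
        ident = f.read(5)  # 4 magic bytes followed by EI_CLASS
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(ident[4])
```

Calling elf_class("./NPpvfs") and elf_class() on the pvfs shared library should return the same string if the two were built consistently. A static archive (.a) is an ar file rather than an ELF object, so this sketch returns None for it; the archive members would have to be extracted and checked individually.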
I wasn't able to get reliable backtraces out of a 64-bit NPpvfs and
pvfs libs, so I rebuilt both as 32-bit, and now I get much more
interesting errors and kernel logs.
If I start 4 netpipe processes on the same node with:
./NPpvfs -l 32768 -u 268435456 -n 100 -o results/proc2.w.out -I -d /pvfs2/6node/proc2
I get errors like:
27: 786429 bytes 100 times --> 2249.96 Mbps in 2666.70 usec
28: 786432 bytes 100 times --> [E 18:47:20.394586] Error:
ib_check_cq: entry id 0x100ac7f0 opcode RDMA WRITE error
IBV_WC_LOC_PROT_ERR.
[E 18:47:20.395051] [bt] ./NPpvfs(error+0x9c) [0x1005858c]
[E 18:47:20.395087] [bt] ./NPpvfs [0x10056a00]
[E 18:47:20.395118] [bt] ./NPpvfs [0x1005726c]
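For readers not familiar with the status code: IBV_WC_LOC_PROT_ERR is a local protection error, meaning a buffer in the posted work request's scatter/gather list does not reference a memory region that is valid for the requested operation. The sketch below is only a conceptual model of that check, not the ehca or pvfs code. The class and function names are hypothetical and the addresses are made-up values, so it says nothing about which buffer actually failed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    """Hypothetical stand-in for a registered region (cf. ibv_reg_mr)."""
    addr: int
    length: int
    lkey: int


@dataclass(frozen=True)
class Sge:
    """Hypothetical stand-in for one scatter/gather entry of a work request."""
    addr: int
    length: int
    lkey: int


def sge_is_covered(sge, regions):
    """True if some region with a matching lkey contains the whole
    byte range [sge.addr, sge.addr + sge.length)."""
    for mr in regions:
        if mr.lkey != sge.lkey:
            continue
        if mr.addr <= sge.addr and sge.addr + sge.length <= mr.addr + mr.length:
            return True
    return False
```

With a region MemoryRegion(0x10000, 0x1000, 7), an entry Sge(0x10000, 0x1000, 7) is covered, while Sge(0x10000, 0x1001, 7) runs one byte past the end and is the kind of request that would be rejected with a local protection error.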
And kernel logs like this:
Oct 23 18:48:37 p5l8 kernel: PU0007 00060066:print_error_data
HCAD_ERROR QP 0xdfe (resource=2000000000000dfe) has errors.
Oct 23 18:48:37 p5l8 kernel: PU0007 00060077:print_error_data
HCAD_ERROR Error data is available: 2000000000000dfe.
Oct 23 18:48:37 p5l8 kernel: PU0007 00060079:print_error_data
HCAD_ERROR EHCA ----- error data begin
---------------------------------------------------
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f000 ofs=0000
00000000000004d0 2000000000000dfe
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f010 ofs=0010
0100000000000310 8000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f020 ofs=0020
a000000500000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f030 ofs=0030
0000000001000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f040 ofs=0040
0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f050 ofs=0050
0000000000000014 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f060 ofs=0060
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f070 ofs=0070
000000000000ffff 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f080 ofs=0080
008000000000262b 0000000000ffffff
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f090 ofs=0090
0000000000ffffff 0000000009f49900
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f0a0 ofs=00a0
00000000000e0492 000000000000000a
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f0b0 ofs=00b0
0000000000000001 000000000000002b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f0c0 ofs=00c0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f0d0 ofs=00d0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f0e0 ofs=00e0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f0f0 ofs=00f0
0000000000000000 0000000000000003
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f100 ofs=0100
000000000000001a 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f110 ofs=0110
0000000000000004 0000000000000032
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f120 ofs=0120
00000000dc9d4600 0000000003c32f28
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f130 ofs=0130
000000000009f4aa 000000000009f4aa
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f140 ofs=0140
0a00000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f150 ofs=0150
0000000000000002 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f160 ofs=0160
0000000000002633 000000000000262c
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f170 ofs=0170
0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f180 ofs=0180
0000000000000006 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f190 ofs=0190
0000000000000004 00000001da05023d
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f1a0 ofs=01a0
000000000000001f 000000000000262b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f1b0 ofs=01b0
00000000dc9d4600 0000000003c32f28
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f1c0 ofs=01c0
0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f1d0 ofs=01d0
00000000dc9e5600 0000000003c33328
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f1e0 ofs=01e0
0000000000000006 0000000000000001
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f1f0 ofs=01f0
0000000000000003 000000000000262c
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f200 ofs=0200
000000000009f499 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f210 ofs=0210
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f220 ofs=0220
0000000000000000 0000003000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f230 ofs=0230
0000000000000002 000000000000262b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f240 ofs=0240
0000000000000000 000000000009f4a9
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f250 ofs=0250
00000000e3e9f820 0000000000000106
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f260 ofs=0260
0000000000000106 0000000000000003
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f270 ofs=0270
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f280 ofs=0280
008000000000262b 0000000000ffffff
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f290 ofs=0290
0000000000ffffff 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f2a0 ofs=02a0
000000000000262c 8000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f2b0 ofs=02b0
09f22a0000000000 3808000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f2c0 ofs=02c0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f2d0 ofs=02d0
0000000000000000 2000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f2e0 ofs=02e0
8000000000000000 3808000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f2f0 ofs=02f0
0000000000000000 6800000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f300 ofs=0300
a800000000000000 0000003000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f310 ofs=0310
4000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f320 ofs=0320
0000000000000000 02000000000000c8
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f330 ofs=0330
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f340 ofs=0340
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f350 ofs=0350
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f360 ofs=0360
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f370 ofs=0370
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f380 ofs=0380
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f390 ofs=0390
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f3a0 ofs=03a0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f3b0 ofs=03b0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f3c0 ofs=03c0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f3d0 ofs=03d0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f3e0 ofs=03e0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f3f0 ofs=03f0
0000000000000000 0400000000000060
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f400 ofs=0400
8000000000000000 c000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f410 ofs=0410
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f420 ofs=0420
0000000003c2a383 00000000d7bc8280
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f430 ofs=0430
000000000000043f 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f440 ofs=0440
0000000000000000 0003000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f450 ofs=0450
0000000000000004 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f460 ofs=0460
0300000000000068 8040000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f470 ofs=0470
c000c00000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f480 ofs=0480
0000000000000000 0000000003c4ae81
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f490 ofs=0490
00000000fbe4f960 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f4a0 ofs=04a0
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f4b0 ofs=04b0
0000000000000000 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f4c0 ofs=04c0
0000000000000004 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007c:print_error_data
HCAD_ERROR EHCA ----- error data end
----------------------------------------------------
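Since the mailer wrapped each record of the dump across three lines, here is a small sketch for turning the pasted text back into (offset, word, word) tuples so the non-zero words are easier to pick out. It assumes only the record layout visible above (adr=, ofs=, then two 16-digit hex words per record); the function names are made up for illustration.

```python
import re

# One record: "adr=<hex> ofs=<hex> <16 hex digits> <16 hex digits>".
# \s+ also matches newlines, so wrapped and unwrapped logs both parse.
RECORD = re.compile(
    r"adr=(?P<adr>[0-9a-f]+)\s+ofs=(?P<ofs>[0-9a-f]+)\s+"
    r"(?P<w0>[0-9a-f]{16})\s+(?P<w1>[0-9a-f]{16})"
)


def parse_error_data(text):
    """Return a list of (offset, word0, word1) integer tuples."""
    return [
        (int(m.group("ofs"), 16), int(m.group("w0"), 16), int(m.group("w1"), 16))
        for m in RECORD.finditer(text)
    ]


def nonzero_words(records):
    """Return (byte_offset, value) for every non-zero 64-bit word,
    treating each record as two consecutive 8-byte words."""
    out = []
    for ofs, w0, w1 in records:
        if w0:
            out.append((ofs, w0))
        if w1:
            out.append((ofs + 8, w1))
    return out
```

Feeding the saved log text through nonzero_words(parse_error_data(text)) gives a compact list of offsets and values, which is easier to compare between two failing runs than the raw syslog output.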