[openib-general] ibv_reg_mr failure with pvfs on ehca?

Troy Benjegerdes troy at scl.ameslab.gov
Mon Oct 23 17:08:16 PDT 2006


On Oct 23, 2006, at 8:42 AM, Hoang-Nam Nguyen wrote:

> Hello Troy!
>> The netpipe code is available with mercurial by:
>> hg clone http://source.scl.ameslab.gov/hg/netpipe3-pvfs-dev
>> Once you have pvfs2-1.5.1 installed, you should be able to do 'make
>> pvfs' in the netpipe3-pvfs-dev directory and build NPpvfs.
>> The command line arguments I used to reproduce this were:
>> ./NPpvfs -d $PVFS_FILE_PATH -l 32768 -u 268435456 -n 100 -o
>> $NETPIPE_OUTPUT_FILE
> Did you compile pvfs and NPpvfs as 32-bit or 64-bit libs/execs?
> I did compile pvfs and NPpvfs as is and realized that pvfs is built
> by default as 32-bit and NPpvfs as 64-bit. Hence NPpvfs complained
> to find incompatible pvfs libs.
> Regards
> Nam
>

I wasn't able to get reliable backtraces out of a 64-bit NPpvfs and
pvfs libs, so I rebuilt as 32-bit, and now I get much more
interesting errors and kernel logs.

If I start 4 netpipe processes on the same node with:

  ./NPpvfs -l 32768 -u 268435456 -n 100 -o results/proc2.w.out -I -d /pvfs2/6node/proc2
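
For anyone who wants to script the four-process run, here is a minimal sketch that only builds the four command lines. The helper name nppvfs_command is hypothetical, and numbering the processes 1 through 4 is an assumption extrapolated from the single proc2 example above. It also reads the wrapped -d argument as /pvfs2/6node/proc2. The sizes are 32768 bytes (32 KiB) and 268435456 bytes (256 MiB).

```python
import shlex


def nppvfs_command(proc_index, pvfs_dir="/pvfs2/6node", results_dir="results"):
    """Build the NPpvfs argument list for one process.

    The proc<N> naming is copied from the proc2 example in the mail;
    the other indices are an assumption.
    """
    return [
        "./NPpvfs",
        "-l", "32768",        # 32 KiB
        "-u", "268435456",    # 256 MiB
        "-n", "100",
        "-o", f"{results_dir}/proc{proc_index}.w.out",
        "-I",
        "-d", f"{pvfs_dir}/proc{proc_index}",
    ]


# Assumed numbering: proc1 .. proc4, all on the same node.
commands = [nppvfs_command(i) for i in range(1, 5)]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.Popen unchanged to start the processes concurrently.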

I get errors like:

  27:  786429 bytes    100 times -->   2249.96 Mbps in    2666.70 usec
  28:  786432 bytes    100 times --> [E 18:47:20.394586] Error: ib_check_cq: entry id 0x100ac7f0 opcode RDMA WRITE error IBV_WC_LOC_PROT_ERR.
[E 18:47:20.395051]     [bt] ./NPpvfs(error+0x9c) [0x1005858c]
[E 18:47:20.395087]     [bt] ./NPpvfs [0x10056a00]
[E 18:47:20.395118]     [bt] ./NPpvfs [0x1005726c]
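
As a quick sanity check on the numbers in that output, here is a small sketch. It assumes the Mbps column counts a megabit as 2**20 bits. That is inferred from the arithmetic, not taken from the NetPIPE source, but it does reproduce the reported figure. The failing transfer is the first one at 786432 bytes, which is 0xc0000 (768 KiB, or exactly 192 pages of 4 KiB). The last successful one is three bytes shorter.

```python
def netpipe_mbps(nbytes, usec):
    """Throughput in Mbps, assuming 1 Mb = 2**20 bits (an inference)."""
    return nbytes * 8 / (usec * 1e-6) / 2**20


last_ok_bytes, last_ok_usec = 786429, 2666.70   # test 27, succeeded
failing_bytes = 786432                          # test 28, IBV_WC_LOC_PROT_ERR

print(f"test 27: {netpipe_mbps(last_ok_bytes, last_ok_usec):.2f} Mbps")
print(f"test 28: {failing_bytes} bytes = {hex(failing_bytes)}, "
      f"{failing_bytes // 4096} pages of 4 KiB, "
      f"remainder {failing_bytes % 4096}")
```

Whether that size boundary has anything to do with the protection error is not established here; it is only the point where the run stops.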


And kernel logs like this:

Oct 23 18:48:37 p5l8 kernel: PU0007 00060066:print_error_data  
HCAD_ERROR  QP 0xdfe (resource=2000000000000dfe) has errors.
Oct 23 18:48:37 p5l8 kernel: PU0007 00060077:print_error_data  
HCAD_ERROR  Error data is available: 2000000000000dfe.
Oct 23 18:48:37 p5l8 kernel: PU0007 00060079:print_error_data  
HCAD_ERROR  EHCA ----- error data begin  
---------------------------------------------------
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f000 ofs=0000  
00000000000004d0 2000000000000dfe
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f010 ofs=0010  
0100000000000310 8000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f020 ofs=0020  
a000000500000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f030 ofs=0030  
0000000001000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f040 ofs=0040  
0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f050 ofs=0050  
0000000000000014 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f060 ofs=0060  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f070 ofs=0070  
000000000000ffff 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f080 ofs=0080  
008000000000262b 0000000000ffffff
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f090 ofs=0090  
0000000000ffffff 0000000009f49900
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f0a0 ofs=00a0  
00000000000e0492 000000000000000a
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f0b0 ofs=00b0  
0000000000000001 000000000000002b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f0c0 ofs=00c0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f0d0 ofs=00d0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f0e0 ofs=00e0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f0f0 ofs=00f0  
0000000000000000 0000000000000003
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f100 ofs=0100  
000000000000001a 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f110 ofs=0110  
0000000000000004 0000000000000032
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f120 ofs=0120  
00000000dc9d4600 0000000003c32f28
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f130 ofs=0130  
000000000009f4aa 000000000009f4aa
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f140 ofs=0140  
0a00000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f150 ofs=0150  
0000000000000002 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f160 ofs=0160  
0000000000002633 000000000000262c
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f170 ofs=0170  
0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f180 ofs=0180  
0000000000000006 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f190 ofs=0190  
0000000000000004 00000001da05023d
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f1a0 ofs=01a0  
000000000000001f 000000000000262b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f1b0 ofs=01b0  
00000000dc9d4600 0000000003c32f28
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f1c0 ofs=01c0  
0000000000000001 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f1d0 ofs=01d0  
00000000dc9e5600 0000000003c33328
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f1e0 ofs=01e0  
0000000000000006 0000000000000001
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f1f0 ofs=01f0  
0000000000000003 000000000000262c
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f200 ofs=0200  
000000000009f499 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f210 ofs=0210  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f220 ofs=0220  
0000000000000000 0000003000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f230 ofs=0230  
0000000000000002 000000000000262b
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f240 ofs=0240  
0000000000000000 000000000009f4a9
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f250 ofs=0250  
00000000e3e9f820 0000000000000106
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f260 ofs=0260  
0000000000000106 0000000000000003
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f270 ofs=0270  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f280 ofs=0280  
008000000000262b 0000000000ffffff
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f290 ofs=0290  
0000000000ffffff 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f2a0 ofs=02a0  
000000000000262c 8000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f2b0 ofs=02b0  
09f22a0000000000 3808000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f2c0 ofs=02c0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f2d0 ofs=02d0  
0000000000000000 2000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f2e0 ofs=02e0  
8000000000000000 3808000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f2f0 ofs=02f0  
0000000000000000 6800000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f300 ofs=0300  
a800000000000000 0000003000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f310 ofs=0310  
4000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f320 ofs=0320  
0000000000000000 02000000000000c8
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f330 ofs=0330  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f340 ofs=0340  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f350 ofs=0350  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f360 ofs=0360  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f370 ofs=0370  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f380 ofs=0380  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f390 ofs=0390  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f3a0 ofs=03a0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f3b0 ofs=03b0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f3c0 ofs=03c0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f3d0 ofs=03d0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f3e0 ofs=03e0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f3f0 ofs=03f0  
0000000000000000 0400000000000060
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f400 ofs=0400  
8000000000000000 c000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f410 ofs=0410  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f420 ofs=0420  
0000000003c2a383 00000000d7bc8280
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f430 ofs=0430  
000000000000043f 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f440 ofs=0440  
0000000000000000 0003000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f450 ofs=0450  
0000000000000004 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f460 ofs=0460  
0300000000000068 8040000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f470 ofs=0470  
c000c00000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f480 ofs=0480  
0000000000000000 0000000003c4ae81
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f490 ofs=0490  
00000000fbe4f960 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f4a0 ofs=04a0  
0000000000000000 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f4b0 ofs=04b0  
0000000000000000 0000000000000004
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data   
resource=2000000000000dfe adr=c00000012ec3f4c0 ofs=04c0  
0000000000000004 0000000000000000
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007c:print_error_data  
HCAD_ERROR  EHCA ----- error data end  
----------------------------------------------------
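
To make the dump above easier to compare between runs, here is a minimal sketch that pulls the offset and the two 64-bit words out of each print_error_data entry. The function name parse_ehca_dump is hypothetical, the regular expression is based only on the layout visible in this log, and the sketch makes no attempt to say what the individual words mean. It tolerates the line wrapping introduced by the mail client.

```python
import re

# One dump entry looks like "resource=<hex> adr=<hex> ofs=<hex> <word0> <word1>".
# \s+ is used between fields because the mail wrapped the lines at whitespace.
ENTRY_RE = re.compile(
    r"resource=(?P<resource>[0-9a-f]+)\s+"
    r"adr=(?P<adr>[0-9a-f]+)\s+"
    r"ofs=(?P<ofs>[0-9a-f]+)\s+"
    r"(?P<w0>[0-9a-f]{16})\s+(?P<w1>[0-9a-f]{16})"
)


def parse_ehca_dump(text):
    """Return {offset: (word0, word1)} for every dump entry found in text."""
    entries = {}
    for m in ENTRY_RE.finditer(text):
        entries[int(m["ofs"], 16)] = (int(m["w0"], 16), int(m["w1"], 16))
    return entries


# First two entries of the dump, wrapped as they appear in the mail.
sample = """\
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f000 ofs=0000
00000000000004d0 2000000000000dfe
Oct 23 18:48:37 p5l8 kernel: PU0007 0006007a:print_error_data
resource=2000000000000dfe adr=c00000012ec3f010 ofs=0010
0100000000000310 8000000000000000
"""

dump = parse_ehca_dump(sample)
# Skip all-zero entries so the interesting words stand out.
for ofs, (w0, w1) in sorted(dump.items()):
    if w0 or w1:
        print(f"ofs={ofs:04x} {w0:016x} {w1:016x}")
```

In the full log the first word at offset 0 is 0x4d0, which equals the offset just past the last entry (0x04c0 + 0x10), so it may be a length field; that reading is a guess.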




