[openib-general] ibv_reg_mr failure with pvfs on ehca?

Kyle Schochenmaier kschoche at scl.ameslab.gov
Mon Oct 23 07:17:00 PDT 2006


Hoang-Nam Nguyen wrote:
> Hi Troy!
>   
>> The netpipe code is available with mercurial by:
>> hg clone http://source.scl.ameslab.gov/hg/netpipe3-pvfs-dev
>> Once you have pvfs2-1.5.1 installed, you should be able to do 'make
>> pvfs' in the netpipe3-pvfs-dev directory and build NPpvfs.
>> The command line arguments I used to reproduce this were:
>> ./NPpvfs -d $PVFS_FILE_PATH -l 32768 -u 268435456 -n 100 -o
>> $NETPIPE_OUTPUT_FILE
>>     
> Thanks for this. I've been struggling to set up the systems
> to recreate this problem. Please be patient.
> Can you please send me the output of modinfo ib_ehca (or hcad_mod
> in older versions)? Also the firmware code level, as explained in the
> previous email. How much memory have you assigned to the partition?
> With that data I should be able to set up nearly the same environment
> as yours.
>   
>> This is the dmesg log:
>> PU0001 000e0091:ehca_hcall_7arg_7ret HCAD_ERROR  opcode=160
>> ret=fffffffffffffff7 arg1=1000000003000004 arg2=5 arg3=4000f830000
>> arg4=10000 arg5=e0000000000000 arg6=eb6b6920 arg7=0 out1=0 out2=0
>> out3=0 out4=0 out5=0 out6=0 out7=0
>> PU0001 00090454:ehca_reg_mr HCAD_ERROR  hipz_alloc_mr failed,
>> h_ret=fffffffffffffff7 hca_hndl=1000000003000004
>> PU0001 00090478:ehca_reg_mr <<< ret=ffffffea shca=c0000000e796b000
>> e_mr=c0000000ce865e80 iova_start=000004000f830000 size=10000 acl=7
>> e_pd=c0000000eb6b6920 pginfo=c0000000dfcb3a70 num_pages=10 num_4k=10
>> PU0001 00090176:ehca_reg_user_mr <<< rc=ffffffffffffffea
>> pd=c0000000eb6b6920 region=c0000000ce861dd0 mr_access_flags=7
>> udata=c0000000dfcb3ba0
>>     
> I got this already from you and Kyle. I meant the full log with
> debug traces enabled: modprobe ib_ehca debug_level=1 or for older
> versions modprobe hcad_mod debug_level=9999999999999999999999. If
> possible, try to get it. Anyway I'll do that with my test env.
> Thanks!
> Nam
>
>
>   
I believe we have 8GB allocated on this box (all memory and CPUs
allocated to one partition), and we're running firmware version SF240_233.

p5l5:~# modinfo hcad_mod
filename:       /lib/modules/2.6.17/kernel/drivers/infiniband/hw/ehca/hcad_mod.ko
version:        SVNEHCA_0009
description:    IBM eServer HCA InfiniBand Device Driver
author:         Christoph Raisch <raisch at de.ibm.com>
license:        Dual BSD/GPL
srcversion:     2B35F7963CEB9E6067F3F92
depends:        ib_core
vermagic:       2.6.17 SMP mod_unload gcc-4.0
parm:           open_aqp1:AQP1 on startup (0: no (default), 1: yes) (int)
parm:           debug_level:debug level (0: node, 6: only errors (default), 9: all) (int)
parm:           hw_level:hardware level (0: autosensing (default), 1: v. 0.20, 2: v. 0.21) (int)
parm:           nr_ports:number of connected ports (default: 2) (int)
parm:           use_hp_mr:high performance MRs (0: no (default), 1: yes) (int)
parm:           port_act_time:time to wait for port activation (default: 30 sec) (int)
parm:           poll_all_eqs:polls all event queues periodically (0: no, 1: yes (default)) (int)
parm:           static_rate:set permanent static rate (default: disabled) (int)

Also, setting the debug_level flag definitely causes the server to stop
responding. I rebooted and tried again with the same result: setting
debug_level hangs the server. (I can still log in, but cannot execute
anything, e.g. 'ls'; it seems all the CPUs are spinning.)
p5l5:~# modprobe hcad_mod nr_ports=1 debug_level=99999999

Console output after the above command hangs the server:
PU0003 000e0252:hipz_h_register_rpage >>> 
adapter_handle=1000000203000004 pagesize=0 queue_type=0 
resource_handle=7000000100018600 logical_address_of_page=e6741000 count=200
PU0003 000e0078:ehca_hcall_7arg_7ret >>> opcode=1ac 
arg1=1000000203000004 arg2=0 arg3=7000000100018600 arg4=e6741000 
arg5=200 arg6=0 arg7=0
PU0003 000e0096:ehca_hcall_7arg_7ret <<< opcode=1ac ret=f out1=50 
out2=50 out3=50 out4=50 out5=50 out6=50 out7=50
PU0003 000e0263:hipz_h_register_rpage <<< ret=f
PU0003 000e04ad:hipz_h_register_rpage_mr <<< ret=f
PU0003 0009076c:ehca_set_pagebuf >>> pginfo=c0000000eb7b75e0 type=1 
num_pages=1d4000 num_4k=1d4000 next_buf=0 next_4k=30600 number=200 
kpage=c0000000e6741000 page_cnt=30600 page_4k_cnt=30600 next_listelem=0 
region=0000000000000000 next_chunk=0000000000000000 next_nmap=0
PU0003 00090807:ehca_set_pagebuf <<< ret=0 e_mr=c0000000e1ac2e80 
pginfo=c0000000eb7b75e0 type=1 num_pages=1d4000 num_4k=1d4000 next_buf=0 
next_4k=30800 number=200 kpage=c0000000e6742000 page_cnt=30800 
page_4k_cnt=30800 i=200 next_listelem=0 region=0000000000000000 
next_chunk=0000000000000000 next_nmap=0
PU0003 000e049e:hipz_h_register_rpage_mr >>> 
adapter_handle=1000000203000004 mr=c0000000e1ac2e80 
mr_handle=7000000100018600 pagesize=0 queue_type=0 
logical_address_of_page=e6741000 count=200
PU0003 000e0252:hipz_h_register_rpage >>> 
adapter_handle=1000000203000004 pagesize=0 queue_type=0 
resource_handle=7000000100018600 logical_address_of_page=e6741000 count=200
PU0003 000e0078:ehca_hcall_7arg_7ret >>> opcode=1ac 
arg1=1000000203000004 arg2=0 arg3=7000000100018600 arg4=e6741000 
arg5=200 arg6=0 arg7=0
PU0003 000e0096:ehca_hcall_7arg_7ret <<< opcode=1ac ret=f out1=50 
out2=50 out3=50 out4=50 out5=50 out6=50 out7=50
PU0003 000e0263:hipz_h_register_rpage <<< ret=f
<snip, it repeats forever>
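
As an aid to reading the traces in this thread, here is a minimal Python sketch. It is not part of the driver or its tools, and the helper names parse_fields and to_signed are made up for illustration. It assumes that every name=value field in the trace is printed in hexadecimal, which num_pages=1d4000 suggests but the log does not state. Under that assumption it decodes the return codes from the earlier dmesg log as signed integers and estimates how much work the page registration loop has to do.

```python
import re

# Assumption: all name=value fields in the ehca trace are hexadecimal.
FIELD = re.compile(r"(\w+)=([0-9a-fA-F]+)\b")


def parse_fields(line):
    """Return {name: int} for every name=hexvalue pair in a trace line."""
    return {name: int(value, 16) for name, value in FIELD.findall(line)}


def to_signed(value, bits=64):
    """Interpret an unsigned value as a two's-complement integer."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


# Return codes copied from the dmesg log quoted earlier in this thread.
h_ret = to_signed(0xFFFFFFFFFFFFFFF7)      # hcall status from hipz_alloc_mr
rc = to_signed(0xFFFFFFEA, bits=32)        # ret from ehca_reg_mr

# Two consecutive ehca_set_pagebuf trace lines (abbreviated).
before = parse_fields("num_pages=1d4000 num_4k=1d4000 next_4k=30600 number=200")
after = parse_fields("num_pages=1d4000 num_4k=1d4000 next_4k=30800 number=200")

total_pages = before["num_4k"]                  # 4k pages in the region
step = after["next_4k"] - before["next_4k"]     # pages handled per iteration
size_gib = total_pages * 4096 / 2**30           # region size in GiB
calls_needed = total_pages // step              # iterations to cover the region
done_fraction = after["next_4k"] / total_pages  # progress at this trace line

print(f"h_ret={h_ret} rc={rc}")
print(f"region: {total_pages} pages, {size_gib} GiB")
print(f"step: {step} pages, {calls_needed} iterations needed")
print(f"progress at this point: {done_fraction:.1%}")
```

Read this way, the 32-bit value ffffffea is -22, which matches -EINVAL on Linux, and fffffffffffffff7 is -9. The region in the trace is 0x1d4000 4k pages (about 7.3 GiB), handled 0x200 pages per iteration, so about 3744 iterations would be needed. One possible reading is that the loop is very slow with full tracing on the console rather than truly endless, but that is not verified here.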


-- 
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 




