[openib-general] ibv_reg_mr failure with pvfs on ehca?
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Mon Oct 23 07:17:00 PDT 2006
Hoang-Nam Nguyen wrote:
> Hi Troy!
>
>> The netpipe code is available with mercurial by:
>> hg clone http://source.scl.ameslab.gov/hg/netpipe3-pvfs-dev
>> Once you have pvfs2-1.5.1 installed, you should be able to do 'make
>> pvfs' in the netpipe3-pvfs-dev directory and build NPpvfs.
>> The command line arguments I used to reproduce this were:
>> ./NPpvfs -d $PVFS_FILE_PATH -l 32768 -u 268435456 -n 100 -o
>> $NETPIPE_OUTPUT_FILE
>>
> Thanks for this. I've been struggling with setting up the systems
> to recreate this problem. Please be patient.
> Can you please send me the output of modinfo ib_ehca (or hcad_mod
> in older versions)? Also the firmware code level, as explained in my
> previous email. How much memory have you assigned to the partition?
> With that data I should be able to set up nearly the same env as yours.
>
>> This is the dmesg log:
>> PU0001 000e0091:ehca_hcall_7arg_7ret HCAD_ERROR opcode=160
>> ret=fffffffffffffff7 arg1=1000000003000004 arg2=5 arg3=4000f830000
>> arg4=10000 arg5=e0000000000000 arg6=eb6b6920 arg7=0 out1=0 out2=0
>> out3=0 out4=0 out5=0 out6=0 out7=0
>> PU0001 00090454:ehca_reg_mr HCAD_ERROR hipz_alloc_mr failed,
>> h_ret=fffffffffffffff7 hca_hndl=1000000003000004
>> PU0001 00090478:ehca_reg_mr <<< ret=ffffffea shca=c0000000e796b000
>> e_mr=c0000000ce865e80 iova_start=000004000f830000 size=10000 acl=7
>> e_pd=c0000000eb6b6920 pginfo=c0000000dfcb3a70 num_pages=10 num_4k=10
>> PU0001 00090176:ehca_reg_user_mr <<< rc=ffffffffffffffea
>> pd=c0000000eb6b6920 region=c0000000ce861dd0 mr_access_flags=7
>> udata=c0000000dfcb3ba0
>>
> I got this already from you and Kyle. I meant the full log with
> debug traces enabled: modprobe ib_ehca debug_level=1 or for older
> versions modprobe hcad_mod debug_level=9999999999999999999999. If
> possible, try to get it. Anyway I'll do that with my test env.
> Thanks!
> Nam
>
>
>
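A side note for reading the quoted dmesg log: the return values are printed as raw two's-complement hex, so they are easier to compare once converted to signed integers. The sketch below is only an illustration; the helper name to_signed is made up here, and the register widths (64-bit for h_ret and rc, 32-bit for ret) are assumed from the number of hex digits printed.

```python
import errno


def to_signed(hex_text, bits):
    """Interpret a hex string as a two's-complement integer of the given width."""
    value = int(hex_text, 16)
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


# Values copied from the quoted log; widths are assumed from the digit count.
h_ret = to_signed("fffffffffffffff7", 64)   # hipz_alloc_mr / hcall return
ret = to_signed("ffffffea", 32)             # ehca_reg_mr return
rc = to_signed("ffffffffffffffea", 64)      # ehca_reg_user_mr return

print("h_ret =", h_ret)
print("ret   =", ret, errno.errorcode.get(-ret, "unknown errno"))
print("rc    =", rc)
```

With Linux errno numbering, 22 is EINVAL, which would be what ibv_reg_mr ends up reporting to the caller. What the hcall-level value means depends on the firmware interface and is not decoded here.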
I believe we have 8GB allocated on each box (all memory and CPUs are
allocated to one partition), and we're running firmware version SF240_233.
p5l5:~# modinfo hcad_mod
filename: /lib/modules/2.6.17/kernel/drivers/infiniband/hw/ehca/hcad_mod.ko
version: SVNEHCA_0009
description: IBM eServer HCA InfiniBand Device Driver
author: Christoph Raisch <raisch at de.ibm.com>
license: Dual BSD/GPL
srcversion: 2B35F7963CEB9E6067F3F92
depends: ib_core
vermagic: 2.6.17 SMP mod_unload gcc-4.0
parm: open_aqp1:AQP1 on startup (0: no (default), 1: yes) (int)
parm: debug_level:debug level (0: node, 6: only errors (default), 9: all) (int)
parm: hw_level:hardware level (0: autosensing (default), 1: v. 0.20, 2: v. 0.21) (int)
parm: nr_ports:number of connected ports (default: 2) (int)
parm: use_hp_mr:high performance MRs (0: no (default), 1: yes) (int)
parm: port_act_time:time to wait for port activation (default: 30 sec) (int)
parm: poll_all_eqs:polls all event queues periodically (0: no, 1: yes (default)) (int)
parm: static_rate:set permanent static rate (default: disabled) (int)
Setting the debug_level flag definitely caused the server to stop
responding. I rebooted and tried it again with the same result: setting
the debug_level flag causes the server to crash. (I can still log in, but
cannot execute anything, e.g. 'ls'; it seems all the CPUs are spinning.)
p5l5:~# modprobe hcad_mod nr_ports=1 debug_level=99999999
Console output after the above command hangs the server:
PU0003 000e0252:hipz_h_register_rpage >>>
adapter_handle=1000000203000004 pagesize=0 queue_type=0
resource_handle=7000000100018600 logical_address_of_page=e6741000 count=200
PU0003 000e0078:ehca_hcall_7arg_7ret >>> opcode=1ac
arg1=1000000203000004 arg2=0 arg3=7000000100018600 arg4=e6741000
arg5=200 arg6=0 arg7=0
PU0003 000e0096:ehca_hcall_7arg_7ret <<< opcode=1ac ret=f out1=50
out2=50 out3=50 out4=50 out5=50 out6=50 out7=50
PU0003 000e0263:hipz_h_register_rpage <<< ret=f
PU0003 000e04ad:hipz_h_register_rpage_mr <<< ret=f
PU0003 0009076c:ehca_set_pagebuf >>> pginfo=c0000000eb7b75e0 type=1
num_pages=1d4000 num_4k=1d4000 next_buf=0 next_4k=30600 number=200
kpage=c0000000e6741000 page_cnt=30600 page_4k_cnt=30600 next_listelem=0
region=0000000000000000 next_chunk=0000000000000000 next_nmap=0
PU0003 00090807:ehca_set_pagebuf <<< ret=0 e_mr=c0000000e1ac2e80
pginfo=c0000000eb7b75e0 type=1 num_pages=1d4000 num_4k=1d4000 next_buf=0
next_4k=30800 number=200 kpage=c0000000e6742000 page_cnt=30800
page_4k_cnt=30800 i=200 next_listelem=0 region=0000000000000000
next_chunk=0000000000000000 next_nmap=0
PU0003 000e049e:hipz_h_register_rpage_mr >>>
adapter_handle=1000000203000004 mr=c0000000e1ac2e80
mr_handle=7000000100018600 pagesize=0 queue_type=0
logical_address_of_page=e6741000 count=200
PU0003 000e0252:hipz_h_register_rpage >>>
adapter_handle=1000000203000004 pagesize=0 queue_type=0
resource_handle=7000000100018600 logical_address_of_page=e6741000 count=200
PU0003 000e0078:ehca_hcall_7arg_7ret >>> opcode=1ac
arg1=1000000203000004 arg2=0 arg3=7000000100018600 arg4=e6741000
arg5=200 arg6=0 arg7=0
PU0003 000e0096:ehca_hcall_7arg_7ret <<< opcode=1ac ret=f out1=50
out2=50 out3=50 out4=50 out5=50 out6=50 out7=50
PU0003 000e0263:hipz_h_register_rpage <<< ret=f
<snip, it repeats forever>
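
For anyone reading the trace above, the counters can be put into more familiar units. The sketch below assumes the numeric fields are hexadecimal and that the "4k" in the field names means 4 KiB pages; neither is confirmed here, and the variable and function names are made up for the illustration.

```python
PAGE_SIZE = 4096  # assumption: "num_4k" counts 4 KiB pages


def summarize(num_pages_hex, count_hex, next_before_hex, next_after_hex):
    """Convert the hex counters from the ehca_set_pagebuf trace into totals."""
    num_pages = int(num_pages_hex, 16)
    count = int(count_hex, 16)
    before = int(next_before_hex, 16)
    after = int(next_after_hex, 16)
    return {
        "total_bytes": num_pages * PAGE_SIZE,
        "pages_per_call": count,
        "calls_needed": -(-num_pages // count),  # ceiling division
        "step": after - before,
        "fraction_done": after / num_pages,
    }


# Values copied from the trace: num_pages=1d4000, count=200,
# next_4k=30600 on entry and next_4k=30800 on exit.
info = summarize("1d4000", "200", "30600", "30800")

print("region size: %.2f GiB" % (info["total_bytes"] / 2**30))
print("pages per hcall:", info["pages_per_call"])
print("hcalls needed:", info["calls_needed"])
print("advance per iteration:", info["step"], "pages")
print("progress at this point: %.1f%%" % (100 * info["fraction_done"]))
```

Under those assumptions next_4k advances by one chunk per iteration, so the repeating output might be a very long registration loop over a region close to the size of the partition's memory, slowed down by the tracing, rather than a loop that never ends. That is only an inference from the numbers in the trace.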
--
Kyle Schochenmaier
kschoche at scl.ameslab.gov
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory