[openib-general] ibv_reg_mr failure with pvfs on ehca?

Hoang-Nam Nguyen HNGUYEN at de.ibm.com
Tue Oct 24 00:21:38 PDT 2006


Hi Kyle!
> And, setting the debug_level flag definitely caused the server to not
> respond...  I rebooted and tried it again, same thing, setting the
> debug_level flag causes the server to crash. (I can still login, but
> cannot execute anything, e.g. 'ls', it seems all the cpu's are spinning)
> p5l5:~# modprobe hcad_mod nr_ports=1 debug_level=99999999
> console output after above command hangs server:
> PU0003 000e0252:hipz_h_register_rpage >>>
> adapter_handle=1000000203000004 pagesize=0 queue_type=0
> resource_handle=7000000100018600 logical_address_of_page=e6741000
> count=200
> PU0003 000e0078:ehca_hcall_7arg_7ret >>> opcode=1ac
> arg1=1000000203000004 arg2=0 arg3=7000000100018600 arg4=e6741000
> arg5=200 arg6=0 arg7=0
> PU0003 000e0096:ehca_hcall_7arg_7ret <<< opcode=1ac ret=f out1=50
> out2=50 out3=50 out4=50 out5=50 out6=50 out7=50
> PU0003 000e0263:hipz_h_register_rpage <<< ret=f
> PU0003 000e04ad:hipz_h_register_rpage_mr <<< ret=f
> PU0003 0009076c:ehca_set_pagebuf >>> pginfo=c0000000eb7b75e0 type=1
> num_pages=1d4000 num_4k=1d4000 next_buf=0 next_4k=30600 number=200
> kpage=c0000000e6741000 page_cnt=30600 page_4k_cnt=30600 next_listelem=0
> region=0000000000000000 next_chunk=0000000000000000 next_nmap=0
> PU0003 00090807:ehca_set_pagebuf <<< ret=0 e_mr=c0000000e1ac2e80
> pginfo=c0000000eb7b75e0 type=1 num_pages=1d4000 num_4k=1d4000 next_buf=0
> next_4k=30800 number=200 kpage=c0000000e6742000 page_cnt=30800
> page_4k_cnt=30800 i=200 next_listelem=0 region=0000000000000000
> next_chunk=0000000000000000 next_nmap=0
> PU0003 000e049e:hipz_h_register_rpage_mr >>>
> adapter_handle=1000000203000004 mr=c0000000e1ac2e80
> mr_handle=7000000100018600 pagesize=0 queue_type=0
> logical_address_of_page=e6741000 count=200
> PU0003 000e0252:hipz_h_register_rpage >>>
> adapter_handle=1000000203000004 pagesize=0 queue_type=0
> resource_handle=7000000100018600 logical_address_of_page=e6741000
> count=200
> PU0003 000e0078:ehca_hcall_7arg_7ret >>> opcode=1ac
> arg1=1000000203000004 arg2=0 arg3=7000000100018600 arg4=e6741000
> arg5=200 arg6=0 arg7=0
> PU0003 000e0096:ehca_hcall_7arg_7ret <<< opcode=1ac ret=f out1=50
> out2=50 out3=50 out4=50 out5=50 out6=50 out7=50
> PU0003 000e0263:hipz_h_register_rpage <<< ret=f
> <snip, it repeats forever>
We looked at the traces above and saw an MR registration with 0x1d4000
pages, which is about 7.3 GB. In this part of the trace we are
registering pages 0x30600-0x307FF. So our guess is that the system is
busy flushing out the remaining traces and only appears to hang, since
you can still log in to it or ping it.
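As a rough check of those numbers, the arithmetic can be sketched in
shell. This assumes 4 KiB pages (the trace reports num_4k) and 0x200
pages per register_rpage call (count=200 in the trace); the variable
names are made up for illustration only.

```shell
# Values copied from the trace; all are hexadecimal there.
num_pages=$((0x1d4000))      # num_pages=1d4000
pages_per_call=$((0x200))    # count=200
next_page=$((0x30600))       # next_4k=30600
page_size=4096               # assumption: 4 KiB pages (num_4k)

total_bytes=$((num_pages * page_size))
tenths_gib=$((total_bytes * 10 / 1073741824))   # size in tenths of a GiB

calls_total=$((num_pages / pages_per_call))
calls_done=$((next_page / pages_per_call))

echo "MR size: $total_bytes bytes (~$((tenths_gib / 10)).$((tenths_gib % 10)) GiB)"
echo "register_rpage calls so far: $calls_done of $calls_total"
```

Under these assumptions the trace shown is only at call 387 of 3744, so
with full tracing enabled there is a lot of output still to come.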
Fortunately you have an "old" version of ehca that allows selecting
debug traces for certain components. In this case I would enable
debug traces for mrmw only, and the command for that looks like
this:
echo 66666666696666666666 > /sys/bus/ibmebus/drivers/ehca/debug_level
              ^this should turn on debug traces for mrmw only
Or pass the debug_level option to modprobe:
modprobe hcad_mod debug_level=66666666696666666666
Then you should see only mrmw traces in dmesg. That is still a lot,
because we register the whole memory space at module load time.
If that still seems to hang, I can provide you with a debug patch
later. For now please give us a little time to set up test
environments and recreate your problem.
Thanks!
Nam




