[openib-general] ibv_reg_mr failure with pvfs on ehca?

Troy Benjegerdes troy at scl.ameslab.gov
Wed Oct 18 09:55:28 PDT 2006


(I am taking this back to the openib list because I think the list  
needs to hear about real applications that are hitting memory  
registration limits)

What are the limits on the ehca memory registrations?

Is there a limit to the number of regions that can be registered? Is  
there any way (with kernel hacks) that we can register the entire  
address space of the application? We would like to be able to do RDMA  
sends and receives from anywhere in the application address space  
eventually, and only register it once.

What is the point of RDMA for memory-intensive applications if you  
have to copy the data to a registered buffer before sending it anyway?


On Oct 18, 2006, at 11:27 AM, Kyle Schochenmaier wrote:

> Hoang-Nam Nguyen wrote:
>> Hi Troy!
>>
>>> I am running PVFS2 on OpenIB, with IBM's ehca.
>>> When we start writing/reading large files, either with the NetPIPE
>>> PVFS module we have or a modified GAMESS executable that uses
>>> libpvfs2 directly, the 'ibv_reg_mr' function fails, and we get an  
>>> error.
>>> This is also correlated with kernel log messages like this:
>>> Oct 16 11:14:45 p5l8 kernel: PU0003 000e0091:ehca_hcall_7arg_7ret HCAD_ERROR
>>> opcode=160 ret=fffffffffffffff7 arg1=1000000003000004 arg2=5 arg3=14f0ebc8
>>> arg4=10000 arg5=e0000000000000 arg6=e3e9f200 arg7=0 out1=0 out2=0 out3=0
>>> out4=0 out5=0 out6=0 out7=0
>>>
>> Return code f7 from firmware/hvcall means H_NO_MEM. I'm wondering
>> if you could provide me with some history of this problem.
>> Is this a permanent problem? If yes, could you give me more information
>> on your test case or scenario, e.g. large file size, NetPIPE options?
>> Which version of ehca are you using? And which kernel version?
>> Thanks!
>> Hoang-Nam Nguyen
>>
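For anyone decoding the log line quoted above: the ret= field is a 64-bit value printed in hex, and reading it as a signed integer gives -9, which is consistent with the H_NO_MEM reading given above. Here is a minimal C sketch of that decoding. The helper names are hypothetical, and the status table lists only two codes, with values that I believe match the Linux powerpc hvcall.h header; treat them as assumptions and check the header for your kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: turn the hex text after "ret=" into a signed value. */
static int64_t hcall_ret_from_hex(const char *hex)
{
    uint64_t raw;
    int64_t rc;

    assert(hex != NULL);
    raw = (uint64_t)strtoull(hex, NULL, 16);
    memcpy(&rc, &raw, sizeof rc); /* reinterpret the bits as two's complement */
    return rc;
}

/* Assumed values (not taken from this thread's logs): H_SUCCESS = 0 and
 * H_NO_MEM = -9. Everything else is reported as unknown here. */
static const char *hcall_status_name(int64_t rc)
{
    switch (rc) {
    case 0:
        return "H_SUCCESS";
    case -9:
        return "H_NO_MEM";
    default:
        return "unknown";
    }
}
```

Passing "fffffffffffffff7" to hcall_ret_from_hex() and the result to hcall_status_name() yields the H_NO_MEM name under those assumptions.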
>>
> I think Troy could better explain what is happening here, so I'm
> taking this off-list for now -- we're trying to get this working
> for SC'06, so time is limited :) -- if Troy wants to forward this
> on to the list after looking at it, that's fine too.
> Our app writes out a file once, then reads it in many times through
> the pvfs2 system.  In the pvfs2 layers, memory caching is done at
> the network level: memory is registered by the app, and attempts
> are made to re-register and/or re-use these memory regions to save
> on memory registration overhead.  The problem occurs only while
> writing files, so presumably while memory is first being registered
> with the NIC/app and cached.  Also, our tests show that the app runs
> normally to completion on identical machines using Mellanox HCAs
> instead of the eHCA.  The file sizes are generally >16 GBytes;
> however, our failures usually appear by the time ~220-250 MBytes
> have been written (and possibly also all registered).
>
> I'm not sure the standard OpenIB NetPIPE runs can reproduce this
> type of workload.  However, we have developed a working
> PVFS2-NetPIPE module which can reproduce this problem on occasion.
> If there is interest in further testing this on your end, I can
> make it available.
>
> Our eHCAs have the following revision info:
>        vendor_id:                      0x5076
>        vendor_part_id:                 0
>        hw_ver:                         0x1000003
> Kernel version is Debian 2.6.17
>
> I hope this is enough info to get some more insight from everyone.
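
To make the registration caching described above concrete, here is a small C sketch of the general idea: reuse a cached registration when a buffer falls inside one, and when a new registration is refused, evict the least recently used entry and retry once. This is not the PVFS2 code. All names are hypothetical, the slot count is arbitrary, and the reg/dereg hooks are stand-ins for wrappers around ibv_reg_mr() and ibv_dereg_mr() (which in reality also need a protection domain and access flags). The fake registrar exists only to imitate a resource limit without any hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hooks standing in for ibv_reg_mr()/ibv_dereg_mr() wrappers. */
typedef void *(*reg_fn)(void *addr, size_t len);
typedef void (*dereg_fn)(void *handle);

#define CACHE_SLOTS 4 /* illustrative size, not a real eHCA limit */

struct reg_entry {
    void *addr;
    size_t len;
    void *handle;           /* NULL means the slot is empty */
    unsigned long last_use; /* logical clock used for LRU eviction */
};

struct reg_cache {
    struct reg_entry slot[CACHE_SLOTS];
    unsigned long clock;
    reg_fn reg;
    dereg_fn dereg;
};

static void reg_cache_init(struct reg_cache *c, reg_fn reg, dereg_fn dereg)
{
    assert(c && reg && dereg);
    memset(c, 0, sizeof *c);
    c->reg = reg;
    c->dereg = dereg;
}

/* Return a handle covering [addr, addr+len), or NULL if none can be had. */
static void *reg_cache_get(struct reg_cache *c, void *addr, size_t len)
{
    struct reg_entry *free_slot = NULL, *lru = NULL, *dst;
    uintptr_t lo = (uintptr_t)addr, hi = lo + len;
    void *h;
    int i;

    for (i = 0; i < CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        uintptr_t elo = (uintptr_t)e->addr;

        if (!e->handle) {
            if (!free_slot)
                free_slot = e;
        } else if (lo >= elo && hi <= elo + e->len) {
            e->last_use = ++c->clock; /* hit: reuse the registration */
            return e->handle;
        } else if (!lru || e->last_use < lru->last_use) {
            lru = e;
        }
    }

    h = free_slot ? c->reg(addr, len) : NULL;
    dst = free_slot;
    if (!h && lru) {
        /* No slot, or registration refused: evict the LRU entry, retry once. */
        c->dereg(lru->handle);
        lru->handle = NULL;
        if (!dst)
            dst = lru;
        h = c->reg(addr, len);
    }
    if (!h)
        return NULL;
    dst->addr = addr;
    dst->len = len;
    dst->handle = h;
    dst->last_use = ++c->clock;
    return h;
}

/* Fake registrar for experiments: refuses once fake_limit regions are live. */
static int fake_live, fake_limit = 2;

static void *fake_reg(void *addr, size_t len)
{
    (void)len;
    if (fake_live >= fake_limit)
        return NULL;
    fake_live++;
    return addr; /* the address doubles as an opaque handle */
}

static void fake_dereg(void *handle)
{
    (void)handle;
    fake_live--;
}
```

With the fake limit set to 2, a third distinct buffer still gets a handle because the oldest registration is dropped first. Whether a scheme like this would get past the ehca limit in practice is exactly the open question in this thread.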




