[libfabric-users] sockets provider, number of outstanding reads

Tue Feb 9 08:44:39 PST 2016

Hi,

I recently ran into the following behavior using the sockets provider in
a threaded environment with FI_RMA.  My example posts N fi_reads to pull
a registered source buffer into a registered destination buffer.  It is
a simple benchmark, so I'm not worried about the integrity of the
destination buffer.  Let's say N=256: for each message size up to a max
of 1MB, I post 256 fi_reads and then wait for N completions before
moving on to the next size.  All N reads use the same base pointers.
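
For scale, the amount of data requested per iteration is N times the
message size.  A minimal sketch of that arithmetic follows; the helper
name bytes_requested is hypothetical (not part of libfabric), and it
assumes "256K" and "1MB" mean binary units (256 * 1024 and 1024 * 1024
bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a libfabric call: total bytes requested when
 * n reads of msg_size bytes are all posted before any completion is
 * reaped. */
size_t bytes_requested(size_t n, size_t msg_size)
{
    /* Guard against size_t overflow in the multiplication. */
    assert(msg_size == 0 || n <= SIZE_MAX / msg_size);
    return n * msg_size;
}
```

Under those assumptions, bytes_requested(256, 256 * 1024) is 64MB and
bytes_requested(256, 1024 * 1024) is 256MB.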

This works great for reads up to around 256K.  Beyond that size I no 
longer get local completions and the example hangs.  Either of two 
changes makes it work reliably:

1) Increase the destination buffer size to msg_size*N while still 
issuing every read at the same base offset.  Incrementing the 
destination offset by msg_size for each read also works in this case, 
as expected.

2) Limit the posting rate of reads for larger sizes.  I already check 
that I don't exceed fi_tx_size_left(), but I have to throttle the rate 
significantly to ensure progress.  I think I had to get down to around 
32MB "in-flight" before it was stable on my system.
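
The throttling in option 2 can be sketched as a cap on the number of
outstanding reads derived from a byte budget.  This is only a sketch
under stated assumptions: max_outstanding_reads is a hypothetical
helper, not a libfabric call; the 32MB budget is the rough figure quoted
above, read here as 32 * 1024 * 1024 bytes; and the actual fi_read
posting and completion polling are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a libfabric call: how many reads of msg_size
 * bytes may be outstanding at once without exceeding byte_cap bytes in
 * flight.  The result is clamped to the range [1, n] so that at least
 * one read can always be posted and never more than the n reads the
 * benchmark issues per iteration. */
size_t max_outstanding_reads(size_t msg_size, size_t n, size_t byte_cap)
{
    size_t limit;

    assert(msg_size > 0);
    assert(n > 0);

    limit = byte_cap / msg_size;
    if (limit < 1)
        limit = 1;   /* a single read larger than the cap still goes out */
    if (limit > n)
        limit = n;   /* small reads: the cap is not the limiting factor */
    return limit;
}
```

With N=256 and a 32MB budget this gives 32 outstanding reads at 1MB and
128 at 256K; for small sizes the result is simply N.  A posting loop
would wait for completions whenever the outstanding count reaches this
limit, in addition to the fi_tx_size_left() check.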

I don't see this behavior with the psm provider, or with fi_writes. 
So the question is: is there some read limit I need to observe that I'm 
missing, or is there an issue with lots of outstanding reads on the 
sockets provider?

thanks,
- ezra