[libfabric-users] Is this a reasonable verbs;ofi_rxm warm-up time on an IB-based Cray CS system?

Wed Jun 10 10:58:17 PDT 2020

> I'm seeing what looks like a significant "warm-up" time for a concentrated-load stress
> test on an IB-based Cray CS system using the 'verbs;ofi_rxm' provider, and am wondering
> if it's expected/reasonable. The test does fi_send() operations from each of 36 threads
> on 14 nodes, with each transmitting thread having its own tx endpoint. The sends are
> all directed at a single rx endpoint on a 15th node, where they land in a multi-receive
> buffer and a single thread consumes them.  I don't normally set any FI_* environment
> variables when running the test, though see the last paragraph below for an exception.
> 
> 
> I've recently discovered that it takes a fair amount of wall time, around 4 seconds, to
> get through the first fi_send() from all the source threads. I'm assuming this is due
> to RxM dynamic endpoint connection, perhaps exacerbated by contention due to the
> concentrated load at the single target endpoint. Under the circumstances I've described
> here, is 4s reasonable for dynamically establishing 504 connections?

Someone from HPE will need to chime in here, but to me that time sounds like connection setup time, combined with maybe a 2-4 second retransmission timeout.  4 seconds to establish 504 connections seems too long, but if any packets are dropped, a retransmission timeout could explain the delay.
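
As a rough back-of-the-envelope check of those numbers, here is a small sketch. The function names are hypothetical, and it assumes the worst case where the 504 connections (14 nodes x 36 threads) are established strictly one after another, which the provider may well not do.

```python
def connection_budget(nodes, threads_per_node, warmup_s):
    """Return (connection count, average ms per connection if fully serialized)."""
    connections = nodes * threads_per_node
    per_conn_ms = warmup_s * 1000.0 / connections
    return connections, per_conn_ms


def residual_setup_s(warmup_s, timeout_s):
    """Time left for actual connection setup if one retransmission timeout fired."""
    return max(0.0, warmup_s - timeout_s)


# Figures from the report: 14 source nodes, 36 threads each, ~4 s warm-up.
connections, per_conn_ms = connection_budget(14, 36, 4.0)
print(f"{connections} connections, {per_conn_ms:.2f} ms each if serialized")

# The 2-4 s timeout range is the guess made above, not a measured value.
for timeout_s in (2.0, 4.0):
    left = residual_setup_s(4.0, timeout_s)
    print(f"timeout {timeout_s:.0f} s -> {left:.0f} s left for setup")
```

Under those assumptions, a single timeout anywhere in the 2-4 second range would account for half or more of the observed warm-up, which is why a dropped packet is a plausible explanation.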

> There may be some other overheads present as well -- as I raise the value of
> FI_OFI_RXM_MSG_RX_SIZE from its default of 128 to higher values, I eventually (at around
> 1024 or so) see the warm-up time start to become proportional to my specified
> msg_rx_size value. The implication is that some aspect of the endpoint creation is
> space-dependent.  That said, reducing msg_rx_size below the default 128 doesn't
> correspondingly reduce the time needed for the first fi_send()s below 4s, so my guess
> is that at smaller values like that, such space-related overhead, if there is any, is
> not significant.

The time to allocate the hardware resources and map them into user space depends on the queue sizes.  There's also receive buffer allocation, registration, and posting.  The time to perform those operations will likewise depend on the queue and buffer sizes.
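
To get a feel for the scale involved, here is a minimal sketch of how much receive buffer space could end up posted at the target. It is a simplified model, not a description of what RxM actually does: it assumes every connection pre-posts msg_rx_size receive buffers of a fixed size, and the 16 KiB buffer size and the function name are assumptions made for illustration.

```python
def posted_rx_bytes(connections, msg_rx_size, buffer_bytes):
    """Total bytes of receive buffers posted, one full set per connection."""
    return connections * msg_rx_size * buffer_bytes


# Assumption: 16 KiB per receive buffer; check the provider's actual setting.
ASSUMED_BUFFER_BYTES = 16 * 1024
CONNECTIONS = 14 * 36  # 504 source endpoints, from the test description

for msg_rx_size in (64, 128, 1024, 4096):
    total = posted_rx_bytes(CONNECTIONS, msg_rx_size, ASSUMED_BUFFER_BYTES)
    print(f"msg_rx_size={msg_rx_size:5d}: {total / 2**20:8.0f} MiB at the target")
```

In this model the allocation, registration, and posting work grows linearly with msg_rx_size, which would be consistent with the warm-up becoming proportional at larger values while fixed costs dominate at smaller ones.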

- Sean