[ofa-general] IPoIB-UD post_send failures (OFED 1.3)

akepner at sgi.com akepner at sgi.com
Sun May 11 03:23:45 PDT 2008


On Sun, May 11, 2008 at 11:18:19AM +0300, Eli Cohen wrote:
> ....
> The reason why the queue is stopped when there is one entry still left
> is to allow ipoib_ib_tx_timer_func() to post a special send request that
> will ensure a completion is reported for this operation thus freeing
> entries at the tx ring. I don't think the scenario you describe here can
> lead to a deadlock since if that happens, it will be released because of
> either one of the following two reasons:
> 1. If the tx queue contains more than one send WR posted by
> ipoib_cm_send() whose completion has not yet been polled, they will
> soon be polled, since they are posted to a signaled QP and will sooner
> or later generate completions and interrupts. In this case, subsequent
> calls to ipoib_send() will work as expected.
> 
> 2. If there is only one outstanding ipoib_cm_send() in the tx queue, it
> means that there are 126 outstanding ipoib_send() requests in the tx
> queue, which in turn means that a few of them are signaled and are
> expected to complete soon.
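
[For readers of the archive: the reserved-slot scheme described in the quoted text can be sketched as a toy model. This is not the IPoIB driver code. The ring size of 128 is an assumption inferred from the "126 outstanding" figure above, and every name below (toy_tx_ring, toy_post_send, toy_timer_post, toy_complete_all) is hypothetical. The model also assumes that a completion for a signaled request implies all earlier requests on the same send queue have finished.]

```c
#include <stdbool.h>

/* Assumed ring size: 1 cm send + 126 UD sends + 1 reserved slot. */
#define TOY_TX_RING_SIZE 128u

/* Hypothetical stand-in for the driver's tx ring bookkeeping. */
struct toy_tx_ring {
    unsigned int head;   /* total requests posted so far */
    unsigned int tail;   /* total requests reclaimed so far */
    bool stopped;        /* models a stopped net queue */
};

static unsigned int toy_outstanding(const struct toy_tx_ring *r)
{
    return r->head - r->tail;
}

/* Normal send path: refuses work once the queue is stopped, and stops
 * the queue while one ring entry is still free. */
static bool toy_post_send(struct toy_tx_ring *r)
{
    if (r->stopped)
        return false;
    r->head++;
    if (toy_outstanding(r) == TOY_TX_RING_SIZE - 1)
        r->stopped = true;   /* keep the last entry in reserve */
    return true;
}

/* Timer path: may use the reserved entry to post one signaled request,
 * so that a completion is guaranteed to be reported eventually. */
static bool toy_timer_post(struct toy_tx_ring *r)
{
    if (toy_outstanding(r) >= TOY_TX_RING_SIZE)
        return false;
    r->head++;
    return true;
}

/* Completion of the signaled request: under the in-order assumption,
 * every earlier entry can be reclaimed and the queue restarted. */
static void toy_complete_all(struct toy_tx_ring *r)
{
    r->tail = r->head;
    r->stopped = false;
}

/* Post normal sends until the queue stops; returns how many were accepted. */
static unsigned int toy_fill(struct toy_tx_ring *r)
{
    unsigned int n = 0;
    while (toy_post_send(r))
        n++;
    return n;
}
```

[In this model, toy_fill() on an empty ring accepts 127 sends and then stops the queue; toy_timer_post() can still use the last entry, and toy_complete_all() frees the ring. If the completion for that last signaled request never arrives, nothing restarts the queue, which matches the symptom reported in the reply below.]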

Thanks for the explanation. 

The main problem that we're seeing is that we just stop getting
completions for the send queue. (And we see this with both OFED-1.2
and 1.3, which makes me think that it's unlikely to be due to the
IPoIB driver, since the driver changed so much between those releases.)

> .....
> And last, could you arrange a remote access to a machine in this
> condition so we could check the state of the device/FW?
> 

Yes, I think so. Let me see if I can arrange that.

-- 
Arthur



