[ofa-general] IPoIB-UD post_send failures (OFED 1.3)

Jack Morgenstein jackm at dev.mellanox.co.il
Wed May 21 07:36:00 PDT 2008


Arthur,
I just checked in a fix for bugzilla 1004, which seems to be the same problem you are seeing.
(I just noticed your explanation in an earlier post in this thread:
"So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2), 
followed by a call to ipoib_send() would get to a situation where 
the queue was full, but not stopped." ).

This is correct, and it was the bug, together with a missing call to
netif_stop_queue() in ipoib_ib_tx_timer_func().
The patch compares tx_outstanding against the same threshold in every
test that decides whether to invoke netif_stop_queue(), so the kernel
cannot keep handing TX packets to IPoIB once the queue becomes too full.
(Using the same value in all tests creates a "barrier" with no holes.)
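
To make the "hole" concrete, here is a small toy model in Python. It is
not the driver code. It assumes that each send path stops the queue with
an equality test on tx_outstanding, that the UD path tests against
(ipoib_sendq_size - 1) and the CM path against ipoib_sendq_size (inferred
from the scenario quoted above, not copied from the OFED source), and
that the ring has 128 entries. The names TxRing, ud_send, cm_send and
run_scenario are made up for the illustration.

```python
SENDQ_SIZE = 128  # assumed ring size, for illustration only


class TxRing:
    """Toy model of a TX ring shared by the UD and CM send paths."""

    def __init__(self, size, ud_stop_at, cm_stop_at):
        self.size = size
        self.ud_stop_at = ud_stop_at  # threshold tested by the UD path
        self.cm_stop_at = cm_stop_at  # threshold tested by the CM path
        self.tx_outstanding = 0
        self.stopped = False          # models netif_stop_queue() having run

    def _post(self, stop_at):
        self.tx_outstanding += 1
        # Equality test: if another path already stepped past stop_at,
        # this test never fires and the queue is never stopped.
        if self.tx_outstanding == stop_at:
            self.stopped = True

    def ud_send(self):
        self._post(self.ud_stop_at)

    def cm_send(self):
        self._post(self.cm_stop_at)


def run_scenario(ud_stop_at, cm_stop_at, size=SENDQ_SIZE):
    """Start at size - 2 outstanding, then post one CM and one UD send."""
    ring = TxRing(size, ud_stop_at, cm_stop_at)
    ring.tx_outstanding = size - 2
    for post in (ring.cm_send, ring.ud_send):
        if ring.stopped:
            break  # the stack does not hand packets to a stopped queue
        post()
    return ring.tx_outstanding, ring.stopped


# Mismatched thresholds: the ring ends up full but not stopped.
buggy = run_scenario(ud_stop_at=SENDQ_SIZE - 1, cm_stop_at=SENDQ_SIZE)

# Same threshold everywhere: the CM send stops the queue with one slot left.
fixed = run_scenario(ud_stop_at=SENDQ_SIZE - 1, cm_stop_at=SENDQ_SIZE - 1)
```

With the mismatched thresholds the CM send raises the count to 127
without stopping the queue, and the following UD send raises it to 128,
which its own test for 127 never matches. With one shared threshold the
CM send itself stops the queue at 127, and the UD send is never issued.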

This patch will be part of OFED 1.3.1-rc2, and with it you should no
longer see the mthca "queue full" messages.

- Jack

P.S. This fix is not needed in the upstream kernel, since the unsignalled UD
send mechanism was never added upstream.

On Sunday 11 May 2008 13:23, akepner at sgi.com wrote:
> On Sun, May 11, 2008 at 11:18:19AM +0300, Eli Cohen wrote:
> > ....
> > The reason why the queue is stopped when there is one entry still left
> > is to allow ipoib_ib_tx_timer_func() to post a special send request that
> > will ensure a completion is reported for this operation thus freeing
> > entries at the tx ring. I don't think the scenario you describe here can
> > lead to a deadlock since if that happens, it will be released because of
> > either one of the following two reasons:
> > 1. If the tx queue contains not yet polled, more than one completion of
> > send WRs posted by ipoib_cm_send(), they will soon be polled since they
> > are posted to a signaled QP and sooner or later will generate
> > completions and interrupts. In this case, subsequent postings to
> > ipoib_send() will work as expected.
> > 
> > 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it
> > means that there are 126 outstanding ipoib_send() requests at the tx
> > queue and this means that a few of them are signaled and are expected to
> > be completed soon.
> 
> Thanks for the explanation. 
> 
> The main problem that we're seeing is that we just stop getting 
> completions for the send queue. (And we see this with OFED-1.2 
> and 1.3, which makes me think that it's unlikely to be due to the 
> IPoIB driver since that's changed so much.) 
> 
> > .....
> > And last, could you arrange a remote access to a machine in this
> > condition so we could check the state of the device/FW?
> > 
> 
> Yes, I think so. Let me see if I can arrange that.
> 


