[ofa-general] IPoIB-UD post_send failures (OFED 1.3)
Jack Morgenstein
jackm at dev.mellanox.co.il
Wed May 21 07:36:00 PDT 2008
Arthur,
I just checked in a fix for bugzilla 1004, which seems to be the same problem you are seeing.
(I just noticed your explanation in this thread in an earlier post:
"So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2),
followed by a call to ipoib_send() would get to a situation where
the queue was full, but not stopped." ).
This is correct, and this was the bug (in addition to a missing call to
netif_stop_queue() in ipoib_ib_tx_timer_func()).
The patch compares tx_outstanding against the same threshold in every
test that invokes netif_stop_queue(), so the kernel cannot keep handing
TX packets to IPoIB once the queue becomes too full.
(Using the same value in all tests creates a "barrier" with no holes.)
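
To illustrate the hole, here is a minimal user-space sketch. It is not the
actual patch or driver code. It assumes, based only on the description in this
thread, that the CM send path stopped the queue at ipoib_sendq_size and the UD
send path at ipoib_sendq_size - 1. The ring size of 128 is inferred from the
"126 outstanding" figure quoted below, and struct txq, post_send() and
replay() are hypothetical names used only for illustration.

```c
#include <stdbool.h>

#define SENDQ_SIZE 128  /* assumed ring size, inferred from the thread */

/* Hypothetical stand-in for the driver's TX accounting state. */
struct txq {
    int  tx_outstanding;  /* posted sends not yet completed */
    bool stopped;         /* models netif_stop_queue() having been called */
};

/*
 * Post one send. The queue is stopped only when the counter lands
 * exactly on stop_at. Returns false if the ring was already full,
 * which models the "queue full" post_send failure.
 */
static bool post_send(struct txq *q, int stop_at)
{
    if (q->tx_outstanding >= SENDQ_SIZE)
        return false;
    if (++q->tx_outstanding == stop_at)
        q->stopped = true;
    return true;
}

/*
 * Replay the sequence from the thread: start with SENDQ_SIZE - 2
 * outstanding sends, do one CM send, then let the stack try up to
 * three UD sends for as long as the queue is not stopped.
 * Returns the number of failed posts.
 */
static int replay(int cm_stop_at, int ud_stop_at)
{
    struct txq q = { .tx_outstanding = SENDQ_SIZE - 2, .stopped = false };
    int failures = 0;

    if (!post_send(&q, cm_stop_at))
        failures++;

    for (int i = 0; i < 3 && !q.stopped; i++)
        if (!post_send(&q, ud_stop_at))
            failures++;

    return failures;
}
```

With mismatched thresholds, replay(SENDQ_SIZE, SENDQ_SIZE - 1), the counter
steps from 126 to 127 to 128 without ever matching the threshold of the path
that incremented it, so the queue is never stopped and the next two posts
fail. With one shared threshold, replay(SENDQ_SIZE - 1, SENDQ_SIZE - 1), the
CM send stops the queue at 127 and no post fails.
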
This patch will be part of OFED 1.3.1-rc2 -- and you should see no more
mthca "queue full" messages.
- Jack
P.S. This fix is not needed in the upstream kernel, since the unsignalled UD
send mechanism was not added upstream.
On Sunday 11 May 2008 13:23, akepner at sgi.com wrote:
> On Sun, May 11, 2008 at 11:18:19AM +0300, Eli Cohen wrote:
> > ....
> > The reason why the queue is stopped when there is one entry still left
> > is to allow ipoib_ib_tx_timer_func() to post a special send request that
> > will ensure a completion is reported for this operation thus freeing
> > entries at the tx ring. I don't think the scenario you describe here can
> > lead to a deadlock since if that happens, it will be released because of
> > either one of the following two reasons:
> > 1. If the tx queue contains more than one not-yet-polled completion of
> > send WRs posted by ipoib_cm_send(), they will soon be polled, since they
> > are posted to a signaled QP and will sooner or later generate
> > completions and interrupts. In this case, subsequent calls to
> > ipoib_send() will work as expected.
> >
> > 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it
> > means that there are 126 outstanding ipoib_send() requests at the tx
> > queue and this means that a few of them are signaled and are expected to
> > be completed soon.
>
> Thanks for the explanation.
>
> The main problem that we're seeing is that we just stop getting
> completions for the send queue. (And we see this with OFED-1.2
> and 1.3, which makes me think that it's unlikely to be due to the
> IPoIB driver since that's changed so much.)
>
> > .....
> > And last, could you arrange a remote access to a machine in this
> > condition so we could check the state of the device/FW?
> >
>
> Yes, I think so. Let me see if I can arrange that.
>