[ofa-general] IPoIB post_send failed

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Wed Jul 29 11:48:19 PDT 2009


Hal Rosenstock wrote:
> Hi Pradeep,
> 
> On Wed, Jul 29, 2009 at 2:14 PM, Pradeep Satyanarayana
> <pradeeps at linux.vnet.ibm.com> wrote:
> 
>     Hal Rosenstock wrote:
>     > Hi,
>     >
>     > I'm seeing the following messages from IPoIB:
>     > ib0: post_send failed
>     > ib0: post_send failed
>     > ib0: post_send failed
>     > ib0: post_send failed
>     > ib0: post_send failed
>     > ib0: post_send failed
>     > NETDEV WATCHDOG: ib0: transmit timed out
>     > ib0: transmit timeout: latency 1374 msecs
>     > ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565
>     >
>     > What are the possible (and most likely) causes of post_send failures? I
>     > went through the code for all the errors (some at the driver level) but
>     > none popped out at me.
>     >
> 
>     Is it possible that the receiver is overwhelmed and hence the
>     tx_ring is full?
> 
>  
> It's possible but from the message you can't tell whether the tx_ring is
> full.
>  
> Does it make sense to increase the transmit ring size via
> send_queue_size mod param ?

Given that there are many concurrent clients and at least some UDP, I suspect that the
receiver is indeed overwhelmed. Rather than increasing send_queue_size on the clients,
which may make the situation worse, please consider reducing the tx_ring size on the
clients and increasing the rx_ring size on the server. This should throttle the flow
somewhat.
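
Before changing anything it helps to know what the ring sizes currently are. Below is a
minimal Python sketch that reads them back from sysfs. It assumes the ib_ipoib module
exposes its send_queue_size and recv_queue_size parameters read-only under
/sys/module/ib_ipoib/parameters, and that recv_queue_size is the knob behind the rx_ring;
please verify both on your kernel. The function name read_ring_sizes is made up for this
example.

```python
from pathlib import Path

# Assumed sysfs location of the ib_ipoib module parameters; verify on your system.
PARAM_DIR = Path("/sys/module/ib_ipoib/parameters")


def read_ring_sizes(param_dir=PARAM_DIR):
    """Return the current send/recv ring sizes, or None where unreadable."""
    sizes = {}
    for name in ("send_queue_size", "recv_queue_size"):
        try:
            sizes[name] = int((Path(param_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # Module not loaded, file missing, or unexpected content.
            sizes[name] = None
    return sizes


if __name__ == "__main__":
    for name, value in read_ring_sizes().items():
        print(f"{name}: {value if value is not None else 'unavailable'}")
```

As far as I know these values are fixed when the module is loaded, so a new size only
takes effect after reloading ib_ipoib with the changed option.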

Are you concerned about the messages themselves, or is this actually impacting the
application? You may still see the messages after the above changes (some tuning may be
needed), but hopefully with a reduced impact on the applications. I would be interested
in learning what you discover. Thanks!
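
To compare before and after, a rough measure of how full the transmit ring was can be
taken from the watchdog line quoted above. The sketch below assumes that tx_head and
tx_tail are free-running unsigned 32-bit counters, so their difference (modulo 2^32) is
the number of sends posted but not yet completed. The helper name tx_outstanding is made
up for this example.

```python
import re

# Matches the counters in a line such as:
#   ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565
WATCHDOG_RE = re.compile(r"tx_head (\d+), tx_tail (\d+)")


def tx_outstanding(line):
    """Return posted-but-uncompleted sends from a watchdog line, or None."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    head, tail = (int(group) for group in match.groups())
    # Assumption: both counters are unsigned 32-bit and may wrap around.
    return (head - tail) & 0xFFFFFFFF


line = "ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565"
print(tx_outstanding(line))  # 126
```

Comparing that number with the configured send_queue_size shows how close the ring was
to full when the watchdog fired.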

Pradeep

>  
> 
> 
>     Is this a UDP application?
> 
>  
> There is at least some UDP and there are many concurrent clients.
>  
> 
> 
> 
>     > Once the transmit queue is stopped, does the interface need to be
>     > taken down and then back up to restart this?
> 
>     One does not need to take down the interface. It should be able to
>     recover on its own. There is a timer that kicks in and checks whether
>     the tx_ring is still full; if it is not, the transmits should start
>     again. Thanks!
> 
>  
> Thanks for the help!
>  
> -- Hal
>  
> 
> 
> 
>     Pradeep
> 
> 




