[openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer

Helen Chen hycsw at ca.sandia.gov
Thu Oct 13 16:15:38 PDT 2005


Roland,

So you are right, it is not a moving target.  After repeating
the IOZONE tests several times, I narrowed the culprit down to
server on3-ib.  Parallel I/O had made it a bit difficult to
chase down :-(

BTW, the state of the IPoIB network seemed fine after the failed
test, and the mthca counters are moving up nicely.  Do you still
think this is a crash of the HCA firmware?  Should I call Mellanox?
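
In case it helps to make "moving up" concrete, here is a minimal sketch
that samples the counter files twice and lists the ones whose value
changed.  The function names and the default 5-second interval are
assumptions made for illustration; only the sysfs path comes from the
quoted message.

```sh
#!/bin/sh
# Sketch only: helper names and the default interval are hypothetical.

# Print "name value" for every regular file in the directory given as $1.
snapshot_counters() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f" 2>/dev/null)"
    done
}

# Sample the counters in directory $1 twice, $2 seconds apart (default 5),
# and print "name old -> new" for each counter whose value changed.
counters_moved() {
    dir=$1
    interval=${2:-5}
    before=$(snapshot_counters "$dir")
    sleep "$interval"
    after=$(snapshot_counters "$dir")
    printf '%s\n' "$after" | while read -r name value; do
        [ -n "$name" ] || continue
        # Look up the earlier value recorded for the same counter name.
        old=$(printf '%s\n' "$before" | awk -v n="$name" '$1 == n { print $2 }')
        if [ "$value" != "$old" ]; then
            printf '%s %s -> %s\n' "$name" "$old" "$value"
        fi
    done
}
```

It could be called as
counters_moved /sys/class/infiniband/mthca0/ports/1/counters 5
and an empty result would suggest that no counter advanced during the
interval.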

Thanks,
Helen


---------- Original Message -----------------
>From rolandd at cisco.com Thu Oct 13 15:13:16 2005
>
>    Helen> It doesn't seem like shrinking the TCP window had helped.
>    Helen> I captured the Dmesg log from Lustre server and associated
>    Helen> client reporting IOZONE error.
>
>What is the state of the system after you start seeing the ib0
>transmit time out messages?  Does IPoIB work at all?  Is the HCA
>responsive at all -- for example what do you see if you do
>
>  cat /sys/class/infiniband/mthca0/ports/1/state
>
>or
>
>  cat /sys/class/infiniband/mthca0/ports/1/counters/*
>
>    Helen> BTW, this problem is a moving target so it is hard to
>    Helen> believe that it is hardware related(?)  BTW, I am using the
>    Helen> mellanox DDR switch and HCA.
>
>Not sure what you mean by a moving target... the symptoms really look
>like a crash of the HCA firmware to me.
>
>Thanks,
>  Roland
>


