[Scst-devel] [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs

Bart Van Assche bart.vanassche at gmail.com
Thu Sep 17 03:22:11 PDT 2009


On Wed, Sep 16, 2009 at 9:41 PM, Chris Worley <worleys at gmail.com> wrote:
>
> On Wed, Sep 16, 2009 at 12:15 PM, Vladislav Bolkhovitin <vst at vlnb.net> wrote:
> > Chris Worley, on 09/16/2009 12:51 AM wrote:
> > >
> > > On Tue, Sep 15, 2009 at 11:10 AM, Vladislav Bolkhovitin <vst at vlnb.net>
> > > wrote:
> > > [ ... ]
> > > [  357.250550] ib_srpt: srpt_xmit_response: tag= 38 channel in bad state 2
> > > [  357.250553] scst: ***ERROR***: Target driver ib_srpt
> > > xmit_response() returned fatal error
> >
> > It's because srpt called scst_tgt_cmd_done() when the corresponding command
> > hasn't yet been sent to xmit_response() callback, so srpt should use another
> > function to abort commands in this state.
>
> Could this be related to the hang (i.e. the command has been aborted
> before xmit_response has been called... but w/o causing a panic)?

When analyzing such logs it's important to distinguish between cause
and consequence. What happened first is that the OFED SRP initiator
noticed that something went wrong with the IB communication, as
indicated by the log message "srp_qp_in_err_timer called". This means
that an error occurred in the IB network or in one of the two IB
stacks. As a result, the SRP initiator tried to log in again without
an intervening logout. The error messages logged by SRPT are a
consequence of that relogin. While the SRPT issue will be fixed, such
a fix won't solve the slow reads and the hang you observed.
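
To make the cause-versus-consequence ordering easier to check, here is
a minimal sketch that reports which message appeared first in a log.
It assumes dmesg-style "[seconds.micros]" prefixes like those in the
quoted SRPT output, and it assumes all lines come from one log or
have already been aligned to a common clock (initiator and target
timestamps are not directly comparable otherwise). The function names
are hypothetical and not part of SCST, SRPT or OFED.

```python
import re

# Matches the dmesg-style "[  357.250550] message" prefix (assumed format).
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def first_occurrences(lines, markers):
    """Return {marker: timestamp of the earliest line containing it}."""
    first = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # wrapped continuation lines carry no timestamp
        ts, msg = float(m.group(1)), m.group(2)
        for marker in markers:
            if marker in msg and (marker not in first or ts < first[marker]):
                first[marker] = ts
    return first


def order_by_first_seen(lines, markers):
    """Return the markers that were found, earliest first."""
    first = first_occurrences(lines, markers)
    return sorted(first, key=first.get)
```

Calling order_by_first_seen() with the markers "srp_qp_in_err_timer
called" and "channel in bad state" should list the initiator message
first if the analysis above is right. Markers that never occur are
simply left out of the result.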

Regarding the SRP communication problems you observed: since my
attempts to reproduce this issue have been unsuccessful so far, I'm
afraid these communication problems are caused by some component in
your IB network that is not working as reliably as it should.

By the way, the description of the patch that generated the message
"srp_qp_in_err_timer called" is interesting. The patch description
indicates that the condition "srp_qp_in_err_timer called" should only
happen during multipath failover. See also
http://www.mail-archive.com/ewg@lists.openfabrics.org/msg01959.html
(which is not the latest version of this patch).

Bart.


