[ofa-general] What causes "SRP abort called" error?
dillowda at ornl.gov
Tue Jan 8 19:48:48 PST 2008
On Tue, 2008-01-08 at 17:44 -0600, John Valdes wrote:
> I'm new to SRP & IB, so please bear with me...
> We have a storage server running RHEL 5.1 w/ the bundled OFED 1.2
> stack directly attached to an IB port on a DDN 9550. It's been running
> OK for about a week, but today we're getting a continuous stream of
> SRP abort errors:
> # tail /var/log/messages
> Jan 8 17:00:59 server kernel: SRP abort called
> Jan 8 17:01:59 server kernel: SRP abort called
srp_abort(), aka scsi_host_template->eh_abort_handler()
This tries to abort a single command.
> Jan 8 17:02:04 server kernel: SRP reset_device called
srp_reset_device(), aka scsi_host_template->eh_device_reset_handler()
This tries to reset a LUN.
> Jan 8 17:02:09 server kernel: ib_srp: SRP reset_host called
srp_reset_host(), aka scsi_host_template->eh_host_reset_handler()
This tries to reset the connection to the DDN.
> Jan 8 17:02:11 server kernel: ib_srp: connection closed
Caused by srp_reset_host()
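To see the escalation at a glance, the quoted log lines can be put on a
timeline. This is only a minimal Python sketch, assuming the syslog
format shown in the quoted lines; the stage labels and the function name
escalation_timeline are made up for illustration and are not anything
ib_srp provides.

```python
import re
from datetime import datetime

# Message text from the quoted log, mapped to an error-handler stage.
# The stage labels are illustrative names, not kernel terminology.
STAGES = [
    ("SRP abort called", "abort"),
    ("SRP reset_device called", "device reset"),
    ("SRP reset_host called", "host reset"),
    ("connection closed", "connection closed"),
]

# "Jan 8 17:00:59 server kernel: <message>"
LINE_RE = re.compile(
    r"^(\w{3})\s+(\d+)\s+(\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+(.*)$"
)

def escalation_timeline(lines, year=2008):
    """Return (seconds since previous event, stage) per recognised line."""
    events = []
    prev = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        month, day, clock, msg = m.groups()
        stage = next((name for text, name in STAGES if text in msg), None)
        if stage is None:
            continue
        # syslog timestamps carry no year, so one has to be supplied.
        ts = datetime.strptime(f"{year} {month} {day} {clock}",
                               "%Y %b %d %H:%M:%S")
        delta = 0 if prev is None else int((ts - prev).total_seconds())
        events.append((delta, stage))
        prev = ts
    return events

log = """\
Jan 8 17:00:59 server kernel: SRP abort called
Jan 8 17:01:59 server kernel: SRP abort called
Jan 8 17:02:04 server kernel: SRP reset_device called
Jan 8 17:02:09 server kernel: ib_srp: SRP reset_host called
Jan 8 17:02:11 server kernel: ib_srp: connection closed
"""

for delta, stage in escalation_timeline(log.splitlines()):
    print(f"+{delta:>3}s  {stage}")
```

On the quoted log this gives a 60 second gap between the two aborts,
then 5, 5 and 2 seconds for the device reset, host reset and connection
close.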
> How can I determine the cause of the aborts?
The aborts happen when a command times out in the SCSI mid-layer and
its error handling takes over. The escalation is described in
Documentation/scsi/scsi_eh.txt and related files in the kernel tree.
You can turn up the SCSI logging facilities to track down the command
that is dying, but expect that to be _very_ noisy on a busy system.
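The logging level is a bitmask written to
/proc/sys/dev/scsi/logging_level, with three bits per facility. The
sketch below computes a value for it. It assumes a kernel built with
CONFIG_SCSI_LOGGING and uses the bit offsets from
drivers/scsi/scsi_logging.h; the helper name logging_mask and the choice
of level 3 are mine, purely for illustration.

```python
# Bit offset of each facility inside the logging_level word (3 bits
# each), as defined in the kernel's drivers/scsi/scsi_logging.h.
SHIFTS = {
    "error": 0,
    "timeout": 3,
    "scan": 6,
    "mlqueue": 9,
    "mlcomplete": 12,
    "llqueue": 15,
    "llcomplete": 18,
    "hlqueue": 21,
    "hlcomplete": 24,
    "ioctl": 27,
}

def logging_mask(levels):
    """Combine per-facility levels (0..7) into one logging_level value."""
    mask = 0
    for facility, level in levels.items():
        if not 0 <= level <= 7:
            raise ValueError(f"level for {facility} must be 0..7")
        mask |= level << SHIFTS[facility]
    return mask

# Error-handling and timeout messages only; level 3 is an arbitrary pick.
mask = logging_mask({"error": 3, "timeout": 3})
print(mask)  # 27
```

As root, writing that number to /proc/sys/dev/scsi/logging_level turns
the logging on, and writing 0 turns it off again. Limiting it to the
error and timeout facilities should keep the noise down compared with
enabling everything.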
I've often seen this during the initial bus scan when adding a target to
SRP, and I've seen it happen under heavy load once -- maybe more, but I
saw it today for sure. I haven't dug into it yet, as I've been tracking
other things. I am curious, though, what command could be getting stuck
for long enough for the mid-layer to time it out -- I think the default
timeout for the sd driver is 60 seconds, and the INQUIRY timeout is 5
seconds. I just cannot account for what could be taking that long.
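The per-device command timeout can be checked rather than guessed: each
SCSI disk exposes it, in seconds, in the sysfs file
/sys/block/sdX/device/timeout. A minimal Python sketch that lists it for
every sd device follows; the function name read_sd_timeouts is
hypothetical, and the sys_block parameter exists only so the function
can be pointed at a different directory.

```python
import glob
import os

def read_sd_timeouts(sys_block="/sys/block"):
    """Return {disk name: command timeout in seconds} for sd* devices."""
    timeouts = {}
    pattern = os.path.join(sys_block, "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sys_block>/sda/device/timeout
        disk = path.split(os.sep)[-3]
        try:
            with open(path) as f:
                timeouts[disk] = int(f.read().strip())
        except (OSError, ValueError):
            # Device went away or the file held something unexpected.
            continue
    return timeouts

for disk, seconds in read_sd_timeouts().items():
    print(f"{disk}: {seconds}s")
```

The same file is writable by root, so the timeout can be raised
temporarily to see whether the stuck command ever completes.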
Also, given how quickly it is progressing through the various error
handlers in SRP, I wonder if we're failing something in there.
Do your targets come back after this? During the scans, mine do, but
today's failure under load effectively left the target dead. Rebooting
the server brought it back.