[ofa-general] What causes "SRP abort called" error?
John Valdes
valdes at anl.gov
Tue Jan 8 20:45:03 PST 2008
On Tue, Jan 08, 2008 at 10:48:48PM -0500, David Dillow wrote:
>
> The aborts are caused by a command timing out in the SCSI mid-layer and
> its error handling taking over -- more details about the escalation are
> in Documentation/scsi_eh.txt and assorted files.
Thanks for decoding those errors/warnings.
> You can turn up the SCSI logging facilities to track down the command
> that is dying, but expect that to be _very_ noisy on a busy system.
From coincident errors on the DDN, it looks like these are SCSI Write
commands (2A) that are failing.
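For reference, here is a minimal sketch of the two details touched on above: packing a value for the SCSI logging facility, and naming a command from its opcode byte. The helper names (`scsi_logging_level`, `opcode_name`) are hypothetical. The bit offsets are an assumption based on the `SCSI_LOG_*_SHIFT` constants in the kernel's drivers/scsi/scsi_logging.h, so check them against the kernel in use. The opcode table covers only a few common commands.

```python
# Each logging facility gets a 3-bit level (0-7) at a fixed bit offset.
# Offsets are assumed from SCSI_LOG_*_SHIFT in drivers/scsi/scsi_logging.h;
# verify them against your kernel source before relying on them.
SCSI_LOG_SHIFTS = {
    "error": 0,
    "timeout": 3,
    "scan": 6,
    "mlqueue": 9,
    "mlcomplete": 12,
    "llqueue": 15,
    "llcomplete": 18,
    "hlqueue": 21,
    "hlcomplete": 24,
    "ioctl": 27,
}

# A few well-known SCSI opcodes; 0x2A is the WRITE(10) seen on the DDN.
SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
}


def scsi_logging_level(**levels):
    """Pack per-facility levels (0-7) into a single logging_level integer."""
    value = 0
    for name, level in levels.items():
        if name not in SCSI_LOG_SHIFTS:
            raise ValueError("unknown facility: %s" % name)
        if not 0 <= level <= 7:
            raise ValueError("level for %s must be 0-7" % name)
        value |= level << SCSI_LOG_SHIFTS[name]
    return value


def opcode_name(opcode):
    """Return a readable name for a SCSI opcode byte, if it is in the table."""
    return SCSI_OPCODES.get(opcode, "unknown (0x%02X)" % opcode)


if __name__ == "__main__":
    # Error handling and timeout facilities only, to limit the noise.
    print(scsi_logging_level(error=3, timeout=3))
    print(opcode_name(0x2A))
```

On kernels built with CONFIG_SCSI_LOGGING, the resulting integer can be written to /proc/sys/dev/scsi/logging_level as root. Enabling only the error and timeout facilities should be much quieter than turning everything on.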
> I've often seen this during the initial bus scan when adding a target to
> SRP, and I've seen it happen under heavy load once -- maybe more, but I
> saw it today for sure.
In our case, I'm pretty sure it is heavy load. Well, I didn't see
what was going on at the time this started, but the targets (LUNs)
were already mounted, and we've been seeing heavy load on the DDN
recently.
> I am curious, though, what command could be getting stuck
> for long enough for the mid-layer to time it out -- I think the default
> timeout for the sd driver is 60 seconds, and the INQUIRY timeout is 5
> seconds. I just cannot account for what could be taking that long.
I'm curious too as to why WRITEs are taking so long. :) I think we're
overloading the DDN, but it could be something else going on. This is
a freshly installed configuration (only about a week old), with 6 GFS
file servers reading and writing to ~6 shared LUNs on the DDN over
IB/SRP (which in turn are shared off the servers via NFS to a ~350
node HPC cluster). We've been running an identical setup for a few
years with another DDN, but over FC. I think we still have a few
things to tune/optimize for IB.
That said, after talking with DDN support, it's looking like something
got wedged on the DDN which was causing the timeouts.
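Rather than relying on a remembered default for the command timeout discussed above, the value the mid-layer actually applies to each disk can be read from sysfs. This is a minimal sketch, assuming the per-device `timeout` attribute (in seconds) is exposed under /sys/block/sdX/device/. The function name `read_scsi_timeouts` and its `sysfs_root` parameter are hypothetical and exist only to keep the example testable.

```python
import glob
import os


def read_scsi_timeouts(sysfs_root="/sys"):
    """Return {disk name: command timeout in seconds} for every sd* device.

    Assumes the usual sysfs layout: <root>/block/sdX/device/timeout.
    """
    timeouts = {}
    pattern = os.path.join(sysfs_root, "block", "sd*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # Path ends in .../block/<disk>/device/timeout, so the disk name
        # is the third component from the end.
        disk = path.split(os.sep)[-3]
        with open(path) as f:
            timeouts[disk] = int(f.read().strip())
    return timeouts


if __name__ == "__main__":
    for disk, seconds in read_scsi_timeouts().items():
        print("%s: %d s" % (disk, seconds))
```

Comparing these values across the six file servers would show whether they all time out commands after the same interval.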
> Do your targets come back after this? During the scans, mine do, but
> today's under load effectively left the target dead. Rebooting the
> server brought it back.
Yes, after unwedging the DDN, the targets were fully accessible on the
server again.
Thanks for the reply.
John