[Scst-devel] [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs

Chris Worley worleys at gmail.com
Thu Sep 3 10:38:34 PDT 2009


On Thu, Sep 3, 2009 at 5:32 AM, Vladislav Bolkhovitin<vst at vlnb.net> wrote:
> Chris Worley, on 09/03/2009 08:08 AM wrote:
>>
>> On Wed, Sep 2, 2009 at 2:58 PM, Chris Worley<worleys at gmail.com> wrote:
>>>
>>> On Wed, Sep 2, 2009 at 2:00 PM, Bart Van Assche<bart.vanassche at gmail.com>
>>> wrote:
>>>>
>>>> On Wed, Sep 2, 2009 at 9:53 PM, Chris Worley<worleys at gmail.com> wrote:
>>>>>
>>>>> On Wed, Sep 2, 2009 at 1:31 PM, Bart Van
>>>>> Assche<bart.vanassche at gmail.com> wrote:
>>>>>>
>>>>>> On Tue, Sep 1, 2009 at 1:04 AM, Chris Worley<worleys at gmail.com> wrote:
>>>>>>>
>>>>>>> [ ... ]
>>>>>>> I've found a good kernel/scst mix to easily repeat this; I can get it
>>>>>>> to repeatedly hang w/ 8K block transfers running Ubuntu 9.04 w/ the
>>>>>>> 2.6.27-14-server kernel on _both_ target and initiator (i.e. no WinOF
>>>>>>> or OFED at all) and SCST rev 1062 on the target using one drive
>>>>>>> (performance is >600MB/s, >80K IOPS, on the 8KB block sizes being
>>>>>>> used).
>>>>>>> [ ... ]
>>>>>>
>>>>>> Is there a special reason why you are using the 2.6.27-14-server
>>>>>> kernel ? AFAIK the latest Ubuntu 9.04 kernel is 2.6.28-15-server.
>>>>>
>>>>> No special reason other than it didn't get upgraded w/ the rest of the
>>>>> distro... started w/ 8.10.
>>>
>>> I'm upgrading too, to 9.04.
>>
>> I tried the 2.6.28-15-server kernel (along w/ the 9.04 upgrade), and
>> it does repeat the issue.
>>
>> In trying to build a kernel w/ lockdep support as Vlad requested, my
>> lack of Debian knowledge shone through, and, although I believe I
>> followed all the instructions correctly, I'm not sure if I have a
>> 2.6.28-15 or 2.6.28-10 kernel.  Anyway, the issue is still repeatable.
>>
>> Whatever kernel that is, I have SRP hung currently.  What should I
>> look for in /proc/lockd*?
>>
>> I don't think it's a kernel lock... I think it's a protocol lock, as I
>> can rmmod the target kernel modules (scst_vdisk, scst, and ib_srpt)
>> when the initiator gets in this state.
>
> Since you can rmmod SCST modules, then this shouldn't be SCST or backstorage
> SW/HW issue, because that means there are no stuck or lost SCSI commands.

At least on the target side.  The initiator could still think there are
outstanding commands when they were actually lost on the target (or
the target completed them and the initiator, in error, never saw the
completions).
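
One way to test that theory on a Linux initiator is to look at the
in_flight counter in /sys/block/<dev>/stat while the hang is in
progress.  The following is only a minimal sketch: the field order is
taken from the kernel's Documentation/block/stat.txt, the helper names
are made up for illustration, and "sde" is just the device name from
the log further down.  A nonzero count on the initiator while the
target is idle would be consistent with completions going missing, but
it would not say where they were lost.

```python
# Field order follows the kernel's Documentation/block/stat.txt.
STAT_FIELDS = (
    "read_ios", "read_merges", "read_sectors", "read_ticks",
    "write_ios", "write_merges", "write_sectors", "write_ticks",
    "in_flight", "io_ticks", "time_in_queue",
)


def parse_block_stat(text):
    """Map the first 11 whitespace-separated counters to their names."""
    values = [int(v) for v in text.split()[:len(STAT_FIELDS)]]
    if len(values) != len(STAT_FIELDS):
        raise ValueError("expected at least 11 fields, got %d" % len(values))
    return dict(zip(STAT_FIELDS, values))


def read_in_flight(dev):
    """Return how many requests the block layer still counts as in flight."""
    with open("/sys/block/%s/stat" % dev) as f:
        return parse_block_stat(f.read())["in_flight"]
```

Calling read_in_flight("sde") a few seconds apart during a hang should
show whether the count is stuck or still draining.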

> So, it should be issue of either SRP target/initiator, or OFED on the target
> or initiator, or your IB hardware on any node.

I've used a couple of initiators (different systems) w/ different
OSes, w/ different IB cards (all QDR) and different IB stacks
(built-in vs. OFED) and can repeat the problem in every combination
except RHEL5.2/OFED 1.4.1 on both target and initiator (but, if the
initiator is WinOF and the target is RHEL5.2/OFED 1.4.1, then the
problem does repeat).

>
> You should enable lockdep on both target and initiator (better with other
> kernel debug facilities enabled, see the attached file as a sample) and
> reproduce the issue.

That's done and reported in another response; it doesn't seem to be a
lock issue.
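
Since I wasn't sure which kernel build I ended up booting, a quick
check that the running kernel's config really has the lock debugging
options turned on may be worthwhile.  This is a sketch under
assumptions: the option list below is a guess at the usual lockdep
related settings, not the contents of the attached sample file, and
the helper names are hypothetical.

```python
# Options assumed to matter for lock debugging. The list actually
# suggested in the attached sample config is not shown in this thread,
# so treat this tuple as a guess.
WANTED = (
    "CONFIG_PROVE_LOCKING",
    "CONFIG_DEBUG_LOCK_ALLOC",
    "CONFIG_LOCKDEP",
    "CONFIG_DEBUG_SPINLOCK",
    "CONFIG_DEBUG_MUTEXES",
)


def parse_kconfig(text):
    """Return {option: value} for every NAME=value line in a kernel .config."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skips blanks and "# CONFIG_FOO is not set"
        name, sep, value = line.partition("=")
        if sep:
            opts[name] = value
    return opts


def missing_options(text, wanted=WANTED):
    """List the wanted options that are not built in (=y)."""
    opts = parse_kconfig(text)
    return [name for name in wanted if opts.get(name) != "y"]
```

On Ubuntu the config of the running kernel is normally installed as
/boot/config-<release>, where <release> is the output of uname -r, so
feeding that file's contents to missing_options() should return an
empty list if everything above is enabled.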

> There is a big chance that those facilities will spot
> what's going on wrong there.

I applied the .config changes you suggested, and the kernel was
certainly more verbose, but I don't think it added any information.  When
the drives are attached over SRP, I see the following messages:

[  454.317328] sd 4:0:0:3: [sde] Attached SCSI disk
[  454.317340] kobject: 'scsi_device' (ffff8804234a3aa0): kobject_add_internal: parent: '4:0:0:3', set: '<NULL>'
[  454.317350] kobject: '4:0:0:3' (ffff880423cd2780): kobject_add_internal: parent: 'scsi_device', set: 'devices'
[  454.317378] kobject: '4:0:0:3' (ffff880423cd2780): kobject_uevent_env
[  454.317390] kobject: '4:0:0:3' (ffff880423cd2780): fill_kobj_path: path = '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host4/target4:0:0/4:0:0:3/scsi_device/4:0:0:3'
[  454.317437] kobject: 'scsi_generic' (ffff8804234a3c38): kobject_add_internal: parent: '4:0:0:3', set: '<NULL>'
[  454.317447] kobject: 'sg5' (ffff88042ac4ecb8): kobject_add_internal: parent: 'scsi_generic', set: 'devices'
[  454.317489] kobject: 'sg5' (ffff88042ac4ecb8): kobject_uevent_env
[  454.317500] kobject: 'sg5' (ffff88042ac4ecb8): fill_kobj_path: path = '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host4/target4:0:0/4:0:0:3/scsi_generic/sg5'
[  454.317523] sd 4:0:0:3: Attached scsi generic sg5 type 0
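
The kobject lines look like the output of kobject debugging
(CONFIG_DEBUG_KOBJECT) rather than anything SRP specific, so they can
probably be filtered out when scanning the log for real errors.  A
minimal sketch, assuming the log is the unwrapped dmesg text with one
message per line; the function name is made up for illustration:

```python
import re

# Matches an optional "[  454.317340] " timestamp followed by "kobject:".
KOBJECT_LINE = re.compile(r"^(\[\s*\d+\.\d+\]\s+)?kobject:")


def drop_kobject_noise(log_text):
    """Return the log with every kobject debug message removed."""
    kept = [line for line in log_text.splitlines()
            if not KOBJECT_LINE.match(line)]
    return "\n".join(kept)
```

Applied to the messages above, only the two "sd 4:0:0:3" lines would
remain.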

Is there somewhere else to look for problems?

Thanks,

Chris
>
> Vlad
>
>> Thanks,
>>
>> Chris
>>>
>>> Chris
>>>>>
>>>>> Do you think that kernel is better?
>>>>
>>>> I noticed this while trying to reproduce this issue. I have no opinion
>>>> yet about which of these two kernels is better. I'll downgrade the
>>>> Ubuntu kernel in my setup.
>>>>
>>>> Bart.
>>>>
>>
>>
>
>


