[Scst-devel] [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs

Chris Worley worleys at gmail.com
Thu Sep 3 16:20:42 PDT 2009


On Thu, Sep 3, 2009 at 11:38 AM, Chris Worley<worleys at gmail.com> wrote:
> On Thu, Sep 3, 2009 at 5:32 AM, Vladislav Bolkhovitin<vst at vlnb.net> wrote:
>> Chris Worley, on 09/03/2009 08:08 AM wrote:
>>>
>>> On Wed, Sep 2, 2009 at 2:58 PM, Chris Worley<worleys at gmail.com> wrote:
>>>>
>>>> On Wed, Sep 2, 2009 at 2:00 PM, Bart Van Assche<bart.vanassche at gmail.com>
>>>> wrote:
>>>>>
>>>>> On Wed, Sep 2, 2009 at 9:53 PM, Chris Worley<worleys at gmail.com> wrote:
>>>>>>
>>>>>> On Wed, Sep 2, 2009 at 1:31 PM, Bart Van
>>>>>> Assche<bart.vanassche at gmail.com> wrote:
>>>>>>>
>>>>>>> On Tue, Sep 1, 2009 at 1:04 AM, Chris Worley<worleys at gmail.com> wrote:
>>>>>>>>
>>>>>>>> [ ... ]
>>>>>>>> I've found a good kernel/scst mix to easily repeat this; I can get it
>>>>>>>> to repeatedly hang w/ 8K block transfers running Ubuntu 9.04 w/ the
>>>>>>>> 2.6.27-14-server kernel on _both_ target and initiator (i.e. no WinOF
>>>>>>>> or OFED at all) and SCST rev 1062 on the target using one drive
>>>>>>>> (performance is >600MB/s, >80K IOPS, on the 8KB block sizes being
>>>>>>>> used).
>>>>>>>> [ ... ]
>>>>>>>
>>>>>>> Is there a special reason why you are using the 2.6.27-14-server
>>>>>>> kernel? AFAIK the latest Ubuntu 9.04 kernel is 2.6.28-15-server.
>>>>>>
>>>>>> No special reason other than it didn't get upgraded w/ the rest of the
>>>>>> distro... started w/ 8.10.
>>>>
>>>> I'm upgrading too, to 9.04.
>>>
>>> I tried the 2.6.28-15-server kernel (along w/ the 9.04 upgrade), and
>>> it does repeat the issue.
>>>
>>> In trying to build a kernel w/ lockdep support as Vlad requested, my
>>> lack of Debian knowledge shone through, and, although I believe I
>>> followed all the instructions correctly, I'm not sure if I have a
>>> 2.6.28-15 or 2.6.28-10 kernel.  Anyway, the issue is still repeatable.
>>>
>>> Whatever kernel that is, I have SRP hung currently.  What should I
>>> look for in /proc/lockd*?
>>>
>>> I don't think it's a kernel lock... I think it's a protocol lock, as I
>>> can rmmod the target kernel modules (scst_vdisk, scst, and ib_srpt)
>>> when the initiator gets in this state.
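
Side note, partially answering my own question: with lockdep compiled
in, the files appear to be /proc/lockdep, /proc/lockdep_stats, and
/proc/lockdep_chains, and an actual deadlock report lands in dmesg
rather than in /proc.  A quick dump along these lines (untested sketch)
should show the state:

    import glob

    # With lockdep compiled in, the kernel exposes /proc/lockdep,
    # /proc/lockdep_stats and /proc/lockdep_chains; a real deadlock
    # report is printed to the kernel log, not here.
    for path in sorted(glob.glob("/proc/lockdep*")):
        print("=== %s ===" % path)
        try:
            print(open(path).read())
        except IOError:
            print("  (unreadable)")
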
>>
>> Since you can rmmod the SCST modules, this shouldn't be an SCST or
>> backstorage SW/HW issue, because that means there are no stuck or
>> lost SCSI commands.
>
> At least on the target side.  The initiator could still think there are
> outstanding commands when they were actually lost on the target (or
> when the target completed them and the initiator erroneously believes
> they are still outstanding).
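
A quick way to see what the initiator still thinks is in flight: field 9
of /sys/block/<dev>/stat is the in-flight I/O count as the block layer
sees it.  Rough sketch (disk names passed on the command line):

    import sys

    # Field 9 of /sys/block/<dev>/stat is the number of I/Os the
    # initiator's block layer currently considers in flight; if a
    # command was lost, this stays nonzero even with the load stopped.
    for dev in sys.argv[1:]:            # e.g. sdb sdc sdd sde
        fields = open("/sys/block/%s/stat" % dev).read().split()
        print("%s in-flight: %s" % (dev, fields[8]))
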
>
>> So, it should be an issue of either the SRP target/initiator, OFED on
>> the target or initiator, or your IB hardware on any node.
>
> I've used a couple of initiators (different systems) w/ different
> OSes, different IB cards (all QDR), and different IB stacks (built-in
> vs. OFED), and can repeat the problem in every combination except a
> RHEL5.2/OFED 1.4.1 target paired w/ a RHEL5.2/OFED 1.4.1 initiator
> (but if the initiator is WinOF and the target is RHEL5.2/OFED 1.4.1,
> the problem does repeat).

Here's a twist: I used the Ubuntu initiator w/ one of the RHEL
targets, and the RHEL initiator (the same machine that was running
WinOF at the beginning of this thread) w/ one of the Ubuntu targets.
In both cases, the problem does not repeat.

That makes it sound like OFED is the cure on either side of the
connection, but it does not explain the issue w/ WinOF (which does fail
w/ either Ubuntu or RHEL targets).
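
For anyone trying to reproduce this: the load that triggers it for me
is 8K block transfers at high queue depth.  A bare-bones single-threaded
sketch of that kind of load (random reads here; the device name is an
example, and real runs use many parallel copies to get near the ~80K
IOPS where it wedges):

    import mmap, os, random

    # Random 8 KiB O_DIRECT reads against the SRP-attached disk; 8 KiB
    # is the block size that triggers the hang.  Run multiple copies in
    # parallel to drive the queue depth up; Ctrl-C to stop.
    DEV, BS = "/dev/sdb", 8192          # example device name
    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, BS)             # page-aligned, as O_DIRECT needs
    while True:
        os.lseek(fd, random.randrange(size // BS) * BS, os.SEEK_SET)
        os.readv(fd, [buf])
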

Chris
>
>>
>> You should enable lockdep on both target and initiator (ideally with
>> other kernel debug facilities enabled; see the attached file for a
>> sample) and reproduce the issue.
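
In case the attachment doesn't make it through the list, here's a quick
check that the debug options actually ended up in the running kernel.
The option list below is my guess at the usual suspects, not necessarily
what Vlad's sample sets:

    import os

    # Lock-debugging options typically enabled for this kind of hunt
    # (guessed; adjust to whatever the sample .config actually sets).
    wanted = ["CONFIG_PROVE_LOCKING", "CONFIG_DEBUG_SPINLOCK",
              "CONFIG_DEBUG_MUTEXES", "CONFIG_DEBUG_SPINLOCK_SLEEP"]
    text = open("/boot/config-" + os.uname()[2]).read()
    for opt in wanted:
        print("%s: %s" % (opt, "on" if opt + "=y" in text else "off/missing"))
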
>
> That's done and reported in another response; it doesn't seem to be a
> lock issue.
>
>> There is a big chance that those facilities will spot what's going
>> wrong there.
>
> I applied the .config changes you suggested, and the kernel was
> certainly more verbose, but I don't think it added any information.  When
> the drives are attached over SRP, I see the following message:
>
> [  454.317328] sd 4:0:0:3: [sde] Attached SCSI disk
> [  454.317340] kobject: 'scsi_device' (ffff8804234a3aa0): kobject_add_internal: parent: '4:0:0:3', set: '<NULL>'
> [  454.317350] kobject: '4:0:0:3' (ffff880423cd2780): kobject_add_internal: parent: 'scsi_device', set: 'devices'
> [  454.317378] kobject: '4:0:0:3' (ffff880423cd2780): kobject_uevent_env
> [  454.317390] kobject: '4:0:0:3' (ffff880423cd2780): fill_kobj_path: path = '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host4/target4:0:0/4:0:0:3/scsi_device/4:0:0:3'
> [  454.317437] kobject: 'scsi_generic' (ffff8804234a3c38): kobject_add_internal: parent: '4:0:0:3', set: '<NULL>'
> [  454.317447] kobject: 'sg5' (ffff88042ac4ecb8): kobject_add_internal: parent: 'scsi_generic', set: 'devices'
> [  454.317489] kobject: 'sg5' (ffff88042ac4ecb8): kobject_uevent_env
> [  454.317500] kobject: 'sg5' (ffff88042ac4ecb8): fill_kobj_path: path = '/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host4/target4:0:0/4:0:0:3/scsi_generic/sg5'
> [  454.317523] sd 4:0:0:3: Attached scsi generic sg5 type 0
>
> Is there somewhere else to look for problems?
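
Another place worth checking might be the IB port counters on both
nodes; a fabric-level problem should show up as nonzero error counters
there.  Sketch (standard sysfs paths):

    import glob

    # Print every nonzero per-port IB counter; the interesting ones for
    # a wedged link are symbol_error, port_rcv_errors and
    # port_xmit_discards (the data counters are naturally nonzero).
    for ctr in sorted(glob.glob("/sys/class/infiniband/*/ports/*/counters/*")):
        val = open(ctr).read().strip()
        if val != "0":
            print("%s = %s" % (ctr, val))
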
>
> Thanks,
>
> Chris
>>
>> Vlad
>>
>>> Thanks,
>>>
>>> Chris
>>>>
>>>> Chris
>>>>>>
>>>>>> Do you think that kernel is better?
>>>>>
>>>>> I noticed this while trying to reproduce this issue. I have no opinion
>>>>> yet about which of these two kernels is better. I'll downgrade the
>>>>> Ubuntu kernel in my setup.
>>>>>
>>>>> Bart.
>>>>>
>>>
>>>
>>
>>
>


