[Scst-devel] [ofa-general] WinOF_2_0_5/SRP initiator: slow reads and eventually hangs

Chris Worley worleys at gmail.com
Wed Sep 16 12:41:20 PDT 2009


On Wed, Sep 16, 2009 at 12:15 PM, Vladislav Bolkhovitin <vst at vlnb.net> wrote:
> Chris Worley, on 09/16/2009 12:51 AM wrote:
>>
>> On Tue, Sep 15, 2009 at 11:10 AM, Vladislav Bolkhovitin <vst at vlnb.net>
>> wrote:
>>>
>>> Chris Worley, on 09/15/2009 09:01 PM wrote:
>>>>
>>>> On Tue, Sep 15, 2009 at 10:57 AM, Vladislav Bolkhovitin <vst at vlnb.net>
>>>> wrote:
>>>>>
>>>>> Chris Worley, on 09/15/2009 08:53 PM wrote:
>>>>>>
>>>>>> On Tue, Sep 15, 2009 at 10:43 AM, Vladislav Bolkhovitin <vst at vlnb.net>
>>>>>> wrote:
>>>>>>>
>>>>>>> Chris Worley, on 09/15/2009 07:50 PM wrote:
>>>>>>>>
>>>>>>>> On Tue, Sep 15, 2009 at 12:10 AM, Bart Van Assche
>>>>>>>> <bart.vanassche at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Sep 15, 2009 at 1:03 AM, Chris Worley <worleys at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 14, 2009 at 12:51 PM, Vladislav Bolkhovitin
>>>>>>>>>> <vst at vlnb.net>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Chris Worley, on 09/11/2009 11:50 PM wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I've definitely removed the switch/firmware from being the
>>>>>>>>>>>> cause.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm thinking the reason you can't repeat the test may be latency
>>>>>>>>>>>> related.  We get ~50usecs average latency (on small block
>>>>>>>>>>>> sizes),
>>>>>>>>>>>> which can't be achieved using regular SSD's (and rotating drives
>>>>>>>>>>>> are
>>>>>>>>>>>> nowhere close).  Maybe a ramdisk would help repeat the issue.
>>>>>>>>>>>
>>>>>>>>>>> I think you should try to reproduce the problem with ramdisk or
>>>>>>>>>>> nullio.
>>>>>>>>>>> By
>>>>>>>>>>> so you will eliminate possible influence of the SSD backend.
>>>>>>>>>>
>>>>>>>>>> W/ 12GB RAM in the target, I created a 7GB ramdisk:
>>>>>>>>>>
>>>>>>>>>> mount -t ramfs -o size=7g ramfs /mnt/
>>>>>>>>>> dd if=/dev/zero of=/mnt/foo bs=1024k count=7000
>>>>>>>>>> echo "open ramdisk /mnt/foo" > /proc/scsi_tgt/vdisk/vdisk
>>>>>>>>>> echo "add ramdisk 2" >/proc/scsi_tgt/groups/Default/devices
>>>>>>>>>>
>>>>>>>>>> Then, on the initiator, I tested it... and it hung during
>>>>>>>>>> sequential
>>>>>>>>>> 8KB block reads:
>>>>>>>>>>
>>>>>>>>>> fio --rw=read --bs=8k --numjobs=64 --iodepth=64 --sync=0
>>>>>>>>>> --direct=1
>>>>>>>>>> --randrepeat=0 \
>>>>>>>>>>  --group_reporting --ioengine=libaio --filename=/dev/sde
>>>>>>>>>> --name=test
>>>>>>>>>> --loops=10000 --runtime=600
>>>>>>>>>>
>>>>>>>>>> Note that I was running the SM on the target this time too.
>>>>>>>>>
>>>>>>>>> Which Linux distro was installed on the inititiator and on the
>>>>>>>>> target
>>>>>>>>> ? And if applicable, which OFED version ? Which kernel messages
>>>>>>>>> were
>>>>>>>>> logged by SRPT around the time the issue occurred (after having
>>>>>>>>> enabled SRPT logging first) ?
>>>>>>>>
>>>>>>>> As logging hadn't helped this issue previously, I've not been
>>>>>>>> enabling
>>>>>>>> it.  That plus the kernel hacks needed to invoke logging, it's not
>>>>>>>> worth enabling.
>>>>>>>>
>>>>>>>> This was with Ubuntu 8.10, built-in IB on the 2.6.27-14-server
>>>>>>>> kernel.
>>>>>>>>
>>>>>>>> I couldn't get ramdisks working w/ SCST in RHEL5.2.  When running:
>>>>>>>>
>>>>>>>> echo "open ramdisk /mnt/foo" > /proc/scsi_tgt/vdisk/vdisk
>>>>>>>>
>>>>>>>> I get the error:
>>>>>>>>
>>>>>>>> dev_vdisk: ***ERROR***: Wrong f_op or FS doesn't have required
>>>>>>>> capabilities
>>>>>>>>
>>>>>>>> ... which doesn't occur in the Ubuntu kernel, so I've been unable to
>>>>>>>> test RHEL kernels w/ ramdisks.  In general, this problem occurs w/
>>>>>>>> 8KB
>>>>>>>> and smaller blocks w/ the Ubuntu kernels, and 2KB and smaller blocks
>>>>>>>> w/ RHEL kernels.
>>>>>>>
>>>>>>> Use ramfs instead.
>>>>>>
>>>>>> Do you mean:
>>>>>>
>>>>>> mount -t ramfs -o size=7g ramfs /mnt/
>>>>>
>>>>> You should then create a file on it and use it.
>>>>
>>>> That's what I'm doing, I believe.  From above:
>>>>
>>>>>>>>>> mount -t ramfs -o size=7g ramfs /mnt/
>>>>>>>>>> dd if=/dev/zero of=/mnt/foo bs=1024k count=7000
>>>>>>>>>> echo "open ramdisk /mnt/foo" > /proc/scsi_tgt/vdisk/vdisk
>>>>>>>>>> echo "add ramdisk 2" >/proc/scsi_tgt/groups/Default/devices
>>>>
>>>> ... but the "open", on RHEL5.2 kernel 2.6.18-92.el5, generates the
>>>> following kernel messages:
>>>>
>>>> dev_vdisk: Registering virtual FILEIO device ramdisk
>>>> scst: Processing thread started, PID 9629
>>>> scst: Processing thread started, PID 9630
>>>> scst: Processing thread started, PID 9631
>>>> scst: Processing thread started, PID 9632
>>>> scst: Processing thread started, PID 9633
>>>> dev_vdisk: ***ERROR***: Wrong f_op or FS doesn't have required
>>>> capabilities
>>>> scst: ***ERROR***: New device handler's vdisk attach() failed: -22
>>>> scst: Processing thread PID 9629 finished
>>>> scst: Processing thread PID 9630 finished
>>>> scst: Processing thread PID 9631 finished
>>>> scst: Processing thread PID 9632 finished
>>>> scst: Processing thread PID 9633 finished
>>>> scst: Failed to attach to virtual device ramdisk
>>>>
>>>> Chris
>>>>>>
>>>>>> ?
>>>>>>
>>>>>> That's what I'm doing.
>>>
>>> That's strange. I'm doing it all the time, although with not so old
>>> kernels
>>> as 2.6.18.
>>
>> In lots of testing today, I've seen this panic twice on the Ubuntu 8.10
>> targets:
>>
>> [  330.155992] ib_srpt: disconnected session
>> 0x00247100000000460024710000000046 because a new SRP_LOGIN_REQ has
>> been received.
>> [  357.207046] ib_srpt: srpt_xmit_response: tag= 17 channel in bad state 2
>> [  357.207052] ib_srpt: disconnected session
>> 0x00247100000000460024710000000046 because a new SRP_LOGIN_REQ has
>> been received.
>> [  357.207100] ib_srpt: srpt_xmit_response: tag= 47 channel in bad state 2
>> [  357.207104] scst: ***ERROR***: Target driver ib_srpt
>> xmit_response() returned fatal error
>> [  357.241429] scst: ***ERROR***: Target driver ib_srpt
>> xmit_response() returned fatal error
>> [  357.250234] ------------[ cut here ]------------
>> [  357.250537] ib_srpt: srpt_xmit_response: tag= 26 channel in bad state 2
>> [  357.250539] scst: ***ERROR***: Target driver ib_srpt
>> xmit_response() returned fatal error
>> [  357.250550] ib_srpt: srpt_xmit_response: tag= 38 channel in bad state 2
>> [  357.250553] scst: ***ERROR***: Target driver ib_srpt
>> xmit_response() returned fatal error
>> [  357.250560] ib_srpt: srpt_xmit_response: tag= 27 channel in bad state 2
>> <repeated many times>
>> [  357.301253] kernel BUG at /root/scst/scst/src/scst_targ.c:3089!
>> [  357.301253] invalid opcode: 0000 [1] SMP
>> [  357.301253] CPU 0
>> ...
>> [  357.301253] RIP: 0010:[<ffffffffa04759f6>]  [<ffffffffa04759f6>]
>> scst_tgt_cmd_done+0x26/0x30 [scst]
>> [  357.301253] RSP: 0018:ffff88039ad27b50  EFLAGS: 00010297
>> [  357.301253] RAX: 0000000000000200 RBX: ffff8803ad9c68f8 RCX:
>> 0000000000000000
>> [  357.301253] RDX: 00000000ffffffff RSI: 0000000000000000 RDI:
>> ffff8803ad9c68f8
>> [  357.301253] RBP: ffff88039ad27b50 R08: 0000000000000000 R09:
>> 0000000000000000
>> [  357.301253] R10: ffff88039ad277c0 R11: ffff88041ad278cf R12:
>> ffff8803c2972180
>> [  357.301253] R13: ffff88039ada0000 R14: 0000000000000001 R15:
>> ffff8803fb00c2b0
>> [  357.301253] FS:  0000000000000000(0000) GS:ffffffff807dd000(0000)
>> knlGS:0000000000000000
>> [  357.301253] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> [  357.301253] CR2: 00007f9281e64000 CR3: 0000000000201000 CR4:
>> 00000000000006e0
>> [  357.301253] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [  357.301253] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> [  357.301253] Process ib_cm/0 (pid: 8299, threadinfo
>> ffff88039ad26000, task ffff88039ad40000)
>> [  357.301253] Stack:  ffff88039ad27b80 ffffffffa04c0c47
>> ffff88039a8db900 ffff8803c2972180
>> [  357.301253]  ffff8803fb00c240 ffff8803fb00c284 ffff88039ad27bc0
>> ffffffffa04c0d93
>> [  357.301253]  ffff88042a4959c0 ffff88042a9d7800 ffff88042544da00
>> ffff88042a9d7898
>> [  357.301253] Call Trace:
>> [  357.301253]  [<ffffffffa04c0c47>] srpt_abort_scst_cmd+0xd7/0x160
>> [ib_srpt]
>> [  357.301253]  [<ffffffffa04c0d93>] srpt_release_channel+0xc3/0x190
>> [ib_srpt]
>> [  357.301253]  [<ffffffffa04c0e82>]
>> srpt_find_and_release_channel+0x22/0x30 [ib_srpt]
>> [  357.301253]  [<ffffffffa04c227d>] srpt_cm_handler+0x6d/0xbb8 [ib_srpt]
>
> It's because srpt called scst_tgt_cmd_done() when the corresponding command
> hasn't yet been sent to xmit_response() callback, so srpt should use another
> function to abort commands in this state.

Could this be related to the hang (i.e. the command has been aborted
before xmit_response has been called... but w/o causing a panic)?

Thanks,

Chris
>
> Vlad
>
>



More information about the general mailing list