From dillowda at ornl.gov  Fri Aug 20 07:15:40 2010
From: dillowda at ornl.gov (David Dillow)
Date: Fri, 20 Aug 2010 10:15:40 -0400
Subject: [ofa-general] srp sg_tablesize
In-Reply-To: <201008200949.54595.bs_lists@aakef.fastmail.fm>
References: <201008200949.54595.bs_lists@aakef.fastmail.fm>
Message-ID: <1282313740.7441.25.camel@lap75545.ornl.gov>

On Fri, 2010-08-20 at 09:49 +0200, Bernd Schubert wrote:
> In ib_srp.c sg_tablesize is defined as 255. With that value we see
> lots of IO requests of size 1020 KB. As I already wrote on linux-scsi,
> that is really sub-optimal for DDN storage.
>
> Now the question is if we can safely increase it. Is there a
> definition somewhere of the real hardware-supported size? And
> shouldn't we not only increase sg_tablesize, but also set the
> .dma_boundary value?

Currently, we limit sg_tablesize to 255 because we can only cache 255
indirect memory descriptors in the SRP_CMD message to the target.
That's due to the count being in an 8-bit field.

It does not have to be this way -- the spec defines that the indirect
descriptors in the message are just a cache, and the target should RDMA
any additional descriptors from the initiator and then process those as
well. So we could easily take it higher, up to the size of a contiguous
allocation (or bigger, using FMR). However, to my knowledge, no vendor
implements this support.

We could make more descriptors fit in the SRP_CMD by using FMR to make
them virtually contiguous. The initiator currently tries to allocate
512-byte pages, but I think it ends up using 4 KB pages, as I don't
think any HCA supports a smaller FMR page. That's OK -- I'm pretty sure
the mid-layer isn't going to pass down an SG list of 512-byte sectors,
it would be in pages, but it's something I'd have to check to be sure.
You could get a ~255 MB request using this method, assuming you didn't
run out of FMR entries (that request would need up to 65,280 4 KB
entries -- 256 per descriptor).

The problem with using FMR in this manner is the failure cases. We have
no way to tell the SCSI mid-layer that it needs to split the request
up, and even if we could, there may be certain commands that must not
be split. We could return BUSY if we fail to allocate an FMR entry, but
then we have no guarantee of forward progress. That should be a rare
case, but it's not something we want in a storage system. So we would
still want to be able to fall back to the RDMA of indirect descriptors,
even if it is very rarely used.

If you can get Cedric to add it to the target, I'll commit to writing
the initiator part. We'd love to have it, as would many of your other
customers.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

From dillowda at ornl.gov  Sat Aug 21 09:27:23 2010
From: dillowda at ornl.gov (David Dillow)
Date: Sat, 21 Aug 2010 12:27:23 -0400
Subject: [ofa-general] srp sg_tablesize
In-Reply-To:
References: <201008200949.54595.bs_lists@aakef.fastmail.fm>
Message-ID: <1282408043.20840.13.camel@obelisk.thedillows.org>

On Sat, 2010-08-21 at 13:14 +0200, Bart Van Assche wrote:
> On Fri, Aug 20, 2010 at 9:49 AM, Bernd Schubert
> wrote:
> >
> > In ib_srp.c sg_tablesize is defined as 255. With that value we see
> > lots of IO requests of size 1020 KB. As I already wrote on
> > linux-scsi, that is really sub-optimal for DDN storage.
> >
> > Now the question is if we can safely increase it. Is there a
> > definition somewhere of the real hardware-supported size? And
> > shouldn't we not only increase sg_tablesize, but also set the
> > .dma_boundary value?
>
> (resending as plain text)
>
> The request size of 1020 indicates that there are fewer than 60 data
> buffer descriptors in the SRP_CMD request. So you are probably hitting
> a limit other than srp_sg_tablesize.

4 KB * 255 descriptors = 1020 KB. IIRC, we verified that we were seeing
255 entries in the S/G list with a few printk()s, but it has been a few
years. I'm not sure how you came up with 60 descriptors -- could you
elaborate, please?

> Did this occur with buffered (asynchronous) or unbuffered (direct)
> I/O? And in the first case, which I/O scheduler did you use?

I'm sure Bernd will speak for his situation, but we've seen it with
both buffered and unbuffered IO, with the deadline and noop schedulers
(mostly on vendor 2.6.18 kernels). CFQ never gave us requests larger
than 512 KB. Our main use is Lustre, which does unbuffered IO from the
kernel.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

From dillowda at ornl.gov  Tue Aug 24 13:23:16 2010
From: dillowda at ornl.gov (David Dillow)
Date: Tue, 24 Aug 2010 16:23:16 -0400
Subject: [ofa-general] srp sg_tablesize
In-Reply-To: <201008242147.50692.bs_lists@aakef.fastmail.fm>
References: <201008200949.54595.bs_lists@aakef.fastmail.fm>
	<1282313740.7441.25.camel@lap75545.ornl.gov>
	<201008242147.50692.bs_lists@aakef.fastmail.fm>
Message-ID: <1282681396.2425.10.camel@lap75545.ornl.gov>

On Tue, 2010-08-24 at 15:47 -0400, Bernd Schubert wrote:
> On Friday, August 20, 2010, David Dillow wrote:
> > Currently, we limit sg_tablesize to 255 because we can only cache
> > 255 indirect memory descriptors in the SRP_CMD message to the
> > target. That's due to the count being in an 8-bit field.
>
> I think the magic is in srp_map_data(), but I do not find any 8-bit
> field there?

The SRP_CMD message is described in the SRP spec, and also by struct
srp_cmd in include/scsi/srp.h. The fields in question are
data_{in,out}_desc_cnt.
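From memory, the relevant part of that struct looks something like the
following -- abridged and typed from memory, so check include/scsi/srp.h
in your tree for the authoritative layout:

	struct srp_cmd {
		u8	opcode;
		u8	sol_not;
		u8	reserved1[3];
		u8	buf_fmt;
		u8	data_out_desc_cnt;	/* one byte -> max 255 */
		u8	data_in_desc_cnt;	/* one byte -> max 255 */
		u64	tag;
		/* ... LUN, task attribute, CDB, and the data buffer
		 * descriptors follow ... */
	};

Both counts are a single byte, which is where the 255 limit on
sg_tablesize comes from.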
> While looking through the code, I also think I found a bug:
>
> In srp_map_data()
>
> count = ib_dma_map_sg()
>
> Now if something fails, count may become zero, and that is not handled
> at all.

Yes, I think you are correct. I don't think it is possible to hit on
any system arch that one would use IB on, but I'll add it to the list
of things I need to fix.
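The fix should be no more than an early bail-out right after the
mapping call -- something like this, untested and with the variable
names from memory, so treat it as a sketch rather than a patch:

	count = ib_dma_map_sg(ibdev, scat, nents,
			      scmnd->sc_data_direction);
	if (unlikely(count == 0))
		return -EIO;	/* DMA mapping failed; bail before we
				 * build a descriptor list from nothing */

That way the command errors out cleanly instead of us sending a bogus
descriptor list to the target.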
> > It does not have to be this way -- the spec defines that the
> > indirect descriptors in the message are just a cache, and the target
> > should RDMA any additional descriptors from the initiator and then
> > process those as well. So we could easily take it higher, up to the
> > size of a contiguous allocation (or bigger, using FMR). However, to
> > my knowledge, no vendor implements this support.
>
> I have no idea if DDN supports it or not, but I'm sure I could figure
> it out.

You don't; trust me on this. :)

> > We could make more descriptors fit in the SRP_CMD by using FMR to
> > make them virtually contiguous. The initiator currently tries to
> > allocate 512-byte pages, but I think it ends up using 4 KB pages, as
> > I don't think any HCA supports a smaller FMR page. That's OK -- I'm
> > pretty sure the mid-layer isn't going to pass down an SG list of
> > 512-byte sectors, it would be in pages, but it's something I'd have
> > to check to be sure. You could get a ~255 MB request using this
> > method, assuming you didn't run out of FMR entries (that request
> > would need up to 65,280 4 KB entries -- 256 per descriptor).
>
> Hmm, there is already srp_map_fmr(), and if that fails it already uses
> an indirect mapping? Or do I completely miss something?

Yes, that tries to use FMR to map the pages, and we fall back to
indirect mappings if that fails. We could use FMR to reduce the number
of S/G entries, but we would still need a fallback before we could tell
the SCSI mid-layer that we can handle more than 255 entries.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office