[ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

Jim Mott jimmott at austin.rr.com
Wed Sep 26 14:58:53 PDT 2007


This problem comes about because ib_query_device() has only one
field (max_sge) to return all types of SGE maximums.  This value
must work for receive WQEs, send WQEs, and all the permutations
of QP type and hardware.

A minimal API change that could help would be to add two new fields
to the ib_device_attr structure returned by ib_query_device:
  - delta_sge_sg
  - delta_sge_rd

The behavior would be that using max_sge for the send or receive
SGE count in create_qp always succeeds.  That means the value the
drivers currently return there would have to be reduced to fix
this bug.  All existing code would continue to run.

If an application wanted to make better use of hardware that supports
asymmetric SGE counts, it could add the appropriate delta_sge_xx
value to max_sge and get a more useful value.

It looks like there is some movement in this direction already
with the fields:
  - max_sge_rd (nes, amso1100, ehca, cxgb3 only)
  - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only)

If we do add any new fields to deal with this problem, we should
probably make sure all the drivers support them.  I guess that
portable applications check max_sge_rd and max_srq_sge for zero
and use max_sge if they are?
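
A rough sketch of the check guessed at above.  It assumes that a
driver which does not support one of the newer fields leaves it at
zero, which is an assumption and not something confirmed for every
driver.  The struct example_attr and the helper names are hypothetical
stand-ins that mirror only the three ib_device_attr fields named here:

```c
#include <assert.h>

/* Hypothetical stand-in for the three ib_device_attr fields discussed. */
struct example_attr {
	int max_sge;
	int max_sge_rd;
	int max_srq_sge;
};

/* Assumption: a driver that does not fill in a specific limit leaves it 0. */
static int example_limit_or_default(int specific, int max_sge)
{
	assert(specific >= 0 && max_sge >= 0);
	return specific ? specific : max_sge;
}

/* SGE limit to use for RDMA reads, falling back to max_sge. */
static int example_rdma_read_sge_limit(const struct example_attr *attr)
{
	return example_limit_or_default(attr->max_sge_rd, attr->max_sge);
}

/* SGE limit to use for SRQ receives, falling back to max_sge. */
static int example_srq_sge_limit(const struct example_attr *attr)
{
	return example_limit_or_default(attr->max_srq_sge, attr->max_sge);
}
```

With max_sge=32, max_sge_rd=0 and max_srq_sge=16, the helpers would
return 32 for RDMA reads and 16 for SRQ receives.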

To fully solve the problem and let applications make 
optimal use of hardware, we probably need a new function 
that takes the create_qp parameters along with a list of
OPCODEs to be used (or excluded?) on this QP and returns 
the actual send and receive SGE maximums.

================================

The issue with the "shrinking WQE" (sorry) is best
shown by example.  The MLX4 supports a send WQE that
is 1008 bytes long, unless you are doing RDMA_READ,
in which case you can only use 512-byte send WQEs.
A receive WQE can be 512 bytes maximum.

Ignore the non-power-of-2 size stuff and just
assume that all WQEs are fixed size power-of-2
with maximums of 1024 or 512.  This is 63 or 32
segments.  One segment for ctrl means that we 
get max_sge_rq of 31 and a matrix for max_sge_sq:

RDMA_READ : 30 (raddr)
RDMA_WRITE: 61 (raddr)
SEND-RC   : 62 
SEND-UD   : 59 (AV, AV, dest)
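
The matrix above can be reproduced with a little arithmetic.  This is
only a sketch: it takes the 63-segment and 32-segment budgets and the
per-opcode overhead counts directly from the text, and the enum and
function names are made up for illustration (they are not part of any
verbs API).  The last helper hints at what a query function that
takes a list of opcodes might return:

```c
#include <assert.h>

/* Hypothetical opcode classes, matching the rows of the matrix. */
enum example_op { EX_RDMA_READ, EX_RDMA_WRITE, EX_SEND_RC, EX_SEND_UD };

/* Segment budget per send WQE, using the 63/32 figures from the text. */
static int example_budget_segs(enum example_op op)
{
	return op == EX_RDMA_READ ? 32 : 63;
}

/* Overhead segments: one ctrl segment plus the per-opcode extras. */
static int example_overhead_segs(enum example_op op)
{
	switch (op) {
	case EX_RDMA_READ:
	case EX_RDMA_WRITE:
		return 1 + 1;	/* ctrl + raddr */
	case EX_SEND_UD:
		return 1 + 3;	/* ctrl + AV, AV, dest */
	case EX_SEND_RC:
	default:
		return 1;	/* ctrl only */
	}
}

/* Data segments left over for one opcode class. */
static int example_max_send_sge(enum example_op op)
{
	int left = example_budget_segs(op) - example_overhead_segs(op);

	assert(left > 0);
	return left;
}

/* Receive side: 32 segments minus one, as in the text. */
static int example_max_recv_sge(void)
{
	return 32 - 1;
}

/* Smallest send limit across the opcodes the caller plans to post. */
static int example_min_send_sge(const enum example_op *ops, int n)
{
	int i, min;

	assert(n > 0);
	min = example_max_send_sge(ops[0]);
	for (i = 1; i < n; i++) {
		int cur = example_max_send_sge(ops[i]);

		if (cur < min)
			min = cur;
	}
	return min;
}
```

For a QP that will post both RC sends and RDMA reads, the combined
limit under these assumptions would be the RDMA_READ row.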

The problem with:
  if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)

is that since max_sq_desc_sz is 1008 instead of 1024, we are forced
to use a wqe_shift of 9 instead of 10.  That means that even
though the hardware supports an RC send with 62 SGEs, the most
we can actually ask for is 31.
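
To make the arithmetic concrete, here is a small sketch that finds
the largest shift passing a check of that form.  The helper names are
hypothetical, and the 16-byte segment size and single ctrl segment
are assumptions taken from the example above, not code from the
driver:

```c
#include <assert.h>

/* Largest shift for which (1 << shift) <= max_desc_sz. */
static int example_max_wqe_shift(int max_desc_sz)
{
	int shift = 0;

	assert(max_desc_sz >= 1);
	while (shift < 30 && (1 << (shift + 1)) <= max_desc_sz)
		shift++;
	return shift;
}

/* SGEs left in an RC send WQE after one 16-byte ctrl segment. */
static int example_rc_send_sge(int wqe_shift)
{
	assert(wqe_shift >= 5 && wqe_shift <= 30);
	return (1 << wqe_shift) / 16 - 1;
}
```

With a limit of 1008 the shift comes out as 9, and a 512-byte WQE
leaves room for 31 SGEs, matching the numbers above.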

================================

All this brings us back to the original bug.

ib_query_device() returned max_sge=32, so we use it in max_send_sge 
when we create a QP.  

In mlx4/qp.c, we verify max_send_sge <= max_sq_sg (62; 1008-16)
in a sanity check at entry to set_kernel_sq_size().  This passes.

Then we calculate the size of the WQE based on the QP type:
  cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg) +
  send_wqe_overhead(type);
The send_wqe_overhead(RC) function returns 3 segments:
  - ctrl + atomic + raddr
So we get a WQE size of 560 bytes (32 SGEs + 3 overhead
segments, 16 bytes each).  Rounded up to the next power of 2
that is 1024 bytes, and this fails the power-of-2 test because
1024 is greater than 1008.
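
The calculation above, written out as a sketch.  It assumes every
segment (data, ctrl, atomic, raddr) is 16 bytes and that the size is
rounded up to a power of 2 before the comparison.  The names are
hypothetical and this is not the code in mlx4/qp.c:

```c
#include <assert.h>

#define EXAMPLE_SEG_SIZE 16	/* assumed size of every WQE segment */

/* Round up to the next power of two (v >= 1). */
static int example_roundup_pow2(int v)
{
	int p = 1;

	assert(v >= 1 && v <= (1 << 30));
	while (p < v)
		p <<= 1;
	return p;
}

/* Bytes needed for an RC send WQE: SGEs plus ctrl + atomic + raddr. */
static int example_rc_wqe_bytes(int max_send_sge)
{
	return (max_send_sge + 3) * EXAMPLE_SEG_SIZE;
}

/* Nonzero if the rounded-up WQE fits within max_sq_desc_sz. */
static int example_wqe_fits(int max_send_sge, int max_sq_desc_sz)
{
	int bytes = example_roundup_pow2(example_rc_wqe_bytes(max_send_sge));

	return bytes <= max_sq_desc_sz;
}
```

Under these assumptions, max_send_sge=32 does not fit in 1008 bytes,
while max_send_sge=29 does, since 29 SGEs plus 3 overhead segments
is exactly 512 bytes.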

Sorry for all the words.

-----Original Message-----
From: Roland Dreier [mailto:rdreier at cisco.com] 
Sent: Wednesday, September 26, 2007 3:03 PM
To: Jim Mott
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

 > 1) ib_create_qp() fails with max_sge 
 >   If you use ib_query_device() to return the device specific 
 > attribute max_sge, it seems reasonable to expect you can create
 > a QP with max_send_sge=max_sge.  The problem is that this often
 > fails.
 > 
 >   The reason is that depending on the QP type (RC, UD, etc.) and
 > how the QP will be used (send, RDMA, atomic, etc.), there can be
 > extra segments required in the WQE that eat up SGE entries.  So
 > while some send WQE might have max_sge available SGEs, many will
 > not.

 >   This issue may need API extensions to definitively resolve.  In
 > the short term, it would be very nice if max_sge reported by 
 > ib_query_device() could always return a value that ib_create_qp()
 > could use.  Think of it as the minimum max_send_sge value that
 > will work for all QP types.

The intention is that any attempt to create a QP with the maximum
number of S/G entries as reported by query device should succeed.
However, as you note there may be issues that make this fail, but I
would consider them as bugs to be fixed.

You mention API extensions to handle this -- do you have any concrete
ideas?  In the past we've talked a little about this, but I don't
think anyone has suggested any changes that would help matters while
still keeping the API no more complex than it already is.

 >   The recent patch to support shrinking WQEs introduces a 
 > behavior that creates a big difference between the mlx4 
 > supported send SGEs (checked against 61, should be 59 or 60,
 > and reported in ib_query_device as 32 to equal receive side
 > max_rq_sg value).  

I'm not sure I understand this.  What's the new behavior?

Are you trying to take advantage of the fact that using non-power-of-2
size send WQEs would let you have a send queue with more than 32 S/G
entries?  I think doing that actually would require a change in the
API to allow different values for max_sge_rq and max_sge_sq to be
reported from ib_query_device().  Which in turn would break the
userspace ABI, etc, etc. and leaves me wondering if it's really worth it.

(BTW I hate the "shrinking WQE" terminology for this, although
obviously you weren't the one to introduce it)

 - R.



