[ofa-general] [PATCH 1/4] ib/ipoib: specify TClass and FlowLabel with PR queries for QoS support

Tue Aug 7 19:05:26 PDT 2007

On Tue, Aug 07, 2007 at 05:09:53PM -0700, Sean Hefty wrote:
> >Well, the MTU isn't explicitly carried in the headers, but if you send
> >a 2K packet into a path that only supports 1K MTU then it will be
> >discarded. In that sense the MTU is included in the headers.
> 
> This is what I meant by the MTU is implied by the other fields.  I was thinking
> about it this way.  If a PR query contains the SLID, DLID, SL, I would expect
> the SA to look up the MTU for this path and return it.  Is there any advantage or
> reason to include the MTU in such a query?

Er, IPoIB does not do that though?

It should create a PR with DGID, SGID, TClass, Pkey, etc. based on the
IP L2 information from the ARP/ND packet. The result of that query is
then used to wrap IP datagrams in UD datagrams. Thus it must ask the
SA for a path with a minimum MTU large enough to carry the largest IP
datagram.
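
As a rough illustration of the sizing involved, here is a minimal sketch
that picks the smallest IB MTU able to carry a given IP MTU over UD. It
assumes the 4-byte IPoIB encapsulation header from RFC 4391 and the
standard IB MTU encodings (1 = 256 through 5 = 4096). The helper names
are hypothetical and do not come from the ipoib driver.

```c
/* IB MTU encodings from the InfiniBand spec (1 = 256 ... 5 = 4096). */
enum ib_mtu_code {
    MTU_256  = 1,
    MTU_512  = 2,
    MTU_1024 = 3,
    MTU_2048 = 4,
    MTU_4096 = 5
};

/* IPoIB encapsulation header length in bytes, per RFC 4391. */
#define IPOIB_ENCAP_LEN 4

/* Convert an MTU encoding to bytes; returns -1 for an invalid code. */
static int ib_mtu_code_to_bytes(int code)
{
    if (code < MTU_256 || code > MTU_4096)
        return -1;
    return 128 << code;
}

/*
 * Hypothetical helper: smallest IB MTU encoding whose payload can hold
 * an IP datagram of ip_mtu bytes plus the IPoIB header.
 * Returns -1 if no IB MTU is large enough.
 */
static int min_ib_mtu_for_ip_mtu(int ip_mtu)
{
    int code;

    if (ip_mtu <= 0)
        return -1;
    for (code = MTU_256; code <= MTU_4096; code++) {
        if (ib_mtu_code_to_bytes(code) >= ip_mtu + IPOIB_ENCAP_LEN)
            return code;
    }
    return -1;
}
```

With these assumptions, an interface MTU of 2044 needs a 2048-byte path,
and anything above 4092 cannot be carried in a single UD datagram.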

> Taking this across subnets, if a PR query contains the SGID, DGID, TC, and FL,
> does this change whether the MTU should be specified?  I'm not trying to argue
> against including the MTU, but I don't know if IPoIB or the SA should specify
> it.  (And I'm neither an IPoIB nor SA expert.)

Ah, well, MTU is really used as both something you request and
something the SA returns. In the case of datagram communication, the
MTU is very important since datagram fragmentation is impossible. In
those cases the end points must ask for paths that meet their MTU
requirements (there may be switching paths that do not, and without
guidance the SA is free to return anything).

For RC, MTU is something that should not generally be requested, and
the value returned by the SA should be used to configure the
connection. This gives the SA freedom to return paths across multiple
switching paths. This works only because the RC message size is
not limited by the connection MTU.
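
To make the UD versus RC distinction concrete, here is a minimal sketch
of how a requester might decide whether to constrain MTU in a PR query.
The struct and function are hypothetical simplifications, not the
kernel's path record type. The selector values follow the PathRecord
MtuSelector encoding (0 = greater than, 1 = less than, 2 = exactly,
3 = largest available). Since "greater than" is strict, asking for "at
least N" is expressed here as "greater than the next smaller MTU".

```c
#include <stdbool.h>

enum transport { TRANSPORT_UD, TRANSPORT_RC };

/* MtuSelector encodings from the PathRecord definition. */
enum mtu_selector {
    MTU_SEL_GT      = 0,
    MTU_SEL_LT      = 1,
    MTU_SEL_EQ      = 2,
    MTU_SEL_LARGEST = 3
};

/* Hypothetical, simplified view of the MTU part of a PR query. */
struct pr_query_mtu {
    bool constrain_mtu; /* false: leave MTU out of the component mask */
    int  mtu_selector;  /* one of enum mtu_selector */
    int  mtu_code;      /* IB MTU encoding, 1 = 256 ... 5 = 4096 */
};

/*
 * UD: request a path whose MTU is at least min_mtu_code, because
 * datagrams cannot be fragmented.
 * RC: leave MTU unconstrained and configure the connection from
 * whatever the SA returns.
 */
static struct pr_query_mtu mtu_constraint_for(enum transport t,
                                              int min_mtu_code)
{
    struct pr_query_mtu q = { false, 0, 0 };

    /* Code 1 (256) is the smallest MTU, so every path satisfies it. */
    if (t == TRANSPORT_UD && min_mtu_code > 1) {
        q.constrain_mtu = true;
        q.mtu_selector  = MTU_SEL_GT;
        q.mtu_code      = min_mtu_code - 1;
    }
    return q;
}
```

For a UD requester needing 2048 (code 4), this yields "greater than
1024"; for RC it yields no MTU constraint at all.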

> >The MTU of every unicast path used must be greater than the Linux
> >interface MTU so that the stack produces correctly sized fragments.
> 
> As you mentioned, this doesn't hold for IPoIB-CM.  The MTU of the path is less
> than the MTU sent by the stack, and the path MTUs could differ, including being
> less than the broadcast MTU.  (I don't know that the implementation supports
> different path MTUs, but it could in theory.)

Right, RC is handled differently than UD/UC when talking about MTU.

> Couldn't IPoIB fragment the packets if it needed?  (Not sure what that would do
> to the performance.)  How does IPoIB-CM handle the case where the device MTU is,
> say, 64k, but the remote side only supports UD?

There is no provision for fragment identification and reassembly in
the IPoIB RFC.

AFAIK, when using the 64K MTU setting for IPoIB, if the remote side
doesn't support RC then things go wonky. For TCP things *might* be saved
by path MTU discovery, but PMTU discovery is driven by ICMP errors,
which are not generated by an IB network. I suspect that if you use
IPoIB in a mixed configuration like that you are going to want to have
routing table entries that override the MTU for non-RC capable
destinations. But I haven't tried this.

Jason