[ofa-general] Manipulating Credits in Infiniband
Nifty Tom Mitchell
niftyompi at niftyegg.com
Sun Aug 16 21:44:37 PDT 2009
On Thu, Aug 13, 2009 at 02:41:37AM -0400, Ashwath Narasimhan wrote:
>
> Dear Tom/all
>
> I understand the end-to-end credit-based flow control at the link
> layer, where a 32-bit flow control packet is sent for each VL
> (with FCCL and FCTBS fields), but I fail to understand where this
> scheme is implemented in the driver (OFED Linux 1.4 stack, hw-mthca).
> I can see a file with a credit table mapped to different credit
> counts and another that computes the AETH based on this credit table.
>
> 1. Is this the place where the flow control packets are formulated?
If you do some back-of-the-envelope computations you will find that
much of the low-level flow control must be done in a firmware/hardware
state machine. The maximum interrupt service rate and the maximum IB
packet rate are not even close, so you will not find it in the driver.
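To make that concrete, here is a tiny back-of-the-envelope program.
Every number in it is my own illustrative assumption (a 4x DDR link, a
2 KB MTU, and a generous 100k interrupts/s service budget), not a
measurement from any particular system:

    /* Back-of-the-envelope: full-MTU packet rate vs. a host interrupt budget.
     * Every number below is an illustrative assumption, not a measurement. */
    #include <stdio.h>

    int main(void)
    {
        const double link_data_bps = 16e9;   /* 4x DDR: ~16 Gbit/s of data after 8b/10b */
        const double mtu_bytes     = 2048;   /* a common IB MTU */
        const double irq_per_sec   = 100e3;  /* generous guess at sustained interrupts/s */

        double pkts_per_sec = link_data_bps / 8.0 / mtu_bytes;   /* ~1 million */

        printf("full-MTU packets/s   : %.0f\n", pkts_per_sec);
        printf("interrupt budget /s  : %.0f\n", irq_per_sec);
        printf("packets per interrupt: %.0f\n", pkts_per_sec / irq_per_sec);
        return 0;
    }

With smaller packets, and with per-VL flow control packets on top of
the data, the gap only gets wider, which is why that state machine
lives below the driver.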
So as you scan the Mellanox driver you will discover a hand-off from
the driver to the firmware. In some cases the driver will initialize
the link layer and you will see this. You might see it in the error
recovery/reset part of the driver, but for hw-mthca I think it is well
hidden in firmware. Error recovery is one place to look because it may
need to restore the credit balance so data can flow again. Without
credits, data does not flow.
> 2. If yes, I don't see them computing this for each VL. Why? If no, is
> it a mid-layer flow control?
VLs are interesting: the IB specification is full of "may", "might",
"optional", "future", etc., and as such most hardware does the minimum
with VLs. This is changing.
One valuable thing to research is the other credit-based link-level
interfaces on common modern hardware; AMD, for example, uses credits on
its HyperTransport links. See also ATM, Fibre Channel, PCIe...
Also identify management packets and the reliable and unreliable transports.
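To get a feel for what that per-VL state machine has to track, here is
a rough sketch of the transmit-side credit check as I read the link
layer flow control chapter of the IB spec. The struct and helper below
are hypothetical (real HCAs keep this state in hardware), so check the
spec text before relying on the exact rule:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* The 32-bit flow control packet carries Op (4 bits), FCTBS (12 bits),
     * VL (4 bits) and FCCL (12 bits); the counters count 64-byte blocks
     * modulo 4096.  This sketch only models the sender's transmit check. */

    #define FC_MOD 4096u

    struct vl_credit_state {
        uint16_t fctbs;  /* total 64-byte blocks sent on this VL (mod 4096) */
        uint16_t fccl;   /* last credit limit advertised by the far end */
    };

    /* May a packet occupying 'blocks' 64-byte blocks be sent on this VL? */
    static bool vl_can_send(const struct vl_credit_state *vl, uint16_t blocks)
    {
        uint16_t after_send = (uint16_t)((vl->fctbs + blocks) % FC_MOD);
        uint16_t slack      = (uint16_t)((vl->fccl + FC_MOD - after_send) % FC_MOD);
        return slack <= 2048u;   /* spec-style wraparound comparison */
    }

    int main(void)
    {
        /* Toy state: the far end has advertised 32 blocks (2 KB) of headroom. */
        struct vl_credit_state vl = { .fctbs = 100, .fccl = 132 };
        printf("send 2 KB packet: %d\n", vl_can_send(&vl, 32));  /* 1: fits */
        printf("send 4 KB packet: %d\n", vl_can_send(&vl, 64));  /* 0: blocked */
        return 0;
    }

Multiply that check by the number of data VLs and by the packet rate
above and it is clear why the driver only configures this machinery
rather than running it.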
> 3. And that's why I have this basic question --> is the link layer
> implemented as part of the OFED stack at all? Or does it go into the
> HCA hardware as firmware? As I understand it, the hardware vendor only
> provides verbs to communicate with the HCA.
Link layer is 99 and 44/100% hardware.
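That split shows up in user space too: the verbs interface will tell
you what the HCA and the link support, but it gives you no handle on
the link layer credit machinery itself. A minimal libibverbs sketch
(port number 1 assumed, error handling trimmed):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);
        if (!list || num == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(list[0]);
        struct ibv_port_attr port;
        if (ctx && ibv_query_port(ctx, 1, &port) == 0) {
            /* What the HCA reports; note there are no link credit knobs here. */
            printf("%s: state=%d active_mtu=%d max_vl_num=%d\n",
                   ibv_get_device_name(list[0]),
                   port.state, port.active_mtu, port.max_vl_num);
        }

        if (ctx)
            ibv_close_device(ctx);
        ibv_free_device_list(list);
        return 0;
    }

Everything below those attributes (the per-VL credit accounting, the
flow control packets) stays inside the HCA.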
> Pardon me if I am bundling you all with a lot of questions. I am new
> to all this and I am trying my best to understand the stack.
You might compare and contrast the QLogic drivers and the Mellanox
drivers. The hardware designs are very different; to that point, the
older QLogic (PathScale) hardware has no firmware in the way that
Mellanox hardware does, so the differences are instructive.
In general there is no need to manipulate credits unless you are
designing hardware or you are a hardware vendor.
> Thank you,
>
> Ashwath
>
> On Tue, Aug 11, 2009 at 10:37 PM, Nifty Tom Mitchell
> <[1]niftyompi at niftyegg.com> wrote:
>
> On Mon, Aug 10, 2009 at 12:11:22PM -0400, Ashwath Narasimhan wrote:
> >
> > I looked into the infiniband driver files. As I understand, in
> > order to limit the data rate we manipulate the credits on either
> > end. Since the number of credits available depends on the
> > receiver's work receive queue size, I decided to limit the queue
> > size to say 5 instead of 8192 (reference---> ipoib.h,
> > IPOIB_MAX_QUEUE_SIZE to say 3, since my higher layer protocol is
> > ipoib). I just want to confirm if I am doing the right thing?
>
> Data rate is not manipulated by credits.
> Credits and queue sizes are different and have different purposes.
> Visit the InfiniBand Trade Association web site and grab the IB
> specifications to understand some of the hardware-level parts.
> [2]http://www.infinibandta.org/
> InfiniBand offers credit-based flow control, and given the nature of
> modern IB switches and processors a very small credit count can still
> result in full data rate. Having said that, flow control is the
> lowest-level throttle in the system. Reducing the credit count forces
> the higher levels in the protocol stack to source or sink the data
> through the hardware before any more can be delivered. Thus flow
> control can simplify the implementation of higher-level protocols. It
> can also be used to cost-reduce or simplify hardware design (smaller
> hardware buffers).
> The IB specifications are way too long. Start with this FAQ:
> [3]http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf
> The IB specification is also way too full of optional features. A
> vendor may have XYZ working fine and dandy on one card and, since it
> is optional, not at all on another.
> The various queue sizes for the various protocols built on top of IB
> establish transfer behavior in keeping with system interrupt, process
> time slice, and kernel activity loads and needs. It is
> counter-intuitive, but in some cases small queues result in more
> responsive and agile systems, especially in the presence of errors.
> Since there are often multiple protocols on the IB stack, all of them
> will be impacted by credit tinkering. Most vendors know their
> hardware, so most drivers will already have the credit-related code
> close to optimal.
> In the case of TCP/IP the interaction between IB bandwidth & MTU
> (IPoIB), Ethernet bandwidth & MTU, and even localhost (127.0.0.1)
> bandwidth & MTU can be "interesting" depending on host names, subnets,
> routing, etc. TCP/IP has lots of tuning flags well above the IB
> driver; I see 500+ net.* sysctl knobs on this system.
> As you change things, do make the changes on all the moving parts,
> benchmark, and keep a log. Since there are multiple IB hardware
> vendors it is important to track hardware specifics. "lspci" is a
> good tool to gather chip info. With some cards you also need
> specifics about the active firmware.
> So go forth (RPN forever) and conquer.
> --
> T o m M i t c h e l l
> Found me a new hat, now what?
>
> --
> regards,
> Ashwath
>
> References
>
> 1. mailto:niftyompi at niftyegg.com
> 2. http://www.infinibandta.org/
> 3. http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf
--
T o m M i t c h e l l
Found me a new hat, now what?