[Openib-windows] [ANNOUNCE] Build 1.0.0.566 posted

Yossi Leybovich sleybo at dev.mellanox.co.il
Mon Jan 29 06:28:33 PST 2007


see my comments below.


  _____  

From: openib-windows-bounces at openib.org
[mailto:openib-windows-bounces at openib.org] On Behalf Of Fab Tillier
Sent: Thursday, January 25, 2007 7:34 PM
To: Yossi Leybovich; openib-windows at openib.org
Subject: Re: [Openib-windows] [ANNOUNCE] Build 1.0.0.566 posted



Hi Yossi,

 

A question about r538:

------------------------------------------------------------------------
r538 | sleybo | 2006-11-07 08:54:25 +0200 (Tue, 07 Nov 2006) | 3 lines

[IBAL] Compliance tests
1. pass switch_info to the HCA - compliance test C13-026
2. Do not use the AL cache for node_description/node_info, to force the
Mkey check - compliance test C14-018
------------------------------------------------------------------------



Have you tested to see what the effects of removing the cache for node
description and node info are on SM sweeps when the system is busy?

 

I initially added the cache for these so that the response could be issued
in the context of the CQ callback for the special QP (thus at
DISPATCH_LEVEL).  Without the cache, processing requires a call to the local
MAD verb, which has to be scheduled on a passive-level thread.  If the
system is very busy doing I/O (i.e. lots of small packets in Iometer over
IPoIB), I have seen cases where the local MAD thread does not run fast
enough so the response time for the MAD is too long and the SM declares the
node as having failed and removes it from the fabric.  This is pretty nasty,
as suddenly all IB multicast group memberships are lost, but there's no
indication to the host that things went awry.

 

There are two solutions for this; one is more of a temporary fix than the
other, IMO.  First, the temporary fix: perform the MKey check in software, so
that the response for as many MADs as possible can be generated at
DISPATCH_LEVEL from the context of the special QP's CQ callback.  This should
maintain compliance while keeping MAD response times as short as possible.
[Yossi Leybovich] To avoid the denial-of-service problem I will add a simple
M-Key check.  In any error case (or for a non-trivial M-Key check, i.e.
M-Key != 0) I will disable the cache and forward the MAD to the firmware.

(I don't want to count M-Key violations, and certainly not to add code that
generates traps.)

This keeps the overhead for handling good-flow packets low.
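The fast-path decision described above could be sketched roughly as follows.
This is an illustrative sketch only, not the actual IBAL code; the type and
function names are hypothetical, and it assumes the common reading of the IBA
spec that a port M-Key of 0 means M-Key protection is disabled:

```c
#include <stdint.h>

/* Hypothetical sketch: only the trivial case (port M-Key == 0, i.e. no
 * M-Key protection) is answered from the cache at DISPATCH_LEVEL;
 * anything else is punted to the firmware, which performs the full
 * compliant M-Key check.  Names are illustrative, not from IBAL. */
typedef enum { MAD_SERVE_FROM_CACHE, MAD_FORWARD_TO_FW } mad_route_t;

static mad_route_t route_mad(uint64_t port_mkey)
{
    /* Trivial check: M-Key 0 means the port is unprotected, so the
     * cached node_description/node_info response can go out at once. */
    if (port_mkey == 0)
        return MAD_SERVE_FROM_CACHE;

    /* Non-trivial check: bypass the cache and let firmware handle the
     * MAD, including M-Key enforcement (no violation counting or traps
     * in the driver). */
    return MAD_FORWARD_TO_FW;
}
```

With this split, the common unprotected-fabric case keeps the short
DISPATCH_LEVEL response path, and only protected ports pay the firmware
round trip.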


The second solution is to make the local MAD verb asynchronous.  The HCA
handles the command asynchronously anyway, so this is a more natural fit
given the HW design.  This would mean the local MAD verb would be called
directly from the CQ callback (at DISPATCH_LEVEL), and would return pending.
When the local MAD is processed and the HCA generates the response to the
EQ, the driver could invoke a callback to indicate completion (again at
DISPATCH_LEVEL) which would send out the response.  This solution eliminates
the thread scheduling issues associated with handling local MAD requests in
a passive-level thread.
[Yossi Leybovich] 

This would require testing our driver with asynchronous commands (I think
Leonid's code does not fully support them yet - Leonid?).  I don't think we
will have time for that in the near future.
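The asynchronous flow Fab describes (call the local MAD verb from the CQ
callback, return pending, send the response from a completion callback when
the HCA signals the EQ) could be sketched as below.  All names here are
hypothetical, not the real IBAL/verbs API:

```c
#include <stddef.h>

/* Hypothetical sketch of an asynchronous local-MAD verb: invoked at
 * DISPATCH_LEVEL from the special QP's CQ callback, it returns
 * "pending" and records a completion callback that the EQ path
 * invokes later (also at DISPATCH_LEVEL) to send the response. */
typedef enum { STATUS_SUCCESS, STATUS_PENDING } status_t;
typedef void (*mad_done_cb)(void *context);

struct local_mad_req {
    mad_done_cb done;   /* invoked when the HCA signals the EQ */
    void *context;
};

/* Example completion callback: stands in for sending the response. */
static int response_sent = 0;
static void send_response(void *context) { (void)context; response_sent = 1; }

/* Called from the CQ callback; posts the command and returns pending. */
static status_t local_mad_async(struct local_mad_req *req,
                                mad_done_cb done, void *context)
{
    req->done = done;
    req->context = context;
    /* The command would be posted to the HCA here; it completes later,
     * so no passive-level thread is ever needed. */
    return STATUS_PENDING;
}

/* Simulates the EQ event path: the driver sees the command completion
 * and invokes the recorded callback, still at DISPATCH_LEVEL. */
static void on_eq_event(struct local_mad_req *req)
{
    req->done(req->context);
}
```

Because the response is generated entirely from interrupt-driven callbacks,
the thread-scheduling latency that caused the SM timeouts disappears.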

 

We should make sure that systems aren't susceptible to Denial of Service
attacks from someone flooding them with IPoIB traffic (which gets handled at
DISPATCH_LEVEL in IPoIB's CQ callback).  It's bad if an application on one
host can cause another host to be removed from the fabric - there will be no
port down events, no notification to the SM when the host is responsive
again, and the host will not be able to participate properly in the fabric
until the next SM sweep.

 

-Fab
