[Openib-windows] [ANNOUNCE] Build 1.0.0.566 posted
Yossi Leybovich
sleybo at dev.mellanox.co.il
Wed Jan 31 00:08:54 PST 2007
As far as I checked with Leonid, the SMP handling is working in the HCA.
Still, it is not compliant, as it does not check the M_Key.
I will work to implement that and move it to IBAL.
_____
From: Fab Tillier [mailto:ftillier at windows.microsoft.com]
Sent: Wednesday, January 31, 2007 6:55 AM
To: Yossi Leybovich; openib-windows at openib.org
Subject: RE: [Openib-windows] [ANNOUNCE] Build 1.0.0.566 posted
You should ideally be able to cache the PKey and GID tables. The HCA driver's
SMP cache is structured to support that, but I don't remember why I didn't set
it up to actually do it.
Note, however, that the SMP cache in the HCA driver is still invoked from the
passive-level thread, so it doesn't quite solve the problem. I don't know
if the cache should be in IBAL. If implemented in the HCA driver, I think
using a passive thread (or even just a work item) could provide async local
MAD functionality even if the underlying HCA driver implementation blocks,
allowing the elimination of the local MAD complexity in IBAL.
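
A minimal sketch of that work-item idea (plain C, hypothetical names
throughout, with a toy in-memory queue standing in for a real IO_WORKITEM
queued via IoQueueWorkItem): the submit side only queues the request and
returns, a worker running at passive level calls the blocking local MAD verb,
and a completion callback sends the response.

/* Hedged sketch, not real driver code: models how a work item could wrap the
 * blocking local MAD verb behind an async interface.  All names here are
 * hypothetical; a real driver would queue an IO_WORKITEM instead of using
 * this toy in-memory queue.                                                 */
#include <stdio.h>
#include <string.h>

typedef struct local_mad_request {
    unsigned char mad_in[256];     /* request SMP (256 bytes)                */
    unsigned char mad_out[256];    /* response filled in by the worker       */
    void (*complete)(struct local_mad_request *req);  /* called when done    */
    struct local_mad_request *next;
} local_mad_request_t;

static local_mad_request_t *g_queue;   /* stands in for the queued work item */

/* Submit side: called from the CQ callback, must not block, so only queues. */
static void submit_local_mad(local_mad_request_t *req)
{
    req->next = g_queue;
    g_queue = req;                     /* real code: IoQueueWorkItem(...)    */
}

/* Stand-in for the blocking local MAD verb the HCA driver exposes today.    */
static void hca_local_mad_blocking(const unsigned char *in, unsigned char *out)
{
    memcpy(out, in, 256);              /* pretend the FW produced a GetResp  */
}

/* Worker side: runs at passive level, may block, then completes the request.*/
static void local_mad_work_item(void)
{
    while (g_queue) {
        local_mad_request_t *req = g_queue;
        g_queue = req->next;
        hca_local_mad_blocking(req->mad_in, req->mad_out);
        req->complete(req);            /* owner sends the response MAD       */
    }
}

static void send_response(local_mad_request_t *req)
{
    printf("response ready, first byte 0x%02x\n", req->mad_out[0]);
}

int main(void)
{
    local_mad_request_t req = { .complete = send_response };
    req.mad_in[0] = 0x01;              /* sample data only                   */
    submit_local_mad(&req);            /* "CQ callback" side                 */
    local_mad_work_item();             /* "work item" side                   */
    return 0;
}

The point of this shape is that nothing on the submit side can block, so the
caller no longer needs its own passive-level MAD thread.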
-Fab
From: Yossi Leybovich [mailto:sleybo at dev.mellanox.co.il]
Sent: Tuesday, January 30, 2007 6:47 AM
To: 'Yossi Leybovich'; Fab Tillier; openib-windows at openib.org
Subject: RE: [Openib-windows] [ANNOUNCE] Build 1.0.0.566 posted
Two more things.
1. Please note that even when the simple M_Key check passes, we still need to
reset the M_Key lease timer (which is used by the HW/FW), so we still need to
forward the first good packet to the FW.
2. Is there any reason why we don't keep the GID/PKey tables in a cache? Can't
we answer these packets from the cache?
Yossi
_____
From: openib-windows-bounces at openib.org
[mailto:openib-windows-bounces at openib.org] On Behalf Of Yossi Leybovich
Sent: Monday, January 29, 2007 4:29 PM
To: 'Fab Tillier'; openib-windows at openib.org
Subject: Re: [Openib-windows] [ANNOUNCE] Build 1.0.0.566 posted
See my comments below.
_____
From: openib-windows-bounces at openib.org
[mailto:openib-windows-bounces at openib.org] On Behalf Of Fab Tillier
Sent: Thursday, January 25, 2007 7:34 PM
To: Yossi Leybovich; openib-windows at openib.org
Subject: Re: [Openib-windows] [ANNOUNCE] Build 1.0.0.566 posted
Hi Yossi,
A question about r538:
------------------------------------------------------------------------
r538 | sleybo | 2006-11-07 08:54:25 +0200 (Tue, 07 Nov 2006) | 3 lines
[IBAL] Compliance tests
1. pass switch_info to the HCA - compliance test C13-026
2. Not use AL cashe for node_description node_info to force Mkey check
-compliance test C14-018
------------------------------------------------------------------------
Have you tested to see what the effects of removing the cache for node
description and node info are on SM sweeps when the system is busy?
I initially added the cache for these so that the response could be issued
in the context of the CQ callback for the special QP (thus at
DISPATCH_LEVEL). Without the cache, processing requires a call to the local
MAD verb, which has to be scheduled on a passive-level thread. If the
system is very busy doing I/O (i.e. lots of small packets in Iometer over
IPoIB), I have seen cases where the local MAD thread does not run fast
enough so the response time for the MAD is too long and the SM declares the
node as having failed and removes it from the fabric. This is pretty nasty,
as suddenly all IB multicast group memberships are lost, but there's no
indication to the host that things went awry.
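
For illustration, here is a minimal sketch (plain C, hypothetical names; the
attribute IDs are the ones from the IB spec) of the fast path the cache gives:
a SubnGet of NodeInfo or NodeDescription is answered straight from a cache
inside the special QP's CQ callback at DISPATCH_LEVEL, and everything else
still goes to the passive-level local MAD thread.

/* Hedged sketch with hypothetical names of the DISPATCH_LEVEL fast path the
 * cache provides: Gets of NodeInfo/NodeDescription are answered from a cache
 * inside the special QP's CQ callback; anything else is handed to the
 * passive-level thread that calls the local MAD verb.                       */
#include <stdint.h>
#include <stdio.h>

#define IB_MAD_ATTR_NODE_DESC 0x0010   /* NodeDescription attribute ID */
#define IB_MAD_ATTR_NODE_INFO 0x0011   /* NodeInfo attribute ID        */
#define IB_MAD_METHOD_GET     0x01     /* SubnGet                      */

struct smp {
    uint8_t  method;
    uint16_t attr_id;
};

struct smp_cache {
    uint8_t node_desc[64];   /* cached NodeDescription payload (64 bytes) */
    uint8_t node_info[40];   /* cached NodeInfo payload (40 bytes)        */
};

/* Returns 1 if the response could be generated at DISPATCH_LEVEL from the
 * cache, 0 if the MAD has to be queued to the passive-level MAD thread.    */
static int handle_smp_in_cq_callback(const struct smp_cache *cache,
                                     const struct smp *mad)
{
    (void)cache;   /* unused in this sketch; a real driver would copy from it */
    if (mad->method == IB_MAD_METHOD_GET &&
        (mad->attr_id == IB_MAD_ATTR_NODE_DESC ||
         mad->attr_id == IB_MAD_ATTR_NODE_INFO))
        return 1;  /* a real driver would build and post the GetResp here    */
    return 0;      /* slow path: this is the latency the SM can time out on  */
}

int main(void)
{
    struct smp_cache cache = { {0}, {0} };
    struct smp get_nd = { IB_MAD_METHOD_GET, IB_MAD_ATTR_NODE_DESC };
    struct smp get_pi = { IB_MAD_METHOD_GET, 0x0015 /* PortInfo */ };
    printf("NodeDescription answered from cache: %d\n",
           handle_smp_in_cq_callback(&cache, &get_nd));
    printf("PortInfo answered from cache:        %d\n",
           handle_smp_in_cq_callback(&cache, &get_pi));
    return 0;
}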
There are two solutions for this; one is more of a temporary fix than the
other, IMO. First, the temporary fix: perform the MKey check in software, so
that the MAD response for as many MADs as possible can be generated at
DISPATCH_LEVEL from the context of the special QP's CQ callback. This should
maintain compliance while also keeping response times for MADs as short as
possible.
[Yossi Leybovich] To solve the denial-of-service problem I will add a simple
M_Key check. In any case of error (or of a non-trivial M_Key check, i.e.
M_Key = 0) I will disable the cache and forward the MAD to the FW
(I don't want to count M_Key violations, and of course not to add code that
generates traps).
This will reduce the handling of good-flow packets.
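
A minimal sketch of one way such a software gate could look (plain C,
hypothetical names; exactly which cases count as "trivial", and whether the
unprotected M_Key = 0 case is served from the cache or also punted to the FW,
is my assumption here, not necessarily what the driver will do):

/* Hedged sketch of a simple software M_Key gate: serve from the cache only
 * in the easy cases, and push everything that could involve violation
 * counters, traps or the lease timer down to the FW unchanged.             */
#include <stdint.h>
#include <stdio.h>

enum smp_disposition { SERVE_FROM_CACHE = 0, FORWARD_TO_FW = 1 };

static enum smp_disposition mkey_gate(uint64_t port_m_key, uint64_t smp_m_key)
{
    if (port_m_key == 0)
        return SERVE_FROM_CACHE;   /* per the IB spec an M_Key of 0 disables
                                      protection; a driver could also choose
                                      to punt even this case to the FW       */
    if (smp_m_key == port_m_key)
        return SERVE_FROM_CACHE;   /* keys match; note the follow-up above:
                                      the first such packet may still have to
                                      reach the FW to reset the lease timer  */
    return FORWARD_TO_FW;          /* mismatch: let the FW count the
                                      violation and raise traps if enabled   */
}

int main(void)
{
    printf("%d %d %d\n",
           mkey_gate(0x0000, 0xabcd),   /* 0: unprotected port -> cache */
           mkey_gate(0x1234, 0x1234),   /* 0: key matches      -> cache */
           mkey_gate(0x1234, 0x0000));  /* 1: wrong key        -> FW    */
    return 0;
}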
The second solution is to make the local MAD verb asynchronous. The HCA
handles the command asynchronously anyway, so this is a more natural fit
given the HW design. This would mean the local MAD verb would be called
directly from the CQ callback (at DISPATCH_LEVEL), and would return pending.
When the local MAD is processed and the HCA generates the response to the
EQ, the driver could invoke a callback to indicate completion (again at
DISPATCH_LEVEL) which would send out the response. This solution eliminates
the thread scheduling issues associated with handling local MAD requests in
a passive-level thread.
[Yossi Leybovich]
This will require testing our driver with async commands (I think Leonid does
not fully support that yet. Leonid?). I don't think we will have the time to
do that in the near future.
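
For reference, a minimal plain-C sketch of the asynchronous contract described
above, with hypothetical names (hca_local_mad_async, hca_eq_handler): the verb
is callable at DISPATCH_LEVEL, returns pending immediately, and a stand-in for
the EQ path later invokes the completion callback that would send the response
MAD.

/* Hedged sketch, plain C with hypothetical names, of an asynchronous local
 * MAD verb: it can be called at DISPATCH_LEVEL, returns "pending" at once,
 * and the EQ path later invokes a completion callback with the response.   */
#include <stdio.h>
#include <string.h>

#define STATUS_PENDING 1

typedef void (*local_mad_done_t)(void *context, const unsigned char *resp_mad);

struct local_mad_op {
    unsigned char    req_mad[256];
    unsigned char    resp_mad[256];
    local_mad_done_t done;
    void            *context;
};

static struct local_mad_op *g_inflight;   /* command the "HCA" is working on */

/* Async local MAD verb: posts the command and returns without waiting.     */
static int hca_local_mad_async(struct local_mad_op *op,
                               local_mad_done_t done, void *context)
{
    op->done = done;
    op->context = context;
    g_inflight = op;     /* a real driver would write the command to the HCA */
    return STATUS_PENDING;
}

/* Stand-in for the EQ interrupt/DPC path: the command finished, so hand the
 * response back to the owner, still without leaving DISPATCH_LEVEL.        */
static void hca_eq_handler(void)
{
    struct local_mad_op *op = g_inflight;
    g_inflight = NULL;
    memcpy(op->resp_mad, op->req_mad, sizeof(op->resp_mad)); /* fake GetResp */
    op->done(op->context, op->resp_mad);
}

static void send_response(void *context, const unsigned char *resp_mad)
{
    (void)context;
    printf("GetResp ready, first byte 0x%02x\n", resp_mad[0]);
}

int main(void)
{
    struct local_mad_op op = { .req_mad = { 0x01 } };
    if (hca_local_mad_async(&op, send_response, NULL) == STATUS_PENDING)
        hca_eq_handler();             /* simulate the completion event       */
    return 0;
}

Usage is simulated in main(): the "CQ callback" posts the command and sees
STATUS_PENDING, and the "EQ handler" later completes it, so no passive-level
thread is involved anywhere.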
We should make sure that systems aren't susceptible to Denial of Service
attacks from someone flooding them with IPoIB traffic (which gets handled at
DISPATCH_LEVEL in IPoIB's CQ callback). It's bad if an application on one
host can cause another host to be removed from the fabric - there will be no
port down events, no notification to the SM when the host is responsive
again, and the host will not be able to participate properly in the fabric
until the next SM sweep.
-Fab