[openib-general] Re: [Sc05-ib] OpenSM (lack of) error handling

Hal Rosenstock halr at voltaire.com
Fri Nov 18 18:36:03 PST 2005


On Thu, 2005-11-10 at 01:36, Troy Benjegerdes wrote:
> OpenSM does NOT handle links that generate errors very well at ALL. We 
> have several flakey links on the SC05 show floor, and opensm is 
> segfaulting and generally not very happy about it.
> 
> Is there a reasonable way to partition off links that generate lots of 
> errors without physically unplugging them?

The port on the other end of the link can be physically disabled. A
management command to do this can be added. If the SM were embedded in
a switch, then all ports on the links opposite the switch ports would
need to be disabled.

The harder part is detecting that this needs to be done. An SM Key
mismatch might be one indicator, but in the real rogue case the other SM
might not play by the rules and respond properly. The policy for doing
this is also hard, especially if that policy is built into the SM rather
than left to the network administrator (a person) issuing a manual
command. Disabling the port also does not take care of the nodes which
were claimed by that other SM. That can be problematic if the other SM
set the MKeyProtect bits to 2 or 3 with an infinite lease period. There
is no way to reclaim them in that particular case other than rebooting
those nodes.
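
To make the M_Key case concrete, here is a rough sketch of the access
check as I read it from the IBA spec. It is a toy model, not OpenSM
code: the class and function names are made up for illustration, and
the rules encoded (a zero M_Key on the port disables checking, a wrong
key always fails a Set, a wrong key fails a Get only at protect levels
2 and 3, and a lease period of 0 never expires) are assumptions that
should be checked against the spec before relying on them.

```python
from dataclasses import dataclass


@dataclass
class PortMKeyState:
    """Hypothetical model of the M_Key fields an SM writes into PortInfo."""
    m_key: int          # key stored on the port by the SM that claimed it
    protect_bits: int   # M_KeyProtectBits, 0 through 3
    lease_period: int   # M_KeyLeasePeriod in seconds; 0 assumed to mean "never expires"


def smp_allowed(port: PortMKeyState, method: str, presented_key: int) -> bool:
    """Return True if a SubnGet ("Get") or SubnSet ("Set") would pass the M_Key check."""
    if method not in ("Get", "Set"):
        raise ValueError(f"unsupported method: {method}")
    # Assumption: a port M_Key of zero disables checking; a matching key always passes.
    if port.m_key == 0 or presented_key == port.m_key:
        return True
    # Wrong key: Set is always refused; Get is refused only at levels 2 and 3.
    return method == "Get" and port.protect_bits in (0, 1)


def lease_never_expires(port: PortMKeyState) -> bool:
    """Assumption: a lease period of 0 means the protection never times out."""
    return port.lease_period == 0


# A port claimed by the other SM with protect level 2 and an infinite lease.
claimed = PortMKeyState(m_key=0x1234, protect_bits=2, lease_period=0)
other_sm_key = 0x9999  # the key our SM would present; it does not match

print(smp_allowed(claimed, "Get", other_sm_key))
print(smp_allowed(claimed, "Set", other_sm_key))
print(lease_never_expires(claimed))
```

In this model an SM holding a different key can neither read nor write
the claimed port, and nothing times out to relax that, which is the
situation where only a reboot of the node helps.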
 
> Also, what is to prevent any random IB client that plugs in from using 
> MAD packets to reset port counters?

Anyone in your partition(s) can do this.

-- Hal



