[openib-general] SM Bad Port Handling

Tue Apr 19 07:00:05 PDT 2005

On Thu, 2005-04-14 at 07:13, Eitan Zahavi wrote: 
> [EZ] Not at all. Although the target port is known, the flaky link
> that fails the MAD might be anywhere along the path to the port. So
> if you mark the target port as bad, you might be marking the wrong
> port!

OpenSM does look for traps 128-131, which include Local Link Integrity
(129). That is the trap these noisy ports are likely to generate, right?
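
To make the trap range concrete, here is a minimal sketch that tallies
Local Link Integrity traps per reporting port. The trap names follow the
IBA notice definitions; the function name and the (trap number, LID,
port number) tuple layout are hypothetical and are not OpenSM's actual
data structures.

```python
from collections import Counter
from typing import Iterable, Tuple

# Trap numbers 128-131 as defined for IBA notices.
LINK_TRAPS = {
    128: "Link state change",
    129: "Local link integrity threshold reached",
    130: "Excessive buffer overrun threshold reached",
    131: "Flow control update watchdog timer expired",
}

# Hypothetical record layout: (trap_number, reporting_lid, port_number).
TrapRecord = Tuple[int, int, int]


def count_link_integrity_traps(traps: Iterable[TrapRecord]) -> Counter:
    """Count trap 129 reports per (lid, port); other traps are ignored."""
    counts: Counter = Counter()
    for trap_num, lid, port in traps:
        if trap_num == 129:
            counts[(lid, port)] += 1
    return counts
```

A port that keeps showing up in this tally is a candidate noisy port, but
the tally alone says nothing about ports that never manage to report.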

>  [EZ] Let me clarify with an example:
> SM=HCA1/P1 -> SW1/P1....SW1/P2->SW2/P1..SW2/P2->SW3/P1....SW3/P3->HCA2/P1
>                                            \..SW4/P4->SW3/P4..SW3/P5->SW3/P2../
>
> If the flaky link is between SW2/P2 and SW3/P1, then the packet sent to
> HCA2 using DR: [0][1][1][2][3] might fail. If you mark HCA2/P1 as
> bad, then you will actually lose that HCA for no good reason, since
> another path from SM to HCA2 exists.

OK. To be really sure about the failed port, one could then walk the
entire DR path from the SM to the apparently non-responding port. If
the same port along the path fails to respond some number of times in
a row (say 4), that port's peer port can be marked as unhealthy. Is
this algorithm acceptable?
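
A rough sketch of the walk proposed above, under some assumptions: the
DR path is given as a tuple of egress port numbers (without the leading
[0] entry), and `probe` is a hypothetical callback that sends one MAD
along a path prefix and returns True if a response came back. None of
these names are OpenSM functions.

```python
from typing import Callable, Optional, Tuple

Path = Tuple[int, ...]
Probe = Callable[[Path], bool]  # hypothetical: True if the MAD was answered


def first_silent_hop(path: Path, probe: Probe) -> Optional[int]:
    """Probe each prefix of the path, shortest first.

    Returns the 1-based hop count of the first prefix that got no
    response, or None if every prefix answered.
    """
    for hops in range(1, len(path) + 1):
        if not probe(path[:hops]):
            return hops
    return None


def find_unhealthy_hop(path: Path, probe: Probe,
                       threshold: int = 4,
                       max_walks: int = 16) -> Optional[int]:
    """Walk the path repeatedly and return the hop that was the first
    silent one `threshold` times in a row, or None if no hop qualifies
    within `max_walks` walks."""
    streak_hop: Optional[int] = None
    streak = 0
    for _ in range(max_walks):
        hop = first_silent_hop(path, probe)
        if hop is None:
            # A clean walk breaks any streak.
            streak_hop, streak = None, 0
            continue
        if hop == streak_hop:
            streak += 1
        else:
            streak_hop, streak = hop, 1
        if streak >= threshold:
            return hop
    return None
```

The returned hop number identifies the link crossed at that hop. The
port to mark unhealthy would be the egress port used there, that is, the
peer of the port that did not respond. With the example path taken as
(1, 2, 2, 3) and a dead SW2/P2 to SW3/P1 link, the result is hop 3, so
SW2/P2 is marked and HCA2/P1 is left alone.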

-- Hal