[openib-general] osm unreliable unless -d1

Eitan Zahavi eitan at mellanox.co.il
Sun Mar 5 01:40:29 PST 2006


Hi J-C

OpenSM ignores recurring traps from the same source if the rate of
traps exceeds some threshold. In your case the switch is probably
sending trap 128 every time the port is brought down and up again. You
should see the following message in osm.log:

      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
               "__osm_trap_rcv_process_request: ERR 3804: "
               "Received the trap %u times continuously\n",
               num_received);

If this occurs you know why the link is still down...
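
A quick way to check for that message, sketched in shell to match the
test loop in the original report. The log location /var/log/osm.log is
an assumption (it depends on how opensm was started), and
count_err_3804 is just a hypothetical helper name:

```shell
# Count how many times OpenSM logged the repeated-trap error (ERR 3804).
# The default log path is an assumption; pass the real location as the
# first argument if your osm.log lives elsewhere.
count_err_3804() {
    log_file="${1:-/var/log/osm.log}"
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are none, so "|| true" keeps the helper usable under "set -e".
    grep -c 'ERR 3804' "$log_file" || true
}
```

A non-zero count after the port gets stuck would suggest that the trap
filtering is what keeps the port from being re-initialized.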

You should try adding more sleep between the changes (I think the
trap filtering uses a timeout of 5 sec).
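
As a rough sketch of that suggestion, here is the reload loop from the
original report with a configurable pause. The 10 second default is an
arbitrary value picked to exceed the 5 sec timeout guessed above, not a
verified number. PAUSE, CYCLES and RUN are hypothetical knobs added for
illustration; setting RUN=echo prints the commands instead of running
them (modprobe needs root):

```shell
# Reload the HCA driver repeatedly, pausing between each change.
#   PAUSE  - seconds to sleep between changes (default 10, an assumption)
#   CYCLES - number of iterations before stopping (default 20)
#   RUN    - optional command prefix; RUN=echo gives a dry run
reload_cycle() {
    pause="${PAUSE:-10}"
    cycles="${CYCLES:-20}"
    run="${RUN:-}"
    i=1
    while [ "$i" -le "$cycles" ]; do
        $run modprobe -r ib_mthca
        $run sleep "$pause"
        $run modprobe ib_mthca
        $run ibstat
        echo "cycle $i"
        $run sleep "$pause"
        i=$((i + 1))
    done
}
```

For example, "RUN=echo CYCLES=2 reload_cycle" lists the commands for two
iterations without touching the driver.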

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-
> bounces at openib.org] On Behalf Of Jean-Christophe Hugly
> Sent: Saturday, March 04, 2006 4:28 AM
> To: openib-general at openib.org
> Subject: [openib-general] osm unreliable unless -d1
> 
> 
> Hi Guys,
> 
> I have been having trouble with gen2's osm for a while. I finally
> isolated the faulty behaviour to one easy test case:
> 
> run osm somewhere.
> 
> then on whatever workstation has an HCA connected to the same subnet,
> do this:
> 
> i=1
> while true; do
> 	modprobe -r ib_mthca
> 	sleep 3
> 	modprobe ib_mthca
> 	ibstat
> 	echo $i
> 	sleep 3
> 	i=`expr $i + 1`
> done
> 
> For me, after i reaches 7 or 8, the port no longer gets initialized and
> ibstat reports:
> 
>                State: Initializing
>                Physical state: LinkUp
> 
> On the other hand if you run osm with -d1 option (mostly
> single-threaded), then it seems to work indefinitely.
> 
> I did this with osm r5594, compiled and running on suse10 (dual xeon)
> with openib of the same rev. The "client side" is the same os and rev;
> cpus are 4 opterons.
> 
> I have not started to look for faulty mutexes yet. Were the fixes
> recently proposed in that area committed as of 5594?
> 
> 
> --
> Jean-Christophe Hugly <jice at pantasys.com>
> PANTA
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general



