[openib-general] osm unreliable unless -d1

Sasha Khapyorsky sashak at voltaire.com
Sat Mar 4 14:22:54 PST 2006


Hi Jean,

Thanks for reporting.

On 18:28 Fri 03 Mar     , Jean-Christophe Hugly wrote:
> 
> run osm somewhere.
> 
> then on whatever workstation has an HCA connected to the same subnet,
> do this:
> 
> i=1
> while true; do
> 	modprobe -r ib_mthca
> 	sleep 3
> 	modprobe ib_mthca
> 	ibstat
> 	echo $i
> 	sleep 3
> 	i=`expr $i + 1`
> done
> 
> For me, after i reaches 7 or 8, the port no-longer gets initialized and
> ibstat reports:
> 
>                State: Initializing
>                Physical state: LinkUp
> 
> On the other hand if you run osm with -d1 option (mostly
> single-threaded), then it seems to work indefinitely.

I've tried your script and don't see any difference between running with
and without -d1. However, my network is small (two hosts and a switch),
which is probably different from yours.

Also, I see that the port does eventually become active, but only after a
delay. Those delays look strange and inconsistent; I will need to test
more tomorrow. Could you try this modification of your script?

i=1
while true; do
	modprobe -r ib_mthca
	sleep 3
	modprobe ib_mthca
	count=0
	while true ; do
		ibstat | egrep 'State: Active$' > /dev/null
		test $? -eq 0 && break
		count=`expr $count + 1`
		sleep 1
	done
	echo $i: delay $count
	sleep 3
	i=`expr $i + 1`
done
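
One caveat with the loop above: if the port stays in Initializing
forever (the failure originally reported), the inner "while true" never
exits. A minimal sketch of a bounded wait follows. The function name
wait_for_active and its timeout argument are hypothetical, introduced
here only for illustration; the sketch assumes nothing beyond ibstat and
grep being available.

```shell
#!/bin/sh
# Sketch only: wait_for_active is a hypothetical helper, not part of
# the original script.

# Poll ibstat once a second until some port reports "State: Active"
# or until $1 seconds have elapsed. Prints the number of seconds
# waited. Returns 0 if a port became active, 1 on timeout.
wait_for_active() {
	timeout=$1
	count=0
	while [ "$count" -lt "$timeout" ]; do
		# "Physical state: LinkUp" does not match: lowercase "state".
		if ibstat | grep 'State: Active$' > /dev/null; then
			echo "$count"
			return 0
		fi
		count=$((count + 1))
		sleep 1
	done
	echo "$count"
	return 1
}
```

In the outer loop, the inner "while" could then become something like
delay=`wait_for_active 60`, printing "$i: delay $delay" on success and
a "still not Active" message when the function returns non-zero.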

> I did this with osm r5594, compiled and running on suse10 (dual xeon)
> with openib of the same rev. The "client side" is the same os and rev;
> cpus are 4 opterons.
> 
> I have not started to look for faulty mutexes yet. Were the fixes
> recently proposed in that area committed as of 5594?

They are not committed yet, and I think the problems addressed there are
different (not sure, however).

One resweep-related symptom that the "atomic" patch should solve is when
the outstanding MAD counter becomes corrupted and goes negative; this
leaves osm stuck in the resweep state. In my tests, however, this failure
takes longer to reproduce (but again, my network is small).

Sasha.


