[openib-general] osm unreliable unless -d1
Sasha Khapyorsky
sashak at voltaire.com
Sat Mar 4 14:22:54 PST 2006
Hi Jean,
Thanks for reporting.
On 18:28 Fri 03 Mar , Jean-Christophe Hugly wrote:
>
> run osm somewhere.
>
> then on whatever workstation has an HCA connected to the same subnet,
> do this:
>
> i=1
> while true; do
> modprobe -r ib_mthca
> sleep 3
> modprobe ib_mthca
> ibstat
> echo $i
> sleep 3
> i=`expr $i + 1`
> done
>
> For me, after i reaches 7 or 8, the port no-longer gets initialized and
> ibstat reports:
>
> State: Initializing
> Physical state: LinkUp
>
> On the other hand if you run osm with -d1 option (mostly
> single-threaded), then it seems to work indefinitely.
I've tried your script and don't see any difference between running with
and without -d1. However, my network is small (two hosts and a switch),
which is probably different from yours.
Also, I see that the port does eventually become active, but only after a
delay. Those delays look strange and inconsistent; I will need to test
more tomorrow.
Could you try the following modification to your script?
i=1
while true; do
    # Reload the HCA driver to force the port to be re-initialized.
    modprobe -r ib_mthca
    sleep 3
    modprobe ib_mthca
    # Count the seconds until ibstat reports the port as Active.
    count=0
    while true; do
        ibstat | grep -q 'State: Active$' && break
        count=`expr $count + 1`
        sleep 1
    done
    echo "$i: delay $count"
    sleep 3
    i=`expr $i + 1`
done
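If the port never leaves Initializing on your setup, the inner loop above would spin forever. A bounded wait might be more convenient in that case. The following is only a minimal sketch: the helper name wait_active and the 60-second default timeout are my assumptions, not part of the script above.

```shell
# Hypothetical helper (name and default timeout are illustrative only).
# Polls ibstat once per second until a port reports "State: Active".
# Prints the number of seconds waited; returns 0 on success and
# 1 if the timeout (first argument, default 60) expires first.
wait_active() {
    max=${1:-60}
    count=0
    while [ "$count" -lt "$max" ]; do
        if ibstat | grep -q 'State: Active$'; then
            echo "$count"
            return 0
        fi
        count=$((count + 1))
        sleep 1
    done
    echo "$count"
    return 1
}
```

In the outer loop it could replace the inner "while true" block, for example: delay=`wait_active 120` || echo "$i: port still not Active", followed by echo "$i: delay $delay".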
> I did this with osm r5594, compiled and running on suse10 (dual xeon)
> with openib of the same rev. The "client side" is the same os and rev;
> cpus are 4 opterons.
>
> I have not started to look for faulty mutexes, yet. Were the fixes
> recently proposed in that area committed as of 5594 ?
They are not committed yet, and I think the problems there are different
(not sure, however).
One resweep-related symptom that the "atomic" patch should solve is when
the outstanding MAD counter becomes corrupted and goes negative; this
leaves osm stuck in the resweep state. In my tests, however, this failure
takes longer to reproduce (but again, my network is small).
Sasha.