[ofa-general] [PATCH 1/1 V2] SDP - Fix reference count bug that prevents mlx4_ib and ib_sdp unload

Jim Mott jim at mellanox.com
Tue Nov 6 15:09:49 PST 2007


It is an SDP bug.  

The test that found the problem had a symptom where the "rmmod mlx4_ib"
command would hang in an uninterruptible sleep in cma_remove_one().
Any attempt to unload ib_sdp would also hang.  The original V1 post
of this patch was very light on detail, so I took the V2 opportunity to
explain.  Almost makes up for the stupid mistake in the first patch...

To duplicate the bug and verify this fix, use 3 nodes (1 just for the SM,
so it does not confuse things) and execute steps 1-7 below in numeric
order.  The steps are grouped by the node they run on:

node0: (MLX4)
  0) opensm started
  
node1: (MLX4) [With ib_sdp loaded and LD_PRELOAD setup]
  1) netserver 
  3) /sbin/rmmod mlx4_ib && /sbin/modprobe mlx4_ib (in parallel to 2)
     *** HANGS before fix; works after ***
  6) killall netserver
  7) modprobe -r ib_sdp
     *** HANGS before fix; works after ***

node2: (MLX4) [With ib_sdp loaded and LD_PRELOAD setup]
  2) netperf -C -c -P 0 -t TCP_STREAM -H green_ib -l 120 -- -m 1000000
  4) after failure, ^C or just wait for netperf to end on its own with
     "netperf: cannot shutdown tcp stream socket: Transport endpoint
     is not connected"
  5) /etc/init.d/openibd stop
     *** WORKS before and after fix ***

Thanks,
Jim

Jim Mott
Mellanox Technologies Ltd.
mail: jim at mellanox.com
Phone: 512-294-5481


-----Original Message-----
From: Roland Dreier [mailto:rdreier at cisco.com] 
Sent: Tuesday, November 06, 2007 4:56 PM
To: Jim Mott
Cc: openib-general at openib.org
Subject: Re: [ofa-general] [PATCH 1/1 V2] SDP - Fix reference count bug
that prevents mlx4_ib and ib_sdp unload

What does this have to do with mlx4?  It seems it is just a bug in SDP
related to hot-removing any device, right?

 - R.


