[openib-general] IB_CM_REJ_INVALID_SERVICE_ID

Eric Barton eeb at bartonsoftware.com
Thu Jan 4 06:29:18 PST 2007


Sean,

> Eric Barton wrote:
> > Can an rdma_connect be rejected with IB_CM_REJ_INVALID_SERVICE_ID
> > for any other reason than the peer isn't listening with the
> > correct service number?
> 
> This should only occur if the remote peer isn't listening.  This
> reject code is automatically sent by the ib_cm when a request does
> not find a corresponding listen.
> 
> >>We are testing 1.6b5 for an InfiniBand cluster with RHEL 4. We use
> >>the binaries provided by CFS and use OFED 1.1 as the IB stack.
> >>
> >>Several times, some of the clients have hung during fs mount or
> >>when an OST is added (see log).  Error:
> >>LustreError: 1776:0:(o2iblnd_cb.c:2314:kiblnd_rejected())
> >>10.0.90.8@o2ib rejected: reason 8, size 148
> 
> Is this event = 8 and status = 8?

yes
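
For reference, here is a rough sketch of how that status value could
be decoded when reading such a log line.  It is a standalone
illustration, not code from o2iblnd or OFED: the enum and function
names are made up for this example, and only the value 8 is taken from
the OFED enum quoted in this thread.  Any other value is reported as
unknown rather than guessed.

```c
/* Hypothetical local names; only the value 8 comes from the OFED
 * enum ib_cm_rej_reason quoted in this thread. */
enum sketch_rej_reason {
        SKETCH_REJ_INVALID_SERVICE_ID = 8
};

/* Map a reject reason to a human-readable string for log reading. */
static const char *sketch_rej_reason_str(int reason)
{
        switch (reason) {
        case SKETCH_REJ_INVALID_SERVICE_ID:
                return "invalid service ID (no matching listen on the peer)";
        default:
                return "unknown (not covered by this sketch)";
        }
}
```

With this, sketch_rej_reason_str(8) yields the "invalid service ID"
text for the "reason 8" in the customer's log.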

> >>from OFED: enum ib_cm_rej_reason {
> >>       IB_CM_REJ_INVALID_SERVICE_ID = 8,
> >>
> >>Once an IPoIB ping is started to the corresponding OST the client
> >>continues. Afterwards it is quite stable.
> > 
> > 
> > ...which seems to be saying that just doing an IPoIB ping to the
> > server was enough to make rdma_connect() work OK.
> 
> I can't explain the relationship between the ping and the connect
> starting to work.

some more from the customer...

> We have removed the two Mellanox cards from the OSS and put a single
> Voltaire card in it.  This seems to work. I could connect 60 nodes
> without errors. We will further investigate this to better
> understand the cause of the original problem.

...I hadn't realised they had > 1 HCA.  We bind the listen ID to a
specific IP address - could that have some bearing?  AFAICS from the
customer's debug logs, they are listening on the correct HCA...
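
To make the question concrete, here is a toy model of the scenario I
have in mind.  It is only a sketch of the hypothesis, NOT how ib_cm or
the RDMA CM actually look up listens: the table, the struct and
function names, and the idea of a per-device match are all assumptions
invented for illustration.  It shows how a listen bound to one device
could leave a request arriving on the other HCA with no matching
listen, which would surface as reject reason 8.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REJ_INVALID_SERVICE_ID 8   /* value quoted from OFED */
#define SKETCH_ANY_DEVICE (-1)            /* hypothetical wildcard bind */

/* Hypothetical listen-table entry: a service ID plus the device the
 * listen was bound to (or SKETCH_ANY_DEVICE for a wildcard bind). */
struct sketch_listen {
        uint64_t service_id;
        int      device;
};

/* Return 0 if some listen matches the request, otherwise the reject
 * reason this toy model would send back. */
static int sketch_match_request(const struct sketch_listen *tbl, size_t n,
                                uint64_t service_id, int arrival_device)
{
        size_t i;

        for (i = 0; i < n; i++) {
                if (tbl[i].service_id != service_id)
                        continue;
                if (tbl[i].device == SKETCH_ANY_DEVICE ||
                    tbl[i].device == arrival_device)
                        return 0;
        }
        return SKETCH_REJ_INVALID_SERVICE_ID;
}
```

In this model, a listen bound to device 0 accepts a request arriving
on device 0 but rejects the same service ID arriving on device 1,
while a wildcard listen accepts both.  Whether anything like this
applies to the real stack is exactly what I am asking.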

-- 

                Cheers,
                        Eric





