[openib-general] IB_CM_REJ_INVALID_SERVICE_ID
Eric Barton
eeb at bartonsoftware.com
Thu Jan 4 06:29:18 PST 2007
Sean,
> Eric Barton wrote:
> > Can an rdma_connect be rejected with IB_CM_REJ_INVALID_SERVICE_ID
> > for any other reason than the peer isn't listening with the
> > correct service number?
>
> This should only occur if the remote peer isn't listening. This
> reject code is automatically sent by the ib_cm when a request does
> not find a corresponding listen.
>
> >>We are testing 1.6b5 for an InfiniBand cluster with RHEL 4. We use
> >>the binaries provided by CFS and use OFED 1.1 as the IB stack.
> >>
> >>Several times, some of the clients hang during fs mount or when
> >>an OST is added (see log). Error: LustreError:
> >>1776:0:(o2iblnd_cb.c:2314:kiblnd_rejected())
> >>10.0.90.8@o2ib rejected: reason 8, size 148
>
> Is this event = 8 and status = 8?
yes
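
For reference, a minimal sketch of how a consumer might turn the reject
reason into something readable. The enum below is a local, partial copy
for illustration only: the value 8 is the one quoted in this thread, the
values for the stale-connection and consumer-defined reasons are
assumptions that should be checked against the stack's ib_cm.h, and all
sketch_* names are hypothetical.

```c
/* Local, partial copy of reject reasons, for illustration only.
 * The authoritative list is enum ib_cm_rej_reason in ib_cm.h. */
enum sketch_rej_reason {
    SKETCH_REJ_INVALID_SERVICE_ID = 8,  /* value quoted in this thread */
    SKETCH_REJ_STALE_CONN         = 10, /* assumed; verify against ib_cm.h */
    SKETCH_REJ_CONSUMER_DEFINED   = 28  /* assumed; verify against ib_cm.h */
};

/* Map a reject reason to a short description for log messages. */
const char *sketch_rej_reason_str(int reason)
{
    switch (reason) {
    case SKETCH_REJ_INVALID_SERVICE_ID:
        return "invalid service ID (no matching listen on the peer)";
    case SKETCH_REJ_STALE_CONN:
        return "stale connection";
    case SKETCH_REJ_CONSUMER_DEFINED:
        return "rejected by the peer's consumer";
    default:
        return "other (see ib_cm.h)";
    }
}

/* Returns 1 when the reason says the CM found no listener for the
 * request, i.e. the reject was generated by the ib_cm itself rather
 * than by the peer's consumer; returns 0 otherwise. */
int sketch_rej_means_no_listener(int reason)
{
    return reason == SKETCH_REJ_INVALID_SERVICE_ID;
}
```

With reason 8 from the log above, sketch_rej_means_no_listener() returns
1, which matches Sean's description of the CM rejecting a request that
finds no corresponding listen.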
> >>from OFED: enum ib_cm_rej_reason {
> >> IB_CM_REJ_INVALID_SERVICE_ID = 8,
> >>
> >>Once an IPoIB ping is started to the corresponding OST the client
> >>continues. Afterwards it is quite stable.
> >
> >
> > ...which seems to be saying that just doing an IPoIB ping to the
> > server was enough to make rdma_connect() work OK.
>
> I can't explain the relationship between the ping and the connect
> starting to work.
some more from the customer...
> We have removed the two Mellanox cards from the OSS and put a single
> Voltaire card in it. This seems to work. I could connect 60 nodes
> without errors. We will further investigate this to better
> understand the cause of the original problem.
...I hadn't realised they had more than one HCA. We bind the listen ID to a
specific IP address - could that have some bearing? AFAICS from the
customer's debug logs, they are listening on the correct HCA...
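
To make the binding question concrete, here is a rough sketch of the two
addresses a listener could be bound to: one specific IP, or the wildcard.
As far as I understand the RDMA CM, binding to a specific address also
ties the listen to the device that owns that address, while a wildcard
bind listens across all RDMA devices. The helper name is hypothetical,
the port is a placeholder, and the IP is just the one from the log above;
this is not the o2iblnd code.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helper: fill in the address a listener would be bound to.
 * ip == NULL selects the wildcard address (INADDR_ANY); otherwise ip must
 * be a dotted-quad IPv4 string. Returns 0 on success, -1 on a bad IP.
 * The result is what would be handed to rdma_bind_addr() as its
 * struct sockaddr argument before calling rdma_listen(). */
int sketch_make_listen_addr(struct sockaddr_in *sin, const char *ip,
                            unsigned short port)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);

    if (ip == NULL) {
        /* Wildcard: not tied to any one interface or HCA. */
        sin->sin_addr.s_addr = htonl(INADDR_ANY);
        return 0;
    }

    /* Specific address: the listen follows whichever device owns it. */
    return inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1;
}
```

If the IPoIB address ends up on a different HCA (or port) than the one
the client's request arrives on, a listen bound to the specific address
would not match, whereas a wildcard listen would. That is only a
hypothesis about this case, not something the logs confirm.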
--
Cheers,
Eric