[ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
Jeff Squyres
jsquyres at cisco.com
Wed May 13 12:59:18 PDT 2009
Ok, I figured it out. I have some creative
/etc/sysconfig/network-scripts/ifcfg-ib* scripts that may choose to do
nothing if no device is present (or some other esoteric,
specific-to-jeffs-cluster criterion is met) -- they call "exit 0" in
this case. This apparently causes the top-level /etc/init.d/openibd to
exit (!). I've fixed this (they now never call "exit"); now everything
works as expected.
Upon reflection, I can see that this was totally my fault -- ifcfg-*
scripts are always sourced and should therefore never call "exit".
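
To illustrate the sourcing behavior described above, here is a minimal
sketch. The two temporary files are hypothetical stand-ins for ifcfg-*
fragments, and each "parent" is just an sh process standing in for the
script that sources them; none of this is taken from the real openibd
script.

```shell
#!/bin/sh
# Hypothetical stand-ins for ifcfg-* fragments; not the real OFED files.
tmpdir=$(mktemp -d)

# Fragment that bails out with "exit": this ends whichever shell sources it.
cat > "$tmpdir/ifcfg-bad" <<'EOF'
[ -e /nonexistent/device ] || exit 0
EOF

# Fragment that bails out with "return": only the sourced file stops.
cat > "$tmpdir/ifcfg-good" <<'EOF'
[ -e /nonexistent/device ] || return 0
EOF

# Each parent runs in its own sh process so this demo survives the "exit".
bad_out=$(sh -c '. "$1"; echo "parent continued"' sh "$tmpdir/ifcfg-bad")
good_out=$(sh -c '. "$1"; echo "parent continued"' sh "$tmpdir/ifcfg-good")

echo "with exit:   '${bad_out}'"
echo "with return: '${good_out}'"

rm -rf "$tmpdir"
```

The parent that sources the "exit" fragment never reaches its echo, and
it still ends with status 0, so nothing looks wrong from the outside.
Using "return" (or simply skipping the rest of the fragment with an
if-block) lets the sourcing script carry on.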
But given that /etc/init.d/openibd is sooo complex and has sooo many
moving parts, it would be nice if there were a way to track down
problems a little more easily; perhaps a "verbose" setting in
/etc/infiniband/openibd.conf, or somesuch. Indeed, since OFED is
targeted at the datacenter, monitors attached to the servers in
question and/or serial consoles may not be readily available. Hence,
having the ability to drop some verbose output into syslog during boot,
for example, might be quite useful to sysadmins/network admins when
troubleshooting.
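
As a rough sketch of what such a verbose mode might look like (this is
not an existing OFED feature): OPENIBD_VERBOSE, LOG_CMD, log_verbose and
run_step are all hypothetical names invented for this example, and the
only real tool assumed is the standard logger(1) command with its -t
and -p options.

```shell
#!/bin/sh
# OPENIBD_VERBOSE is a hypothetical setting; it is not an existing OFED option.
OPENIBD_VERBOSE=${OPENIBD_VERBOSE:-no}
# Command used to emit messages; overridable so the helper can be tried
# without writing to syslog (for example LOG_CMD=echo).
LOG_CMD=${LOG_CMD:-logger -t openibd -p daemon.info}

# Send a progress message to syslog, but only when verbose mode is on.
log_verbose() {
    [ "$OPENIBD_VERBOSE" = "yes" ] || return 0
    $LOG_CMD "$*"
}

# Run one step, recording what was attempted and how it ended.
run_step() {
    desc=$1; shift
    log_verbose "starting: $desc"
    "$@"
    rc=$?
    log_verbose "finished: $desc (exit status $rc)"
    return $rc
}

# Example call; with verbose mode off (the default) nothing is logged.
run_step "example step" true
```

With something along these lines, a step that silently terminates the
script would leave a "starting:" line in syslog with no matching
"finished:" line, which points at where to look.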
Just my $0.02.
Thanks for the tips on where to look, Woody!
On May 13, 2009, at 3:18 PM, Jeff Squyres (jsquyres) wrote:
> On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote:
>
> > Check to see if some other driver failed to load.
> > I think I have seen before that if another driver
> > fails to load, the start script bails out and
> > does not load the other drivers.
> >
> > Perhaps try doing a /etc/init.d/openibd restart
> > manually to see if something is failing to load.
> >
>
> Weird -- doing it manually shows no problem:
>
> [root@svbu-mpi055 ~]# /etc/init.d/openibd restart
> Unloading HCA driver: [ OK ]
> Loading HCA driver and Access Layer: [ OK ]
> Setting up InfiniBand network interfaces:
> Bringing up interface ib0: [ OK ]
> Bringing up interface ib1: [ OK ]
> Setting up service network . . . [ done ]
> [root@svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm
> crw-rw-rw- 1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm
> [root@svbu-mpi055 ~]#
>
> Something must be going wrong during the bootup. I'm unfortunately
> several thousand miles from the server and don't have a serial
> console. I guess I'll insert some initlog's in /etc/init.d/openibd...
>
> --
> Jeff Squyres
> Cisco Systems
--
Jeff Squyres
Cisco Systems