[ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created

Jeff Squyres jsquyres at cisco.com
Wed May 13 12:59:18 PDT 2009


Ok, I figured it out.  I have some creative
/etc/sysconfig/network-scripts/ifcfg-ib* scripts that may choose to do
nothing if no device is present (or some other esoteric,
specific-to-jeffs-cluster criterion is met) -- they call "exit 0" in
this case.  This apparently causes the top-level /etc/init.d/openibd
to exit (!).  I've fixed this (they now never call "exit"); now
everything works as expected.

Upon reflection, I can see that this was totally my fault -- ifcfg-*  
scripts are always sourced and should therefore never call "exit".
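
For anyone who hits the same thing, here is a minimal sketch of the
failure mode.  The two fragment files are hypothetical stand-ins, not
real ifcfg-* files, and each one is sourced inside a subshell so the
demo itself keeps running.  It only illustrates the general shell
behavior: "exit" in a sourced file ends the shell that sourced it,
while "return" only leaves the sourced file.

```shell
#!/bin/sh
# Hypothetical stand-ins for ifcfg-* fragments; not the real OFED files.
tmpdir=$(mktemp -d)

# A fragment that bails out with "exit 0" when it decides to do nothing.
cat > "$tmpdir/ifcfg-exit" <<'EOF'
exit 0
EOF

# The same fragment using "return 0", which only leaves the sourced file.
cat > "$tmpdir/ifcfg-return" <<'EOF'
return 0
EOF

# Source each fragment in a subshell (command substitution) and then try
# to carry on, the way an init script would after sourcing a fragment.
out_exit=$( . "$tmpdir/ifcfg-exit"; echo "caller continued" )
out_return=$( . "$tmpdir/ifcfg-return"; echo "caller continued" )

echo "after exit:   '${out_exit}'"
echo "after return: '${out_return}'"

rm -rf "$tmpdir"
```

In the "exit" case the echo after the source never runs, which matches
what I saw with the init script stopping partway through.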

But given that /etc/init.d/openibd is sooo complex and has sooo many
moving parts, it would be nice if there were a way to track down
problems a little more easily; perhaps a "verbose" setting in
/etc/infiniband/openibd.conf, or somesuch.  Indeed, since OFED is
targeted at the datacenter, monitors attached to the servers in
question and/or serial consoles may not be readily available.  Hence,
having the ability to drop some verbose output into syslog during
boot, for example, might be quite useful to sysadmins/network admins
when troubleshooting.
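
To make the suggestion concrete, something along these lines is what I
have in mind.  This is only a sketch: the VERBOSE variable and the
log_verbose helper are hypothetical names (nothing like them exists in
openibd.conf as far as I know), and it assumes the standard logger(1)
utility is installed.

```shell
#!/bin/sh
# Hypothetical setting; imagined as coming from /etc/infiniband/openibd.conf.
VERBOSE=${VERBOSE:-yes}

# Hypothetical helper: when VERBOSE=yes, send a tagged message to syslog
# via logger(1) if it is available, and also echo it for any console.
log_verbose() {
    [ "$VERBOSE" = "yes" ] || return 0
    if command -v logger >/dev/null 2>&1; then
        # Ignore logger failures (e.g. no syslog socket this early in boot).
        logger -t openibd -p daemon.info "$*" 2>/dev/null || true
    fi
    echo "openibd: $*"
}

# Example call, as it might appear before a driver load step.
log_verbose "loading HCA driver"
```

With VERBOSE set to anything other than "yes" the helper does nothing,
so the default boot output would stay as it is today.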

Just my $0.02.

Thanks for the tips on where to look, Woody!



On May 13, 2009, at 3:18 PM, Jeff Squyres (jsquyres) wrote:

> On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote:
>
> > Check to see if some other driver failed to load.
> > I think I have seen before that if another driver
> > fails to load, the start script bails out and
> > does not load the other drivers.
> >
> > Perhaps try doing a /etc/init.d/openibd restart
> > manually to see if something is failing to load.
> >
>
> Weird -- doing it manually shows no problem:
>
> [root@svbu-mpi055 ~]# /etc/init.d/openibd restart
> Unloading HCA driver:                                      [  OK  ]
> Loading HCA driver and Access Layer:                       [  OK  ]
> Setting up InfiniBand network interfaces:
> Bringing up interface ib0:                                 [  OK  ]
> Bringing up interface ib1:                                 [  OK  ]
> Setting up service network . . .                           [  done  ]
> [root@svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm
> crw-rw-rw-  1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm
> [root@svbu-mpi055 ~]#
>
> Something must be going wrong during the bootup.  I'm unfortunately
> several thousand miles from the server and don't have a serial
> console.  I guess I'll insert some initlog's in /etc/init.d/openibd...
>
> --
> Jeff Squyres
> Cisco Systems


-- 
Jeff Squyres
Cisco Systems



