[ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver

Jack Morgenstein jackm at dev.mellanox.co.il
Tue Feb 17 07:42:38 PST 2009


We have found a race condition in sysfs.c which occurs when unloading low-level modules
(e.g., mlx4_ib) in the driver.
Specifically:

Although the kernel takes reference counts on sysfs files, it does not take such counts
on modules which implement attribute reads.

For example, we have:
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
			      char *buf)
{
	struct port_table_attribute *tab_attr =
		container_of(attr, struct port_table_attribute, attr);
	u16 pkey;
	ssize_t ret;
====>race condition HERE *****
	ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
	if (ret)
		return ret;

	return sprintf(buf, "0x%04x\n", pkey);
}

The sysfs file /sys/class/infiniband/<device>/ports/1/pkey/<pkey number> is protected
from destruction while we are in show_port_pkey.
However, the underlying module which implements ib_query_pkey (in this case, mlx4_ib) is not.

Thus, if another process is busy unloading mlx4_ib, and the time-slice of the process
which is reading sysfs expires at the point indicated above in the code, ib_query_pkey()
will fail with a page-fault (kernel panic), since it will not find the code page which implements
ib_query_pkey() (inlined to the query_pkey() function in the low-level driver virtual function table).

Now, when a low-level driver is unloaded, the following procedure (in sysfs.c) is called:
void ib_device_unregister_sysfs(struct ib_device *device)
{
	struct kobject *p, *t;
	struct ib_port *port;

	list_for_each_entry_safe(p, t, &device->port_list, entry) {
		list_del(&p->entry);
		port = container_of(p, struct ib_port, kobj);
		mutex_lock(&port->mutex);
		port->valid = 0;
		sysfs_remove_group(p, &pma_group);
		sysfs_remove_group(p, &port->pkey_group);
		sysfs_remove_group(p, &port->gid_group);
		mutex_unlock(&port->mutex);
		kobject_put(p);
	}

	kobject_put(device->ports_parent);
	device_unregister(&device->dev);
}

After this call, the kernel continues with unloading the low-level module.
However, until device_unregister(&device->dev) is invoked, the
sysfs attribute path for the low-level device is still valid.  Hence the race condition -- 

Process A			    Process B
---------                       ---------------
1. starts unloading low-level mod
				2. cat /sys/class/infiniband/...
                                3. Time slice runs out just before accessing low-level
                                   module for requested info.
4. Low-level module is fully unloaded
				5. Page-fault panic when trying to access a procedure in
                                   the just-unloaded module.

Some attempt was made for some (but not all) of the "show" procedures to check if the module is alive:
	if (!ibdev_is_alive(p->ibdev))
		return -ENODEV;

This narrows the race window considerably, but does not eliminate it. (I put this fix in show_port_pkey(),
and was still able to generate the kernel panic).

The only way I was able to eliminate the kernel panic entirely was via a mutex (declaration and init not shown):
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
			      char *buf)
{
	struct port_table_attribute *tab_attr =
		container_of(attr, struct port_table_attribute, attr);
	u16 pkey;
	ssize_t ret;

	mutex_lock(&p->mutex);
==>	if (p->valid)
		ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
	else
		ret = -EINVAL;
==>	mutex_unlock(&p->mutex);
	if (ret)
		return ret;

	return sprintf(buf, "0x%04x\n", pkey);
}

and:
void ib_device_unregister_sysfs(struct ib_device *device)
{
	struct kobject *p, *t;
	struct ib_port *port;

	list_for_each_entry_safe(p, t, &device->port_list, entry) {
		list_del(&p->entry);
		port = container_of(p, struct ib_port, kobj);
==>		mutex_lock(&port->mutex);
		port->valid = 0;
		sysfs_remove_group(p, &pma_group);
		sysfs_remove_group(p, &port->pkey_group);
		sysfs_remove_group(p, &port->gid_group);
==>		mutex_unlock(&port->mutex);
		kobject_put(p);
	}

	kobject_put(device->ports_parent);
	device_unregister(&device->dev);
}

This is approach is fine for the port-based groups.  What about class-device attributes themselves?
I believe that the best approach is to add a sysfs_mutex to ib_device, and lock that for ALL "show" methods
in this file.

Opinions?

- Jack



More information about the general mailing list