[ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver
Jack Morgenstein
jackm at dev.mellanox.co.il
Tue Feb 17 07:42:38 PST 2009
We have found a race condition in sysfs.c which occurs when unloading low-level modules
(e.g., mlx4_ib) in the driver.
Specifically:
Although the kernel takes reference counts on sysfs files, it does not take such counts
on modules which implement attribute reads.
For example, we have:
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
char *buf)
{
struct port_table_attribute *tab_attr =
container_of(attr, struct port_table_attribute, attr);
u16 pkey;
ssize_t ret;
====>race condition HERE *****
ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
if (ret)
return ret;
return sprintf(buf, "0x%04x\n", pkey);
}
The sysfs file /sys/class/infiniband/<device>/ports/1/pkey/<pkey number> is protected
from destruction while we are in show_port_pkey.
However, the underlying module which implements ib_query_pkey (in this case, mlx4_ib) is not.
Thus, if another process is busy unloading mlx4_ib, and the time-slice of the process
which is reading sysfs expires at the point indicated above in the code, ib_query_pkey()
will fail with a page-fault (kernel panic), since it will not find the code page which implements
ib_query_pkey() (inlined to the query_pkey() function in the low-level driver virtual function table).
Now, when a low-level driver is unloaded, the following procedure (in sysfs.c) is called:
void ib_device_unregister_sysfs(struct ib_device *device)
{
struct kobject *p, *t;
struct ib_port *port;
list_for_each_entry_safe(p, t, &device->port_list, entry) {
list_del(&p->entry);
port = container_of(p, struct ib_port, kobj);
mutex_lock(&port->mutex);
port->valid = 0;
sysfs_remove_group(p, &pma_group);
sysfs_remove_group(p, &port->pkey_group);
sysfs_remove_group(p, &port->gid_group);
mutex_unlock(&port->mutex);
kobject_put(p);
}
kobject_put(device->ports_parent);
device_unregister(&device->dev);
}
After this call, the kernel continues with unloading the low-level module.
However, until device_unregister(&device->dev) is invoked, the
sysfs attribute path for the low-level device is still valid. Hence the race condition --
Process A Process B
--------- ---------------
1. starts unloading low-level mod
2. cat /sys/class/infiniband/...
3. Time slice runs out just before accessing low-level
module for requested info.
4. Low-level module is fully unloaded
5. Page-fault panic when trying to access a procedure in
the just-unloaded module.
Some attempt was made for some (but not all) of the "show" procedures to check if the module is alive:
if (!ibdev_is_alive(p->ibdev))
return -ENODEV;
This narrows the race window considerably, but does not eliminate it. (I put this fix in show_port_pkey(),
and was still able to generate the kernel panic).
The only way I was able to eliminate the kernel panic entirely was via a mutex (declaration and init not shown):
static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
char *buf)
{
struct port_table_attribute *tab_attr =
container_of(attr, struct port_table_attribute, attr);
u16 pkey;
ssize_t ret;
mutex_lock(&p->mutex);
==> if (p->valid)
ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey);
else
ret = -EINVAL;
==> mutex_unlock(&p->mutex);
if (ret)
return ret;
return sprintf(buf, "0x%04x\n", pkey);
}
and:
void ib_device_unregister_sysfs(struct ib_device *device)
{
struct kobject *p, *t;
struct ib_port *port;
list_for_each_entry_safe(p, t, &device->port_list, entry) {
list_del(&p->entry);
port = container_of(p, struct ib_port, kobj);
==> mutex_lock(&port->mutex);
port->valid = 0;
sysfs_remove_group(p, &pma_group);
sysfs_remove_group(p, &port->pkey_group);
sysfs_remove_group(p, &port->gid_group);
==> mutex_unlock(&port->mutex);
kobject_put(p);
}
kobject_put(device->ports_parent);
device_unregister(&device->dev);
}
This is approach is fine for the port-based groups. What about class-device attributes themselves?
I believe that the best approach is to add a sysfs_mutex to ib_device, and lock that for ALL "show" methods
in this file.
Opinions?
- Jack
More information about the general
mailing list