[ofa-general] 2.6.16.46-0.12-SLERT-10-15: scheduling while atomic ?

Wed Feb 11 08:21:13 PST 2009

Hello,

On a SUSE real-time kernel (2.6.16.46-0.12-SLERT-10-15), we get the
following kernel stack trace while running SDP traffic:

scheduling while atomic: ib_cm/4/0x00000001/18293

Call Trace:
       <ffffffff80324eed>{__sched_text_start+125}
       <ffffffff801375ca>{lock_timer_base+27}
       <ffffffff80327a03>{_spin_unlock_irqrestore+53}
       <ffffffff80137dc3>{__mod_timer+439}
       <ffffffff80326870>{schedule_timeout+208}
       <ffffffff80137fe5>{process_timeout+0}
       <ffffffff80327a43>{_spin_unlock_irq+52}
       <ffffffff80326232>{wait_for_completion_timeout+127}
       <ffffffff80126fd4>{default_wake_function+0}
       <ffffffff88515af6>{:mlx4_core:__mlx4_cmd+318}
       <ffffffff8851c148>{:mlx4_core:mlx4_mr_free+73}
       <ffffffff8852f0b8>{:mlx4_ib:mlx4_ib_dereg_mr+23}
       <ffffffff884cfb9b>{:ib_core:ib_dereg_mr+26}
       <ffffffff88633592>{:ib_sdp:sdp_destroy_qp+161}
       <ffffffff88633c6d>{:ib_sdp:sdp_reset_sk+276}
       <ffffffff88637f43>{:ib_sdp:sdp_cma_handler+2008}
       <ffffffff885f9224>{:ib_cm:cm_work_handler+0}
       <ffffffff886284f4>{:rdma_cm:cma_modify_qp_err+72}
       <ffffffff80125876>{__wake_up_common+62}
       <ffffffff80327a03>{_spin_unlock_irqrestore+53}
       <ffffffff885f9224>{:ib_cm:cm_work_handler+0}
       <ffffffff88629c25>{:rdma_cm:cma_ib_handler+369}
       <ffffffff885f7e52>{:ib_cm:cm_process_work+26}
       <ffffffff885f95fe>{:ib_cm:cm_work_handler+986}
       <ffffffff885f9224>{:ib_cm:cm_work_handler+0}
       <ffffffff8013f91e>{run_workqueue+154}
       <ffffffff80324e76>{__sched_text_start+6}
       <ffffffff8013ffb2>{worker_thread+0}
       <ffffffff8014162a>{keventd_create_kthread+0}
       <ffffffff801400ae>{worker_thread+252}
       <ffffffff80126fd4>{default_wake_function+0}
       <ffffffff8014162a>{keventd_create_kthread+0}
       <ffffffff8014190a>{kthread+212}
       <ffffffff8015865c>{hracct_exit_syscall+22}
       <ffffffff8010bd5e>{child_rip+8}
       <ffffffff8014162a>{keventd_create_kthread+0}
       <ffffffff80141836>{kthread+0}
       <ffffffff8010bd56>{child_rip+0}

The OFA kernel package in place is:

git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
commit 88ab7955605c5e769e760f6bec980e0c2e72aa5c

Searching the latest kernel source for the "scheduling while atomic"
message, we see that it is printed by __schedule_bug(), which is called
from this function:

/*
 * Various schedule()-time debugging checks and statistics:
 */
static inline void schedule_debug(struct task_struct *prev)
{
	/*
	 * Test if we are atomic. Since do_exit() needs to call into
	 * schedule() atomically, we ignore that path for now.
	 * Otherwise, whine if we are scheduling when we should not be.
	 */
	if (unlikely(in_atomic_preempt_off() && !prev->exit_state))
		__schedule_bug(prev);

	profile_hit(SCHED_PROFILING, __builtin_return_address(0));

	schedstat_inc(this_rq(), sched_count);
#ifdef CONFIG_SCHEDSTATS
	if (unlikely(prev->lock_depth >= 0)) {
		schedstat_inc(this_rq(), bkl_count);
		schedstat_inc(prev, sched_info.bkl_count);
	}
#endif
}
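
For reference, the reported message can be split into its fields. What
follows is only a minimal user-space sketch, not kernel code. It assumes
the 2.6.16-era message format "comm/preempt_count/pid" and the stock
2.6.16 preempt_count layout (8 preemption bits, 8 softirq bits, 12
hardirq bits); the SLERT real-time patches may use a different layout.
The struct and function names are hypothetical helpers introduced for
illustration.

```c
#include <stdio.h>
#include <string.h>

/* ASSUMPTION: stock 2.6.16 field widths; the RT-patched kernel may differ. */
#define PREEMPT_BITS  8
#define SOFTIRQ_BITS  8
#define HARDIRQ_BITS 12

/* Hypothetical container for the three fields of the message. */
struct atomic_report {
	char comm[32];              /* task name, may itself contain '/' */
	unsigned int preempt_count; /* raw preempt_count value */
	int pid;
};

/*
 * Parse "scheduling while atomic: <comm>/<preempt_count>/<pid>".
 * The line is split from the right because comm ("ib_cm/4") can
 * contain '/'. Returns 0 on success, -1 on a malformed line.
 */
static int parse_report(const char *line, struct atomic_report *r)
{
	static const char prefix[] = "scheduling while atomic: ";
	const char *body, *pid_sep, *cnt_sep;
	size_t comm_len;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return -1;
	body = line + sizeof(prefix) - 1;

	/* Last '/' precedes the pid. */
	pid_sep = strrchr(body, '/');
	if (!pid_sep || pid_sep == body)
		return -1;

	/* The '/' before that precedes the preempt_count. */
	cnt_sep = pid_sep - 1;
	while (cnt_sep > body && *cnt_sep != '/')
		cnt_sep--;
	if (*cnt_sep != '/' || cnt_sep == body)
		return -1;

	comm_len = (size_t)(cnt_sep - body);
	if (comm_len >= sizeof(r->comm))
		return -1;
	memcpy(r->comm, body, comm_len);
	r->comm[comm_len] = '\0';

	if (sscanf(cnt_sep + 1, "%x", &r->preempt_count) != 1)
		return -1;
	if (sscanf(pid_sep + 1, "%d", &r->pid) != 1)
		return -1;
	return 0;
}

/* Nesting depth of preempt_disable() (includes held spinlocks). */
static unsigned int preempt_depth(unsigned int c)
{
	return c & ((1u << PREEMPT_BITS) - 1);
}

/* Softirq nesting depth. */
static unsigned int softirq_depth(unsigned int c)
{
	return (c >> PREEMPT_BITS) & ((1u << SOFTIRQ_BITS) - 1);
}

/* Hardirq nesting depth. */
static unsigned int hardirq_depth(unsigned int c)
{
	return (c >> (PREEMPT_BITS + SOFTIRQ_BITS)) & ((1u << HARDIRQ_BITS) - 1);
}
```

Under these assumptions, calling parse_report() on the message above
yields the task name "ib_cm/4", a preempt_count of 0x00000001 and pid
18293. The three depth helpers then return 1, 0 and 0, which would
suggest that preemption was disabled once (for instance by a held
spinlock) and that the task was not in softirq or hardirq context.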

Any idea what is going wrong here?

Thanks for your help,

Vincent