[ewg] possible bug in rds?

Eli Cohen eli at dev.mellanox.co.il
Wed Mar 10 08:13:01 PST 2010


Hi Andy,

in our regression tests we've encountered a kernel oops with the
following stack dump:

<start quote>
Call trace: 
Mar  1 05:45:50 sw134 kernel: mlx4_en: eth2: Link Down 
Mar  1 05:46:00 sw134 kernel: mlx4_en: eth2: Link Up 
Mar  1 05:46:00 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Mar  1 05:46:01 sw134 /usr/sbin/cron[16940]: (root) CMD (/mswg/projects/test_suite2/etc/check_daemon.csh >/dev/null) 
Mar  1 05:46:01 sw134 /usr/sbin/cron[16941]: (root) CMD (/usr/check_mswg.csh >/dev/null) 
Mar  1 05:46:01 sw134 /usr/sbin/cron[16942]: (root) CMD (/.autodirect/LIT/CRONTABS/do_it_now.sh > /dev/null) 
Mar  1 05:46:03 sw134 kernel: Unable to handle kernel paging request at 0000000000200200 RIP:
Mar  1 05:46:03 sw134 kernel: <ffffffff88427b8e>{:rdma_cm:rdma_destroy_id+399} 
Mar  1 05:46:03 sw134 kernel: PGD 0 
Mar  1 05:46:03 sw134 kernel: Oops: 0002 [1] SMP 
Mar  1 05:46:03 sw134 kernel: last sysfs file: /class/infiniband/mlx4_0/ports/1/gids/127 
Mar  1 05:46:03 sw134 kernel: CPU 0 
Mar  1 05:46:03 sw134 kernel: Modules linked in: 8021q mst_pciconf mst_pci rdma_ucm rds_tcp rds_rdma rds ib_ucm ib_sdp rdma_cm iw_cm ib_addr ib_cm ib_sa ib_uverbs ib_umad mlx4_en mlx4_core ib_mad ib_core memtrack autofs4 cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table nfs lockd nfs_acl sunrpc ipv6 af_packet dock button battery ac apparmor nls_iso8859_1 nls_cp437 vfat fat loop dm_mod ohci_hcd ide_cd cdrom generic ehci_hcd shpchp pci_hotplug i2c_piix4 i2c_core usbcore mptctl tg3 floppy ext3 jbd edd fan thermal processor mptsas mptscsih sg mptbase scsi_transport_sas sata_svw libata serverworks sd_mod scsi_mod ide_disk ide_core 
Mar  1 05:46:03 sw134 kernel: Pid: 15000, comm: krdsd Tainted: GU 2.6.16.60-0.54.5-smp #1
Mar  1 05:46:03 sw134 kernel: RIP: 0010:[<ffffffff88427b8e>] <ffffffff88427b8e>{:rdma_cm:rdma_destroy_id+399} 
Mar  1 05:46:03 sw134 kernel: RSP: 0018:ffff81000dad7dd8  EFLAGS: 00010206 
Mar  1 05:46:03 sw134 kernel: RAX: 0000000000100100 RBX: ffff81012d2ba740 RCX: 0000000000200200 
Mar  1 05:46:03 sw134 kernel: RDX: ffff81010ee445b8 RSI: ffff8101248c0048 RDI: ffff81012bdaf800 
Mar  1 05:46:03 sw134 kernel: RBP: ffff81010ee44400 R08: 0000000000000000 R09: 0000000000000000 
Mar  1 05:46:03 sw134 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8101248c0048 
Mar  1 05:46:03 sw134 kernel: R13: ffff8101248c0290 R14: ffffffff8846ca40 R15: 0000000000000000 
Mar  1 05:46:03 sw134 kernel: FS:  00002b2f96622ae0(0000) GS:ffffffff803dc000(0000) knlGS:0000000000000000 
Mar  1 05:46:03 sw134 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b 
Mar  1 05:46:03 sw134 kernel: CR2: 0000000000200200 CR3: 0000000000101000 CR4: 00000000000006e0 
Mar  1 05:46:03 sw134 kernel: Process krdsd (pid: 15000, threadinfo ffff81000dad6000, task ffff810126bf5040) 
Mar  1 05:46:03 sw134 kernel: Stack: ffff81012bdaf800 ffff81012bdaf800 ffff81010e7fa000 ffffffff884891f4 
Mar  1 05:46:03 sw134 kernel:        ffff81000100f700 0000230b363c2cf0 000000000002220f 000000000f5eb200 
Mar  1 05:46:03 sw134 kernel:        ffff8101248c0048 ffffffff802f0652 
Mar  1 05:46:03 sw134 kernel: Call Trace: <ffffffff884891f4>{:rds_rdma:rds_ib_conn_shutdown+477} 
Mar  1 05:46:03 sw134 kernel:        <ffffffff802f0652>{mutex_lock+13} <ffffffff8846cae3>{:rds:rds_shutdown_worker+163} 
Mar  1 05:46:04 sw134 kernel: <ffffffff80144e8e>{run_workqueue+139} <ffffffff8014559c>{worker_thread+0} 
Mar  1 05:46:04 sw134 kernel: <ffffffff80148525>{keventd_create_kthread+0} <ffffffff80145690>{worker_thread+244} 
Mar  1 05:46:04 sw134 kernel: <ffffffff8012cf89>{default_wake_function+0} <ffffffff801487ed>{kthread+236} 
Mar  1 05:46:04 sw134 kernel:        <ffffffff8010bea6>{child_rip+8} <ffffffff80148525>{keventd_create_kthread+0} 
Mar  1 05:46:04 sw134 kernel:        <ffffffff80148701>{kthread+0} <ffffffff8010be9e>{child_rip+0} 
Mar  1 05:46:04 sw134 kernel: 
Mar  1 05:46:04 sw134 kernel: Code: 48 89 01 74 04 48 89 48 08 48 c7 85 b8 01 00 00 00 01 10 00 
Mar  1 05:46:04 sw134 kernel: RIP <ffffffff88427b8e>{:rdma_cm:rdma_destroy_id+399} RSP <ffff81000dad7dd8> 
Mar  1 05:46:04 sw134 kernel: CR2: 0000000000200200 
Mar  1 05:46:09 sw134 kernel:  <6>mlx4_en: eth2: Link Down 
Mar  1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Up 
Mar  1 05:46:20 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready 
Mar  1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Down 
Mar  1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Up 
Mar  1 05:46:21 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready 
<end quote>

Examining the dump, I see that the failure results from calling
hlist_del() twice on the same node (I can tell from the poisoned
pointer in RCX: 0000000000200200).
Could it be that rds calls rdma_destroy_id() twice on the same id,
which would result in the described behaviour?
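
To make the reasoning about the poisoned pointer easier to check, here is a
small sketch that compares register values from the oops against the kernel's
list-poison constants and decodes the "Oops: 0002" error code. The helper
names poison_name() and decode_fault() are hypothetical and exist only for
this illustration. The register values are copied from the dump above, and
the constants are the LIST_POISON1/LIST_POISON2 values that list_del() and
hlist_del() store into a node after unlinking it.

```python
# Illustrative sketch only: poison_name() and decode_fault() are
# hypothetical helpers, not kernel or library APIs.

# Values that list_del()/hlist_del() store into an unlinked node.
LIST_POISON1 = 0x00100100  # stored in ->next
LIST_POISON2 = 0x00200200  # stored in ->prev (list) or ->pprev (hlist)

# Register values copied from the oops in this report.
REGS = {
    "RAX": 0x0000000000100100,
    "RCX": 0x0000000000200200,
    "CR2": 0x0000000000200200,
}


def poison_name(value):
    """Return the name of the poison constant equal to value, or None."""
    if value == LIST_POISON1:
        return "LIST_POISON1"
    if value == LIST_POISON2:
        return "LIST_POISON2"
    return None


def decode_fault(error_code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(error_code & 0x1),  # 0 means page not present
        "write": bool(error_code & 0x2),         # 1 means write access
        "user_mode": bool(error_code & 0x4),     # 0 means kernel mode
    }


for reg, val in REGS.items():
    print(reg, hex(val), poison_name(val))

# "Oops: 0002" is printed in hexadecimal.
print(decode_fault(0x0002))
```

If this reading is right, RAX and RCX hold the two poison values, which is
what one would expect when the node had already been unlinked once. The
faulting address in CR2 equals LIST_POISON2, and the error code indicates a
kernel-mode write to a non-present page, which fits a second unlink attempt
storing through the poisoned pprev pointer.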





