I've been fighting this bug for 2 months now. Sometimes I get a couple of weeks of uptime, sometimes only a couple of days. It's very weird, and it persists across kernel changes.
My setup is a dual quad-core Xeon E5405 (6 MB cache) on a Supermicro X7DCL-i motherboard with 8 GB of DDR2 ECC RAM. The hardware completed a 48-hour memtest successfully, so I'm quite confident it's not a motherboard/RAM/hardware issue. The BIOS is the latest available (8/18/2008 from Supermicro).
I've seen the bug occur randomly with both the 2.6.24-18-xen and 2.6.24-19-xen kernels. The process that dies can be anything in dom0 or in a domU, and the crash seems unrelated to the actual process/module being executed. It does seem somewhat related to the order in which processes are loaded (in my case, the order of domU startup). The 2.6.24-18 kernel seems a little more stable, but this could be a coincidence.
So far I've seen crashes during a simple ext2 format, in various processes in different domUs, and in various processes in dom0. The offending process is somewhat sticky, which would normally lead me to suspect a memory/hardware issue, but I ruled that out above.
The latest incarnation is a clamd process that won't live longer than a few hours without crashing:
[19240.984220] BUG: soft lockup - CPU#0 stuck for 11s! [clamd:2976]
[19240.984230]
[19240.984233] Pid: 2976, comm: clamd Not tainted (2.6.24-18-xen #1)
[19240.984237] EIP: 0061:[<c0327677>] EFLAGS: 00000286 CPU: 0
[19240.984245] EIP is at _spin_lock+0x7/0x10
[19240.984248] EAX: c1c2898c EBX: 00000000 ECX: c1c28980 EDX: 00000d88
[19240.984251] ESI: 50425067 EDI: 00000001 EBP: c0477158 ESP: e7f49ef4
[19240.984254] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
[19240.984260] CR0: 8005003b CR2: b70b1000 CR3: 28400000 CR4: 00002660
[19240.984264] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[19240.984267] DR6: ffff0ff0 DR7: 00000400
[19240.984271] [<c01759d5>] mprotect_fixup+0x395/0x800
[19240.984284] [<c013bb90>] autoremove_wake_function+0x0/0x40
[19240.984293] [<c0175fcb>] sys_mprotect+0x18b/0x230
[19240.984299] [<c0105832>] syscall_call+0x7/0xb
[19240.984305] [<c0320000>] vcc_def_wakeup+0x30/0x60
[19240.984310] =======================
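Since the victim process changes between crashes, it may help to tally which processes show up in these reports over time. Below is a minimal Python sketch for that, not a tested tool: it assumes the report line always has the "BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]" shape seen in the trace above, and the names find_soft_lockups, count_by_process and SAMPLE_LOG are made up for illustration.

```python
import re
from collections import Counter

# Assumed format, taken from the report line in the trace above:
#   BUG: soft lockup - CPU#0 stuck for 11s! [clamd:2976]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup report found in the given log text."""
    events = []
    for line in log_text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


def count_by_process(events):
    """Count soft lockup reports per process name (comm)."""
    return Counter(event["comm"] for event in events)


# Hypothetical sample input: two lines copied from the trace above.
SAMPLE_LOG = (
    "[19240.984220] BUG: soft lockup - CPU#0 stuck for 11s! [clamd:2976]\n"
    "[19240.984233] Pid: 2976, comm: clamd Not tainted (2.6.24-18-xen #1)\n"
)

for comm, count in count_by_process(find_soft_lockups(SAMPLE_LOG)).most_common():
    print(comm, count)
```

To use it on a real system, the text passed to find_soft_lockups would be the saved output of dmesg or the contents of the kernel log file from dom0 and each domU, rather than the sample string.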
I'm currently running this particular domU with the 2.6.24-21-xen kernel for testing. I will report back if it crashes.
Here's the cpuinfo, which might be useful:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 1999.999
cache size : 6144 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu de tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc up arch_perfmon pebs bts pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4004.96
clflush size : 64