After reproducing this in a text console, it appears that the "freeze" is actually a kernel panic, triggered when we hit a BUG in a bh (bottom-half) context at kernel/timer.c, line 411. That line is in cascade, an inline function called from __run_timers, which is itself inlined into run_timer_softirq, which runs in a bh context.
Here's the trace I was working from. The Code: bytes indicate that it's line 411 of some file (0f 0b is the ud2 opcode that BUG() emits on x86, and the 9b 01 that follows is 0x019b = 411, little-endian):
Code: ... <0f> 0b 9b 01 ee 03 30 c0 ...
Because it panics, I couldn't conclusively determine the file from the panic output, but given the stack trace I think it's clearly kernel/timer.c.
[<c01259a2>] run_timer_softirq+0x132/0x1d0
[<c01214ff>] __do_softirq+0x4f/0xb0
[<c0121595>] do_softirq+0x35/0x40
[<c0121645>] irq_exit+0x35/0x40
[<c01059cf>] do_IRQ+0x1f/0x30
[<c010410a>] common_interrupt+0x1a/0x20
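For anyone who wants to read the line number out of the Code: bytes mechanically, here is a minimal sketch. It assumes the 32-bit x86 BUG() layout used with verbose BUG reporting (ud2, then a 16-bit line number, then a 32-bit pointer to the file name string); the function name decode_bug_bytes is my own, not anything from the kernel.

```python
import struct

def decode_bug_bytes(code_hex):
    """Decode the bytes at a faulting BUG() site on 32-bit x86.

    Assumed layout (little-endian): ud2 opcode (0f 0b), 16-bit line
    number, 32-bit pointer to the file name string.
    """
    # The oops marks the faulting byte with <..>; drop the markers.
    raw = bytes.fromhex(code_hex.replace("<", "").replace(">", ""))
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("bytes do not start with ud2 (0f 0b)")
    line, file_ptr = struct.unpack_from("<HI", raw, 2)
    return line, file_ptr

line, file_ptr = decode_bug_bytes("<0f> 0b 9b 01 ee 03 30 c0")
print(line, hex(file_ptr))  # 411 0xc03003ee
```

The second value is only a pointer to the file name string inside the kernel image, which is why the bytes give the line number but not the file name itself.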
I'm doing some testing now to see how to fix this. Since this is a "double fault" of sorts, I'm curious whether we could just disable BUG() in the kernel (by recompiling). I don't know whether the system would recover or not: it realizes something's wrong, but we panic because the BUG is reported in bh context, not necessarily because the problem is unrecoverable.
I just did a build with big-kernel-lock pre-emption (CONFIG_PREEMPT_BKL) turned off, and I didn't see any trouble. So this may be a pre-emption issue. There is a pretty big configuration delta between the kernel that had the problem and the one that didn't, so I'm still trying to figure out which change fixed it.
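To narrow down the configuration delta, something like the following sketch could list the options that differ between the two kernels. It assumes the usual .config text format ("CONFIG_FOO=y" and "# CONFIG_FOO is not set"); the function names and the two sample configs are made up for illustration and are not my real configs.

```python
def parse_config(text):
    """Return {option: value} for a kernel .config; unset options map to 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options

def config_delta(bad_text, good_text):
    """Map each differing option to its (bad, good) pair of values."""
    bad, good = parse_config(bad_text), parse_config(good_text)
    return {
        name: (bad.get(name, "n"), good.get(name, "n"))
        for name in sorted(set(bad) | set(good))
        if bad.get(name, "n") != good.get(name, "n")
    }

# Hypothetical sample inputs, not the real configs from this report.
bad_cfg = "CONFIG_PREEMPT=y\nCONFIG_PREEMPT_BKL=y\nCONFIG_HZ=250\n"
good_cfg = "CONFIG_PREEMPT=y\n# CONFIG_PREEMPT_BKL is not set\nCONFIG_HZ=250\n"
print(config_delta(bad_cfg, good_cfg))
```

With the real files, the differing options could then be flipped one at a time (or bisected) to see which one actually makes the panic go away.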
Still looking...