Comment 31 for bug 177518

Revision history for this message
In , mcisely (isely) wrote :

Created an attachment (id=15005)
standard signed-off kernel patch to fix intel driver lockup on buffer swap

Running an opengl application which attempts to synchronize buffer swapping to the vertical sync event eventually results in a hard lockup of the Linux kernel. When this happens even trying the magic sysrq key won't show any signs of life. This problem takes from 5 to 90 minutes of running before the lockup happens, and you need to be running dual head cloned mode for it to happen at all. The root cause is not related to dual head, but the less predictable interrupt rate when running this way I am guessing really exacerbates the underlying race condition, raising the probability of failure.

A standard signed-off kernel patch which fixes the problem is attached to this bug. I am not a git expert so I apologize that I don't have a git changeset built against the drm git repository. (I did however check that this problem also still appears to be present in the git repository.) It should be trivial to apply this patch in any case. This patch was built against the 2.6.24.3 vanilla kernel source tree.

Following are the comments from the attached patch:

<blockquote>
The i915_vblank_swap() function schedules an automatic buffer swap
upon receipt of the vertical sync interrupt. Such an operation is
lengthy so it can't happen normal interrupt context, so the DRM
implements this by scheduling the work in a kernel softirq-scheduled
tasklet. In order for the buffer swap to work safely, the DRM's
central lock must be taken, via a call to drm_lock_take() in drm_irq.c
within the function drm_locked_tasklet_func(). The lock-taking logic
uses a non-interrupt-blocking spinlock to implement the manipulations
needed to take the lock. Note that a non-interrupt-blocking spinlock
blocks kernel pre-emption and atomically sets a flag, but interrupts
are still enabled. This semantic is safe if ALL attempts to use the
spinlock only happen from process context. However this buffer swap
happens from softirq context which is really a form of interrupt
context that WILL pre-empt execution even when normal thread
pre-emption is otherwise disabled. Thus we have an unsafe situation,
in that drm_locked_tasklet_func() can block on a spinlock already
taken by a thread in process context which will never get scheduled
again because of the blocked softirq tasklet. This wedges the kernel
hard.

It's a very small race condition, but a race nonetheless with a very
undesirable potential outcome. To trigger this bug, run a dual-head
cloned mode configuration which uses the i915 drm, then execute an
opengl application which synchronizes buffer swaps against the
vertical sync interrupt. In my testing, a lockup always results after
running anywhere from 5 minutes to an hour and a half. I believe
dual-head is needed to really trigger the problem because then the
vertical sync interrupt handling is no longer predictable (due to
being interrupt-sourced from two different heads running at different
speeds). This raises the probability of a the tasklet trying to run
while the userspace DRI is doing things to the GPU (and manipulating
the DRM lock).

The fix is to change the relevant spinlock semantics to be the
interrupt-blocking form. Thus all calls for this spinlock change from
spin_lock() to spin_lock_irqsave() and similarly calls to
spin_unlock() change to spin_unlock_irqrestore(). After this change I
am no longer able to trigger the lockup; the longest test run so far
was 6 hours (test stopped after that point).

Signed-off-by: Mike Isely <email address hidden>
</blockquote>