Hard freeze after upgrading to Lucid

Bug #545039 reported by Guilherme Salgado
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

After leaving my box unattended for a few minutes, it froze and I couldn't restart X or switch to a VT. I was able to ssh in from another box, but restarting gdm wasn't enough, so I had to 'sudo reboot' it. This is what I found in /var/log/messages afterwards.

Mar 22 20:27:07 gorducho kernel: [ 9840.530153] Call Trace:
Mar 22 20:27:07 gorducho kernel: [ 9840.530167] [<ffffffff815558b7>] __mutex_lock_slowpath+0xe7/0x170
Mar 22 20:27:07 gorducho kernel: [ 9840.530178] [<ffffffff810568b0>] ? finish_task_switch+0x50/0xe0
Mar 22 20:27:07 gorducho kernel: [ 9840.530185] [<ffffffff815557ab>] mutex_lock+0x2b/0x50
Mar 22 20:27:07 gorducho kernel: [ 9840.530214] [<ffffffffa029936d>] i915_gem_retire_work_handler+0x3d/0xa0 [i915]
Mar 22 20:27:07 gorducho kernel: [ 9840.530233] [<ffffffffa0299330>] ? i915_gem_retire_work_handler+0x0/0xa0 [i915]
Mar 22 20:27:07 gorducho kernel: [ 9840.530242] [<ffffffff8107e9b7>] run_workqueue+0xc7/0x1a0
Mar 22 20:27:07 gorducho kernel: [ 9840.530249] [<ffffffff8107eb33>] worker_thread+0xa3/0x110
Mar 22 20:27:07 gorducho kernel: [ 9840.530256] [<ffffffff81083580>] ? autoremove_wake_function+0x0/0x40
Mar 22 20:27:07 gorducho kernel: [ 9840.530264] [<ffffffff8107ea90>] ? worker_thread+0x0/0x110
Mar 22 20:27:07 gorducho kernel: [ 9840.530269] [<ffffffff81083206>] kthread+0x96/0xa0
Mar 22 20:27:07 gorducho kernel: [ 9840.530277] [<ffffffff8101422a>] child_rip+0xa/0x20
Mar 22 20:27:07 gorducho kernel: [ 9840.530283] [<ffffffff81083170>] ? kthread+0x0/0xa0
Mar 22 20:27:07 gorducho kernel: [ 9840.530289] [<ffffffff81014220>] ? child_rip+0x0/0x20

That traceback actually seems to have happened a few times while I tried to recover: http://paste.ubuntu.com/399935/

And here's a more recent one: http://paste.ubuntu.com/409477/

ProblemType: Bug
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Architecture: amd64
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: STAC92xx Analog [STAC92xx Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: salgado 1637 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xefebc000 irq 21'
   Mixer name : 'SigmaTel STAC9200'
   Components : 'HDA:83847690,102801d4,00102201 HDA:14f12bfa,14f100c3,00090000'
   Controls : 12
   Simple ctrls : 7
Date: Tue Mar 23 09:51:32 2010
DistroRelease: Ubuntu 10.04
EcryptfsInUse: Yes
Frequency: This has only happened once.
HibernationDevice: RESUME=UUID=cbe0dc53-7772-4cc4-a86f-62aa268678f5
InstallationMedia: Ubuntu 9.10 "Karmic Koala" - Release amd64 (20091027)
MachineType: Dell Inc. Latitude D520
Package: linux-image-2.6.32-16-generic 2.6.32-16.25
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-16-generic root=UUID=6f67e650-b266-4687-bcd5-092f4e196505 ro quiet splash
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.utf8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.32-16.25-generic
Regression: Yes
RelatedPackageVersions: linux-firmware 1.32
Reproducible: No
SourcePackage: linux
TestedUpstream: No
Uname: Linux 2.6.32-16-generic x86_64
dmi.bios.date: 05/28/2007
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A06
dmi.board.name: 0NF743
dmi.board.vendor: Dell Inc.
dmi.chassis.type: 8
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrA06:bd05/28/2007:svnDellInc.:pnLatitudeD520:pvr:rvnDellInc.:rn0NF743:rvr:cvnDellInc.:ct8:cvr:
dmi.product.name: Latitude D520
dmi.sys.vendor: Dell Inc.

Guilherme Salgado (salgado) wrote :
collinp (collinp) wrote :

I've been having a similar problem for quite some time.

Changed in linux (Ubuntu):
status: New → Confirmed
Guilherme Salgado (salgado) wrote :

Today it froze again, and when I logged in from another box to check the logs, that traceback was not there. It was only after I hit Ctrl+Alt+F1 (on the frozen box, obviously) that I saw the traceback in the logs.

Guilherme Salgado (salgado) wrote :

Apr 4 19:35:48 gorducho rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="908" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
Apr 4 20:24:51 gorducho kernel: [ 4320.603907] i915 D 00000000ffffffff 0 749 2 0x00000000
Apr 4 20:24:51 gorducho kernel: [ 4320.603913] ffff8800afad7d20 0000000000000046 0000000000015b80 0000000000015b80
Apr 4 20:24:51 gorducho kernel: [ 4320.603919] ffff8800b0505f80 ffff8800afad7fd8 0000000000015b80 ffff8800b0505bc0
Apr 4 20:24:51 gorducho kernel: [ 4320.603924] 0000000000015b80 ffff8800afad7fd8 0000000000015b80 ffff8800b0505f80
Apr 4 20:24:51 gorducho kernel: [ 4320.603929] Call Trace:
Apr 4 20:24:51 gorducho kernel: [ 4320.603941] [<ffffffff8153f3a7>] __mutex_lock_slowpath+0xe7/0x170
Apr 4 20:24:51 gorducho kernel: [ 4320.603946] [<ffffffff8153f29b>] mutex_lock+0x2b/0x50
Apr 4 20:24:51 gorducho kernel: [ 4320.603974] [<ffffffffa0246530>] intel_idle_update+0x60/0x1d0 [i915]
Apr 4 20:24:51 gorducho kernel: [ 4320.603988] [<ffffffffa02464d0>] ? intel_idle_update+0x0/0x1d0 [i915]
Apr 4 20:24:51 gorducho kernel: [ 4320.603994] [<ffffffff81080757>] run_workqueue+0xc7/0x1a0
Apr 4 20:24:51 gorducho kernel: [ 4320.603998] [<ffffffff810808d3>] worker_thread+0xa3/0x110
Apr 4 20:24:51 gorducho kernel: [ 4320.604003] [<ffffffff81085300>] ? autoremove_wake_function+0x0/0x40
Apr 4 20:24:51 gorducho kernel: [ 4320.604007] [<ffffffff81080830>] ? worker_thread+0x0/0x110
Apr 4 20:24:51 gorducho kernel: [ 4320.604011] [<ffffffff81084f86>] kthread+0x96/0xa0
Apr 4 20:24:51 gorducho kernel: [ 4320.604016] [<ffffffff810141ea>] child_rip+0xa/0x20
Apr 4 20:24:51 gorducho kernel: [ 4320.604020] [<ffffffff81084ef0>] ? kthread+0x0/0xa0
Apr 4 20:24:51 gorducho kernel: [ 4320.604023] [<ffffffff810141e0>] ? child_rip+0x0/0x20
Apr 4 20:24:51 gorducho kernel: [ 4320.604044] Xorg D 0000000000000000 0 1115 1053 0x00400004
Apr 4 20:24:51 gorducho kernel: [ 4320.604049] ffff8800a85ffca8 0000000000000086 0000000000015b80 0000000000015b80
Apr 4 20:24:51 gorducho kernel: [ 4320.604054] ffff8800b8bf5f80 ffff8800a85fffd8 0000000000015b80 ffff8800b8bf5bc0
Apr 4 20:24:51 gorducho kernel: [ 4320.604058] 0000000000015b80 ffff8800a85fffd8 0000000000015b80 ffff8800b8bf5f80
Apr 4 20:24:51 gorducho kernel: [ 4320.604063] Call Trace:
Apr 4 20:24:51 gorducho kernel: [ 4320.604068] [<ffffffff8153f3a7>] __mutex_lock_slowpath+0xe7/0x170
Apr 4 20:24:51 gorducho kernel: [ 4320.604072] [<ffffffff8153f29b>] mutex_lock+0x2b/0x50
Apr 4 20:24:51 gorducho kernel: [ 4320.604084] [<ffffffffa023429b>] i915_gem_throttle_ioctl+0x3b/0x90 [i915]
Apr 4 20:24:51 gorducho kernel: [ 4320.604101] [<ffffffffa01b9e2a>] drm_ioctl+0x27a/0x480 [drm]
Apr 4 20:24:51 gorducho kernel: [ 4320.604113] [<ffffffffa0234260>] ? i915_gem_throttle_ioctl+0x0/0x90 [i915]
Apr 4 20:24:51 gorducho kernel: [ 4320.604120] [<ffffffff810397a9>] ? default_spin_lock_flags+0x9/0x10
Apr 4 20:24:51 gorducho kernel: [ 4320.604124] [<ffffffff815406ff>] ? _spin_lock_irqsave+0x2f/0x40
Apr 4 20:24:51 gorducho kernel: [ 4320.604129] [<ffffffff81152942>] vfs_ioctl+0x22/0xa0
Apr 4 20:24:51 gorduch...
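
For readers skimming a log like the one above, a rough sketch of pulling out just the blocked tasks (name and PID) may help. It assumes the syslog line layout shown in this report (a `kernel: [timestamp]` prefix followed by task name, state, and PID in the fifth column); `blocked_tasks` is a hypothetical helper name, not something from the original thread.

```shell
# Sketch: list tasks reported in D (uninterruptible) state in a hung-task dump.
# Assumes lines shaped like the ones in this report:
#   "... kernel: [ 4320.603907] i915 D 00000000ffffffff 0 749 2 0x00000000"
# blocked_tasks is a hypothetical helper; it reads the log on stdin.
blocked_tasks() {
    awk '{
        # Drop the syslog prefix and the kernel timestamp.
        sub(/^.*kernel: \[[ 0-9.]*\] /, "")
        # Task header lines have the state in column 2 and the PID in column 5.
        if ($2 == "D") print $1, $5
    }'
}
```

It could be run as `blocked_tasks < /var/log/messages`; other kernel versions may print the task header in a different column layout.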

description: updated
Guilherme Salgado (salgado) wrote :

Output of dmesg while system was frozen. I also saw that the X process was doing uninterruptible IO at that time.
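
One way to check for processes in uninterruptible sleep over ssh is sketched below. It is an illustration only, not the command used in this report; `d_state_only` is a hypothetical helper that filters `ps` output on the STAT column.

```shell
# Sketch: keep the header plus any process whose state starts with "D"
# (uninterruptible sleep). d_state_only is a hypothetical helper name;
# it expects `ps -eo pid,stat,comm` style input on stdin.
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

Typical use would be `ps -eo pid,stat,comm | d_state_only`, run from the remote session while the desktop is frozen.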

Guilherme Salgado (salgado) wrote :
Chase Douglas (chasedouglas) wrote :

@Guilherme:

I've uploaded a test kernel to http://people.canonical.com/545039/. Please install it and test it out. When the bug occurs, hopefully we will see some output in dmesg that will help us diagnose the mutex issue. It isn't guaranteed that any useful data will be output though, so this is just a first step.

Thanks

Guilherme Salgado (salgado) wrote :

As one should expect, now that I'm running the debug kernel, the laptop's been running happily for more than 20h without a single crash. I'll leave it running until it crashes again, though.

Changed in linux (Ubuntu):
importance: Undecided → Medium
Guilherme Salgado (salgado) wrote :

After restarting with the latest kernel (-21), it crashed again but I still haven't managed to crash it with the debug kernel. I should probably use apt to pin my kernel to this debug version so that it never crashes again. ;)

Chase Douglas (chasedouglas) wrote :

The debug kernel may slow things down just enough so that the issue doesn't occur. That makes this a very difficult bug to figure out. I'll need to think some more on how to debug this.

Chase Douglas (chasedouglas) wrote :

One useful thing to try would be to trace the graphics driver stack to determine what is going on. Please run the following on the latest Ubuntu kernel:

$ sudo sh -c "echo ':mod:i915 :mod:drm :mod:drm_kms_helper check_hung_uninterruptible:traceoff:' > /sys/kernel/debug/tracing/set_ftrace_filter"
$ sudo sh -c "echo 1 > /sys/kernel/debug/tracing/tracing_enabled"
$ sudo sh -c "echo 1 > /sys/kernel/debug/tracing/tracing_on"
$ sudo sh -c "echo function_graph > /sys/kernel/debug/tracing/current_tracer"

Wait for the issue to occur, then

$ bzip2 -c /sys/kernel/debug/tracing/trace > ftrace.bz2

Finally, attach the trace file to this bug.

Thanks
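
The tracing setup requested above can also be wrapped in a small function so the four writes are applied together. This is a sketch under assumptions: `setup_i915_trace` is a hypothetical name, the tracing directory is taken as an argument (defaulting to the debugfs path used in the comment), and `tracing_enabled` is only written if the file exists, since it may not be present on every kernel.

```shell
# Sketch: apply the ftrace settings from the comment above in one step.
# setup_i915_trace is a hypothetical helper; $1 is the tracing directory
# (defaults to the debugfs location used in this report).
setup_i915_trace() {
    dir="${1:-/sys/kernel/debug/tracing}"
    if [ ! -d "$dir" ]; then
        echo "no tracing directory at $dir (is debugfs mounted?)" >&2
        return 1
    fi
    # Filter string copied verbatim from the comment above.
    echo ':mod:i915 :mod:drm :mod:drm_kms_helper check_hung_uninterruptible:traceoff:' \
        > "$dir/set_ftrace_filter" || return 1
    # Only touch tracing_enabled when the kernel provides it.
    if [ -e "$dir/tracing_enabled" ]; then
        echo 1 > "$dir/tracing_enabled" || return 1
    fi
    echo 1 > "$dir/tracing_on" || return 1
    echo function_graph > "$dir/current_tracer" || return 1
}
```

The function needs to run as root to write under debugfs. After the freeze, the trace is collected with the bzip2 command given in the comment.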

Guilherme Salgado (salgado) wrote :

After waiting almost a week for the crash to happen on the -21 kernel, I remembered the last time it happened was after a reboot when the -21 kernel had just been installed, so I've 'apt-get --reinstall'ed that kernel, restarted and then didn't have to wait more than 10 hours for the crash to happen again. Here's the zipped /sys/kernel/debug/tracing/trace

Chase Douglas (chasedouglas) wrote :

@Guilherme:

This appears to be the stack trace where the issue arises:

 1)               |  drm_ioctl() {
 1)   0.421 us    |    drm_ut_debug_printk();
 1)               |    i915_gem_set_domain_ioctl() {
 1)   0.624 us    |      drm_gem_object_lookup();
 1)   0.504 us    |      intel_mark_busy();
 1)               |      i915_gem_object_set_to_gtt_domain() {
 1)   0.456 us    |        i915_gem_object_flush_gpu_write_domain();
 1)               |        i915_gem_object_wait_rendering() {
 1)               |          i915_do_wait_request() {
 1)   2.018 us    |            i915_user_irq_get();

At this point it would be helpful if we could confirm this as being present in the latest mainline kernel or not. If it is, we will need to open a bug against the i915 driver to get this resolved as it's beyond my ability to figure out what is going wrong here. The issue seems to be a real software-hardware interaction issue and not a pure software bug.

Please download and test the latest mainline kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/.

Thanks

Chase Douglas (chasedouglas) wrote :

Uggg, launchpad reformats that to be really ugly. You can find this in the ftrace.bz2 file by searching for the last instance of i915_user_irq_get.
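
A possible way to locate that last instance without opening the whole trace in an editor is sketched here. `last_call` is a hypothetical helper; it reads the decompressed trace on stdin and prints the line number and text of the final match.

```shell
# Sketch: print the last line (with its line number) that mentions a function.
# last_call is a hypothetical helper; $1 is the function name to search for,
# matched as a fixed string.
last_call() {
    grep -n -F "$1" | tail -n 1
}
```

For the attachment in this bug it could be used as `bzcat ftrace.bz2 | last_call i915_user_irq_get`; the reported line number can then be used to view the surrounding call graph.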

Guilherme Salgado (salgado) wrote :

Chase, are the packages from kernel.ubuntu.com/~kernel-ppa/mainline/daily/2010-04-28-lucid/ the ones I should install or do you mean something else by latest mainline?

Chase Douglas (chasedouglas) wrote :

@Guilherme:

I meant the 2.6.34-rc5-lucid or later image. Sorry for the confusion!

Guilherme Salgado (salgado) wrote :

I've been running 2.6.34-020634rc5-generic since the 29th and it hasn't crashed so far.

As an experiment, this morning I rebooted using 2.6.32-21-generic and at the end of the day it crashed.

I think it's fair to assume 2.6.34-020634rc5 doesn't have this issue, but I'll keep running it as anything else means daily crashes to me.

Chase Douglas (chasedouglas) wrote :

@Guilherme:

I just want you to know that I probably won't be able to get back to this bug for a bit due to the Ubuntu Developer Summit next week. However, this is still on my list of things to look at.

I hope the mainline kernel can tide you over for the time being.

Thanks

tags: added: kernel-acpi
tags: added: kernel-needs-review
tags: removed: kernel-acpi kernel-needs-review
Guilherme Salgado (salgado) wrote :

It's just happened again, this time when running 2.6.34-020634rc5-generic.

AFAICT (by looking at the logs), this is the first time it happened when running this kernel.

May 26 12:44:46 gorducho kernel: [190200.831234] i915 D ffff880001e15740 0 712 2 0x00000000
May 26 12:44:46 gorducho kernel: [190200.831243] ffff8800b9943d40 0000000000000046 0000000000000001 ffff8800b9943fd8
May 26 12:44:46 gorducho kernel: [190200.831253] ffff8800ba3adc40 0000000000015740 0000000000015740 ffff8800b9943fd8
May 26 12:44:46 gorducho kernel: [190200.831262] 0000000000015740 ffff8800b9943fd8 0000000000015740 ffff8800ba3adc40
May 26 12:44:46 gorducho kernel: [190200.831271] Call Trace:
May 26 12:44:46 gorducho kernel: [190200.831314] [<ffffffffa02240f0>] ? intel_idle_update+0x0/0x100 [i915]
May 26 12:44:46 gorducho kernel: [190200.831327] [<ffffffff8153df3b>] __mutex_lock_slowpath+0xeb/0x180
May 26 12:44:46 gorducho kernel: [190200.831337] [<ffffffff8100985b>] ? __switch_to+0xbb/0x2e0
May 26 12:44:46 gorducho kernel: [190200.831346] [<ffffffff8105118e>] ? put_prev_entity+0x2e/0x80
May 26 12:44:46 gorducho kernel: [190200.831372] [<ffffffffa02240f0>] ? intel_idle_update+0x0/0x100 [i915]
May 26 12:44:46 gorducho kernel: [190200.831380] [<ffffffff8153db5b>] mutex_lock+0x2b/0x50
May 26 12:44:46 gorducho kernel: [190200.831404] [<ffffffffa0224130>] intel_idle_update+0x40/0x100 [i915]
May 26 12:44:46 gorducho kernel: [190200.831413] [<ffffffff8107a10c>] run_workqueue+0xbc/0x190
May 26 12:44:46 gorducho kernel: [190200.831420] [<ffffffff8107a65b>] worker_thread+0x9b/0x100
May 26 12:44:46 gorducho kernel: [190200.831428] [<ffffffff8107edc0>] ? autoremove_wake_function+0x0/0x40
May 26 12:44:46 gorducho kernel: [190200.831435] [<ffffffff8107a5c0>] ? worker_thread+0x0/0x100
May 26 12:44:46 gorducho kernel: [190200.831442] [<ffffffff8107e9e6>] kthread+0x96/0xa0
May 26 12:44:46 gorducho kernel: [190200.831449] [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
May 26 12:44:46 gorducho kernel: [190200.831456] [<ffffffff8107e950>] ? kthread+0x0/0xa0
May 26 12:44:46 gorducho kernel: [190200.831463] [<ffffffff8100be60>] ? kernel_thread_helper+0x0/0x10
May 26 12:44:46 gorducho kernel: [190200.831491] Xorg D ffff880001f15740 0 1047 1027 0x00400004
May 26 12:44:46 gorducho kernel: [190200.831500] ffff8800b94c3cc8 0000000000000082 ffff8800b94c3c78 ffff8800b94c3fd8
May 26 12:44:46 gorducho kernel: [190200.831509] ffff8800b944ae20 0000000000015740 0000000000015740 ffff8800b94c3fd8
May 26 12:44:46 gorducho kernel: [190200.831517] 0000000000015740 ffff8800b94c3fd8 0000000000015740 ffff8800b944ae20
May 26 12:44:46 gorducho kernel: [190200.831526] Call Trace:
May 26 12:44:46 gorducho kernel: [190200.831534] [<ffffffff8153df3b>] __mutex_lock_slowpath+0xeb/0x180
May 26 12:44:46 gorducho kernel: [190200.831542] [<ffffffff8153db5b>] mutex_lock+0x2b/0x50
May 26 12:44:46 gorducho kernel: [190200.831565] [<ffffffffa0216f8f>] i915_gem_ring_throttle+0x3f/0x80 [i915]
May 26 12:44:46 gorducho kernel: [190200.831588] [<ffffffffa0216fe1>] i915_gem_throttle_ioctl+0x11/0x20 [i915]
May 26 12:44:46 gorducho kernel: [190200.831613] [<ffffffffa01...


Chase Douglas (chasedouglas) wrote :

I would suggest trying the released 2.6.34 kernel, and also trying the 2.6.35-rc1 kernel when it comes out (probably in a week or two). Outside of that, this looks like a bug that will need to be filed upstream at bugzilla.kernel.org. Note that at kernel.org they have a higher level of expectation of bug reporters, so if you have any questions about what they are asking you to do, please feel free to email me or <email address hidden>. If you do file a bug, please mention it in a comment here so we can link this bug to the upstream report.

Thanks!

tags: added: kernel-graphics kernel-needs-review
Steve Conklin (sconklin)
tags: added: kernel-reviewed
removed: kernel-needs-review
tags: removed: regression-potential
penalvch (penalvch) wrote :

Guilherme Salgado, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested and remove the tag:
needs-upstream-testing

This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the text:
needs-upstream-testing

If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested.

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER

Please let us know your results. Thank you for your understanding.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired