Hardy kernel causes overheating

Bug #223081 reported by Janne Moren
Affects         Status         Importance   Assigned to   Milestone
linux (Ubuntu)  Fix Released   Medium       Unassigned

Bug Description

I ran Gutsy with no problems on my laptop (Panasonic Let's Note R6). After installing Hardy, the machine would quickly overheat and lock up whenever I did anything that taxed the CPUs.

In an attempt to find the cause (and make the machine usable) I booted with the last Gutsy kernel (2.6.22) left from before the upgrade. The machine works just fine with that change.

Pasting redacted, detailed machine information below (I would attach the file instead, but that crashes Firefox 3 - I'm reporting that as another bug):

    description: Notebook
    product: CF-R6AWBAJP
    vendor: Matsushita Electric Industrial Co.,Ltd.
    version: 002
    serial: ----
    width: 32 bits
    capabilities: smbios-2.4 dmi-2.4 smp-1.4 smp
    configuration: administrator_password=disabled boot=normal chassis=notebook cpus=2 power-on_password=disabled uuid=----
  *-core
       description: Motherboard
       product: CFR6-2
       vendor: Matsushita Electric Industrial Co.,Ltd.
       physical id: 0
       version: 001
       serial: None
     *-firmware
          description: BIOS
          vendor: Phoenix Technologies Ltd.
          physical id: 0
          version: V2.00L10 (04/09/2007)
          size: 123KiB
          capacity: 960KiB
          capabilities: pci pcmcia pnp upgrade shadowing escd cdboot bootselect edd int13floppy720 int5printscreen int9keyboard int10video acpi usb biosbootspecification netboot
     *-cpu:0
          description: CPU
          product: Intel(R) Core(TM)2 CPU U7500 @ 1.06GHz
          vendor: Intel Corp.
          physical id: 4
          bus info: cpu@0
          version: 6.15.2
          serial: 0000-06F2-0000-0000-0000-0000
          slot: IC1
          size: 1067MHz
          capacity: 1067MHz
          width: 64 bits
          clock: 133MHz
          capabilities: boot fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx x86-64 constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm cpufreq
          configuration: id=0
     *-pci
          description: Host bridge
          product: Mobile 945GM/PM/GMS, 943/940GML and 945GT Express Memory Controller Hub
          vendor: Intel Corporation
          physical id: 100
          bus info: pci@0000:00:00.0
          version: 03
          width: 32 bits
          clock: 33MHz
          configuration: driver=agpgart-intel module=intel_agp
        *-display:0 UNCLAIMED
             description: VGA compatible controller
             product: Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller
             vendor: Intel Corporation
             physical id: 2
             bus info: pci@0000:00:02.0
             version: 03
             width: 32 bits
             clock: 33MHz
             capabilities: msi pm vga_controller bus_master cap_list
             configuration: latency=0
        *-display:1 UNCLAIMED
             description: Display controller
             product: Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller
             vendor: Intel Corporation
             physical id: 2.1
             bus info: pci@0000:00:02.1
             version: 03
             width: 32 bits
             clock: 33MHz
             capabilities: pm bus_master cap_list
             configuration: latency=0
        *-multimedia
             description: Audio device
             product: 82801G (ICH7 Family) High Definition Audio Controller
             vendor: Intel Corporation
             physical id: 1b
             bus info: pci@0000:00:1b.0
             version: 02
             width: 64 bits
             clock: 33MHz
             capabilities: pm msi pciexpress bus_master cap_list
             configuration: driver=HDA Intel latency=0 module=snd_hda_intel
        *-pci:0
             description: PCI bridge
             product: 82801G (ICH7 Family) PCI Express Port 1
             vendor: Intel Corporation
             physical id: 1c
             bus info: pci@0000:00:1c.0
             version: 02
             width: 32 bits
             clock: 33MHz
             capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
             configuration: driver=pcieport-driver
        *-pci:1
             description: PCI bridge
             product: 82801G (ICH7 Family) PCI Express Port 3
             vendor: Intel Corporation
             physical id: 1c.2
             bus info: pci@0000:00:1c.2
             version: 02
             width: 32 bits
             clock: 33MHz
             capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
             configuration: driver=pcieport-driver
           *-network
                description: Wireless interface
                product: PRO/Wireless 3945ABG Network Connection
                vendor: Intel Corporation
                physical id: 0
                bus info: pci@0000:03:00.0
                logical name: eth1
                version: 02
                serial: 00:1b:77:a1:ab:c8
                width: 32 bits
                clock: 33MHz
                capabilities: pm msi pciexpress bus_master cap_list ethernet physical wireless
                configuration: broadcast=yes driver=ipw3945 driverversion=1.2.2mp firmware=14.2 1:0 () ip=192.168.1.2 latency=0 link=yes module=ipw3945 multicast=yes wireless=IEEE 802.11g
        *-pci:2
             description: PCI bridge
             product: 82801 Mobile PCI Bridge
             vendor: Intel Corporation
             physical id: 1e
             bus info: pci@0000:00:1e.0
             version: e2
             width: 32 bits
             clock: 33MHz
             capabilities: pci subtractive_decode bus_master cap_list
           *-network
                description: Ethernet interface
                product: RTL-8139/8139C/8139C+
                vendor: Realtek Semiconductor Co., Ltd.
                physical id: 1
                bus info: pci@0000:04:01.0
                logical name: eth0
                version: 10
                serial: -----
                size: 10MB/s
                capacity: 100MB/s
                width: 32 bits
                clock: 33MHz
                capabilities: pm bus_master cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd autonegotiation
                configuration: autonegotiation=on broadcast=yes driver=8139too driverversion=0.9.28 duplex=half latency=32 link=no maxlatency=64 mingnt=32 module=8139too multicast=yes port=MII speed=10MB/s
           *-pcmcia
                description: CardBus bridge
                product: RL5c476 II
                vendor: Ricoh Co Ltd
                physical id: 5
                bus info: pci@0000:04:05.0
                version: 8d
                width: 64 bits
                clock: 33MHz
                capabilities: pcmcia bus_master cap_list
                configuration: driver=yenta_cardbus latency=176 maxlatency=5 mingnt=128 module=yenta_socket
                resources: iomemory:b00805040-b0080503f
           *-system
                description: SD Host controller
                product: R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter
                vendor: Ricoh Co Ltd
                physical id: 5.1
                bus info: pci@0000:04:05.1
                version: 13
                width: 32 bits
                clock: 33MHz
                capabilities: pm bus_master cap_list
                configuration: driver=sdhci latency=32 module=sdhci
        *-isa
             description: ISA bridge
             product: 82801GBM (ICH7-M) LPC Interface Bridge
             vendor: Intel Corporation
             physical id: 1f
             bus info: pci@0000:00:1f.0
             version: 02
             width: 32 bits
             clock: 33MHz
             capabilities: isa bus_master cap_list
             configuration: latency=0
        *-ide
             description: IDE interface
             product: 82801G (ICH7 Family) IDE Controller
             vendor: Intel Corporation
             physical id: 1f.1
             bus info: pci@0000:00:1f.1
             version: 02
             width: 32 bits
             clock: 33MHz
             capabilities: ide bus_master
             configuration: driver=ata_piix latency=0 module=ata_piix
        *-storage
             description: SATA controller
             product: 82801GBM/GHM (ICH7 Family) SATA AHCI Controller
             vendor: Intel Corporation
             physical id: 1f.2
             bus info: pci@0000:00:1f.2
             logical name: scsi0
             version: 02
             width: 32 bits
             clock: 66MHz
             capabilities: storage msi pm ahci_1.0 bus_master cap_list emulated
             configuration: driver=ahci latency=0 module=ahci
           *-disk
                description: ATA Disk
                product: Hitachi HTS54161
                vendor: Hitachi
                physical id: 0.0.0
                bus info: scsi@0:0.0.0
                logical name: /dev/sda
                version: SB4O
                serial: SB24D7SJK77D2P
                size: 149GiB (160GB)
                capabilities: partitioned partitioned:dos
                configuration: ansiversion=5 signature=262b0777
              *-volume:0
                   description: EXT3 volume
                   vendor: Linux
                   physical id: 1
                   bus info: scsi@0:0.0.0,1
                   logical name: /dev/sda1
                   logical name: /
                   logical name: /dev/.static/dev
                   version: 1.0
                   serial: ----
                   size: 143GiB
                   capacity: 143GiB
                   capabilities: primary bootable journaled extended_attributes large_files huge_files recover ext3 ext2 initialized
                   configuration: created=2007-09-01 02:24:45 filesystem=ext3 modified=2008-04-27 19:24:46 mount.fstype=ext3 mount.options=rw,data=ordered mounted=2008-04-27 10:06:07 state=mounted
              *-volume:1
                   description: Extended partition
                   physical id: 2
                   bus info: scsi@0:0.0.0,2
                   logical name: /dev/sda2
                   size: 5906MiB
                   capacity: 5906MiB
                   capabilities: primary extended partitioned partitioned:extended
                 *-logicalvolume
                      description: Linux swap / Solaris partition
                      physical id: 5
                      logical name: /dev/sda5
                      capacity: 5906MiB
                      capabilities: nofs
        *-serial UNCLAIMED
             description: SMBus
             product: 82801G (ICH7 Family) SMBus Controller
             vendor: Intel Corporation
             physical id: 1f.3
             bus info: pci@0000:00:1f.3
             version: 02
             width: 32 bits
             clock: 33MHz
             configuration: latency=0

Tags: cft-2.6.27
Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

Please include the following additional information, if you have not already done so (pay attention to lspci's additional options), as required by the Ubuntu Kernel Team:
1. Please include the output of the command "uname -a" in your next response. It should be one, long line of text which includes the exact kernel version you're running, as well as the CPU architecture.
2. Please run the command "dmesg > dmesg.log" after a fresh boot and attach the resulting file "dmesg.log" to this bug report.
3. Please run the command "sudo lspci -vvnn > lspci-vvnn.log" and attach the resulting file "lspci-vvnn.log" to this bug report.

For your reference, the full description of procedures for kernel-related bug reports is available at https://wiki.ubuntu.com/KernelTeamBugPolicies

Thanks in advance!
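
The three requests above can be gathered in one pass. The following is a minimal sketch, not an official Ubuntu Kernel Team tool: the function name `collect_diagnostics`, the output directory and the file name `uname.log` are assumptions made for illustration, while `dmesg.log` and `lspci-vvnn.log` match the names asked for above.

```python
import subprocess
from pathlib import Path

# Commands requested above, keyed by the log file each one should produce.
# "uname.log" is a hypothetical name; the other two come from the request.
DIAGNOSTIC_COMMANDS = {
    "uname.log": ["uname", "-a"],
    "dmesg.log": ["dmesg"],
    "lspci-vvnn.log": ["sudo", "lspci", "-vvnn"],
}


def collect_diagnostics(out_dir, commands=DIAGNOSTIC_COMMANDS):
    """Run each command, save its stdout under out_dir, return the files written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, check=False)
        except FileNotFoundError:
            # The tool is not installed; skip it instead of aborting the run.
            continue
        path = out_dir / name
        path.write_text(result.stdout)
        written.append(path)
    return written
```

Calling `collect_diagnostics("bug-223081")` right after a fresh boot would leave the files in a directory of that (arbitrary) name, ready to attach.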

Janne Moren (jan-moren-gmail) wrote :

A small addendum: The machine is running too hot with both the Hardy kernel (2.6.24) and the Gutsy one (2.6.22). The difference between the two for me seems simply to be that the 2.6.22 kernel is able to reduce performance sufficiently to keep the machine from overheating, while the 2.6.24 kernel allows it to get too hot. When the machine starts getting too hot (88 degrees), everything starts to slow down a lot (much more than just throttling CPU speed would account for). Scrolling starts to stutter, clicking a menu item can take seconds to respond, and so on. On 2.6.24, of course, the machine just freezes completely.

One partial cause for this turns out to be the wifi. Just having wifi enabled (via the physical switch on the machine) raises operating temperature from 72-80 degrees to 85-88 degrees (and stutter) on 2.6.22 and 85-freeze on 2.6.24. Note that I don't have to actually be using the wifi; just enabling the unit is enough. Specifically, the CPU activity does not increase in any way.

This did not happen under Gutsy. The low-level drivers are kernel-specific so I suspect they aren't to blame; I suspect perhaps NetworkManager is doing something strange.

But, though I can't confirm this, the machine seems to be running hotter than it did under Gutsy even with the 2.6.22 kernel and wifi turned off, so there may be other causes of this behaviour too.

Janne Moren (jan-moren-gmail) wrote :

More testing. Just booting the machine (into the Gutsy kernel, 2.6.22) and running stuff works just fine; the machine stays at around 40 degrees. Even when playing a movie full screen it never goes over 60 degrees or so. This is the normal behaviour from Gutsy.

If I run "powertop" I get decent values, with about 60-150 wakeups per second or so from an idle machine with no applications running (the i915 bug mentioned on the powertop website seems to cause most of them). The CPU spends most of its time in the C3 state.

Then I did "something" - basically only shut the lid of the machine, which is set to shut off the screen. Open it a while later, and idle temperature is 78 degrees and climbing. Still no applications and no network.

If I now run powertop I have 20k-40k interrupts per second (as in 20 000 - 40 000) or more, but there is no such interrupt hog on the top list. Here's a "screenshot" :

Cn Avg residency P-states (frequencies)
C0 (cpu running) (52,4%) 1067 Mhz 9,5%
C1 0,0ms ( 0,0%) 800 Mhz 90,5%
C2 0,0ms (47,6%)
C3 0,0ms ( 0,0%)

Wakeups-from-idle per second : 41669,5 interval: 10,0s
no ACPI power usage estimate available

Top causes for wakeups:
  34,3% ( 60,0) <interrupt> : uhci_hcd:usb4, i915@pci:0000:00:02.0
  23,0% ( 40,3) <interrupt> : uhci_hcd:usb3, ipw3945
  21,1% ( 36,9) <interrupt> : extra timer interrupt
   6,2% ( 10,8) S20powernowd : queue_delayed_work_on (delayed_work_timer_fn
   3,5% ( 6,1) firefox : futex_wait (hrtimer_wakeup)
   2,2% ( 3,9) <kärnmodul> : usb_hcd_poll_rh_status (rh_timer_func)
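
The mismatch described above, a very high wakeup rate with no matching entry in the list of causes, can be made explicit by comparing the headline figure with the sum of the listed rates. This is a minimal sketch that parses the pasted output only; the function name `summarize` is an assumption, and the parser handles just the comma-decimal layout seen here, not PowerTOP output in general. For this dump the listed causes add up to about 158 wakeups per second out of 41669,5.

```python
import re

# The relevant lines of the PowerTOP output pasted above (comma decimal separators).
SAMPLE = """\
Wakeups-from-idle per second : 41669,5 interval: 10,0s
Top causes for wakeups:
  34,3% ( 60,0) <interrupt> : uhci_hcd:usb4, i915@pci:0000:00:02.0
  23,0% ( 40,3) <interrupt> : uhci_hcd:usb3, ipw3945
  21,1% ( 36,9) <interrupt> : extra timer interrupt
   6,2% ( 10,8) S20powernowd : queue_delayed_work_on (delayed_work_timer_fn
   3,5% ( 6,1) firefox : futex_wait (hrtimer_wakeup)
   2,2% ( 3,9) <kärnmodul> : usb_hcd_poll_rh_status (rh_timer_func)
"""

TOTAL_RE = re.compile(r"Wakeups-from-idle per second\s*:\s*([\d,.]+)")
CAUSE_RE = re.compile(r"^\s*([\d,.]+)%\s*\(\s*([\d,.]+)\)\s*(.+?)\s*$", re.MULTILINE)


def _num(text):
    """Convert a number written with a comma decimal separator to float."""
    return float(text.replace(",", "."))


def summarize(report):
    """Return (total wakeups/s, wakeups/s attributed to listed causes, causes)."""
    total = _num(TOTAL_RE.search(report).group(1))
    causes = [(_num(rate), name) for _pct, rate, name in CAUSE_RE.findall(report)]
    attributed = sum(rate for rate, _name in causes)
    return total, attributed, causes


total, attributed, causes = summarize(SAMPLE)
print(f"{attributed:.1f} of {total:.1f} wakeups/s are attributed to a listed cause")
```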

I can reliably freeze the machine with the new kernel. Is there anything I can run, or any data I can collect while doing so, that could help you help me find the cause of this? At this time Hardy is frankly not usable for me (I cannot run any work-related software due to the overheating and resulting stutter), and I need to resolve this reasonably soon. Any way I can help, please just let me know.
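
One kind of data that could narrow this down is the per-IRQ counters in /proc/interrupts, sampled twice and compared. The sketch below is an illustration, not something requested in this report: the function names are assumptions, and the parser expects the usual layout (a header row of CPU columns, then one row per interrupt with per-CPU counts followed by a description).

```python
import time


def parse_interrupts(text):
    """Map each IRQ label to (count summed over CPUs, description)."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        per_cpu = []
        for field in fields[:n_cpus]:
            if not field.isdigit():
                break
            per_cpu.append(int(field))
        description = " ".join(fields[len(per_cpu):])
        counts[label.strip()] = (sum(per_cpu), description)
    return counts


def busiest(before, after, seconds):
    """Return (interrupts per second, label, description), highest rate first."""
    rates = []
    for label, (count, description) in after.items():
        previous = before.get(label, (0, ""))[0]
        rates.append(((count - previous) / seconds, label, description))
    return sorted(rates, reverse=True)


def sample_live(seconds=10, path="/proc/interrupts"):
    """Take two snapshots `seconds` apart and rank the IRQ lines by rate."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return busiest(before, after, seconds)
```

Running `sample_live()` while the wakeup count is high would show which interrupt line is actually firing, even when that line is shared by several devices.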

Janne Moren (jan-moren-gmail) wrote :

I rebooted with the power plug, wired network cable and USB mouse plugged in. Screen blanking and so on disabled. No applications running except for a gnome-terminal with a bash one-line loop to record the temperature every ten seconds, and occasionally powertop. Temperature stays normal, at about 50 degrees.
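
The original bash one-liner is not shown in this report. As a rough equivalent, here is a minimal sketch of such a logging loop. It assumes the temperature is exposed in millidegrees Celsius at `/sys/class/thermal/thermal_zone0/temp`; that path, the zone index, the function names and the log format are all assumptions and may not match this machine or the kernels discussed here.

```python
import os
import time
from datetime import datetime
from pathlib import Path

# Assumed location of the sensor; the zone index differs between machines.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")


def read_celsius(path=THERMAL_ZONE):
    """Read a millidegree value from a sysfs-style file and return degrees Celsius."""
    return int(Path(path).read_text().strip()) / 1000.0


def log_temperature(log_path, samples, interval=10.0, source=THERMAL_ZONE, sleep=time.sleep):
    """Append `samples` lines of '<ISO timestamp> <degrees>' to log_path."""
    with open(log_path, "a") as log:
        for i in range(samples):
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp} {read_celsius(source):.1f}\n")
            # Push each reading to disk so it survives a hard freeze.
            log.flush()
            os.fsync(log.fileno())
            if i + 1 < samples:
                sleep(interval)
```

Flushing after every line matters here, because the interesting readings are the ones taken just before the machine locks up.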

Powertop at this time:

--

     PowerTOP version 1.9 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 0,7%) 1067 Mhz 5,6%
C1 0,0ms ( 0,0%) 800 Mhz 94,4%
C2 25,6ms (76,1%)
C3 72,0ms (23,2%)

Wakeups-from-idle per second : 32,9 interval: 20,0s
no ACPI power usage estimate available

Top causes for wakeups:
  24,5% ( 8,7) <interrupt> : uhci_hcd:usb4, i915@pci:0000:00:02.0
  11,1% ( 4,0) S20powernowd : queue_delayed_work_on (delayed_work_timer_fn
  10,9% ( 3,9) <interrupt> : acpi
  10,2% ( 3,6) <kärnmodul> : usb_hcd_poll_rh_status (rh_timer_func)
   5,9% ( 2,1) scim-panel-gtk : schedule_timeout (process_timeout)
   5,6% ( 2,0) <kernel core> : queue_delayed_work_on (delayed_work_timer_fn)

--

I try to close the lid (blanking the screen); unplugging and plugging in the power cord; reenabling the screen blank timeout setting in gnome-power-manager (does not seem to "take" though). I plug in a USB memory stick (which does not get automounted, perhaps due to me using Hardy kernel), write some files to it, and unmount and remove it. Throughout this, temperature stays normal and 'powertop' output looks like above.

Then I unplug the USB mouse (which has been plugged in and working since boot). Suddenly the temperature jumps to about 80 degrees, and powertop output looks like this, constantly with over 20000 interrupts per second:

--

     PowerTOP version 1.9 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) (63,4%) 1067 Mhz 7,2%
C1 0,0ms ( 0,0%) 800 Mhz 92,8%
C2 0,0ms (21,9%)
C3 0,0ms (14,6%)

Wakeups-from-idle per second : 21900,2 interval: 10,0s
no ACPI power usage estimate available

Top causes for wakeups:
  74,3% ( 60,0) <interrupt> : uhci_hcd:usb4, i915@pci:0000:00:02.0
   4,7% ( 3,8) S20powernowd : queue_delayed_work_on (delayed_work_timer_fn
   4,1% ( 3,3) <interrupt> : uhci_hcd:usb2, ahci, eth0
   2,5% ( 2,0) <kernel core> : queue_delayed_work_on (delayed_work_timer_fn
   2,5% ( 2,0) scim-panel-gtk : schedule_timeout (process_timeout)
   1,2% ( 1,0) multiload-apple : schedule_timeout (process_timeout)

--

When I plug the mouse back in again, the interrupt count doubles:

--

     PowerTOP version 1.9 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) (52,1%) 1067 Mhz 2,1%
C1 0,0ms ( 0,0%) 800 Mhz 97,9%
C2 0,0ms (47,9%)
C3 0,0ms ( 0,0%)

Wakeups-from-idle per second : 44448,9 interval: 10,0s
no ACPI power usage estimate available

Top causes for wakeups:
  71,3%...


Janne Moren (jan-moren-gmail) wrote :

A few more notes:

* The enormous number of wakeups from the USB subsystem can still be triggered without unplugging a mouse; it is a lot less frequent though.

* The kernel still locks up.

* This kernel bug seems to be the same as #204996 (https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/204996)

* I'm not really sure it absolutely must be temperature related, though it does seem like it. The Gutsy kernel reacts by stuttering; Hardy reacts by freezing up. And that also explains why reboots have to wait a while - the machine is overheating and trips safety temperature levels again from the high system activity at startup.

petebass4life (pete-bass4life) wrote :

Would you be able to attach your syslog? It would be interesting to compare it with a bug of mine.

Janne Moren (jan-moren-gmail) wrote :

Here it is. What are you looking for?

Elod VALKAI (elod) wrote :

Janne, I've been having the same problem with closed lid & overheating, but it did not start with hardy. The laptop simply accumulates too much heat, but it does not crash. I think it's not starting the fan at all after closing the lid.

I've also updated my topic. 2.4.24-17-generic seems to freeze, but 2.4.25-1-generic from kernel-ppa does NOT (so far). Follow the instructions Leann gave ( https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/204996/comments/192 ). Could you try 2.4.25? It lacks restricted drivers.

Elod VALKAI (elod) wrote :

Kernels are 2.6.24-17-generic & 2.6.25-1-generic. It's 10 in the morning and I'm still asleep ;).

Janne Moren (jan-moren-gmail) wrote :

I'll take a look - thanks.

As I've described above, the kernel itself is not the cause of the overheating (it's the uhci driver and/or the i915 driver from what it seems now), but the kernels differ in how they are able to handle it.

For me it's not a fan issue as this computer is fanless. Previous to Hardy it never had any temperature issues; beginning with Hardy it's been running hot, though, without any visible system load to explain it (and anecdotally, other people in the lab have experienced the same with other laptop models too).

I'll try the 2.6.25 kernel; The overheating itself seems to be independent of the kernel version though, and is really the more serious issue in a way.

Janne Moren (jan-moren-gmail) wrote :

A bad news/good news kind of result.

The bad news: the 2.6.25-1 kernel also freezes like the 2.6.24 one, in my case still when the machine gets too hot, it seems.

The good news: the machine seems to run a bit cooler with this kernel, and I can't make the wakeups-per-second peak the way it so easily did with earlier kernels as I described above. The i915 driver no longer shows up at all as a cause of wakeups.

I can still provoke the kernel to lock up, but I have to really work at it - I could play through most of one round of "Desktop Tower Defense" (which really pushes this machine) before it locked up. With the 2.6.24 kernel I could just about start one game before the machine died. So for what it's worth, this kernel is a big improvement on both the Hardy and Gutsy kernels, heat wise, but the lockup issue is still there for me.

Elod VALKAI (elod) wrote :

I can confirm the drop in wakeups/sec (from 100 to about 50). The i915 driver disappeared completely from powertop. C3 states also work now, with 2.6.24 the cpu did not enter this state.

Thanks for the game tip :D. It's cute; I've had about an hour's fun with it, and a high score of 646 (epiphany crashed while trying to submit it). I'll keep playing it.

Temperatures are not an issue here, as the fan kicks in at 70 degrees Celsius. After driving the temperature down to 50, it stops.
In idle it sits stable at 60 degrees. Could you stress your cpu with cpuburn for example and see if you can crash it with 2.6.22?

Janne Moren (jan-moren-gmail) wrote :

It seems the heating problem is a red herring, a false lead, an unrelated distraction.

It seems that the 2.6.22 and 2.6.24 kernels running the machine hot is not connected to this issue. I ran the 2.6.25-1 kernel today, just doing normal office work, and constantly logging my CPU core temperature. It took until now, in the early afternoon, but the 2.6.25-1 kernel finally froze as well, while I was just doing some writing in gvim. Significantly, the temperature log file shows the machine steadily at about 65-68 degrees, well within the normal operating temperature (it's temperature-limited at 88 degrees).

So it seems the crashes are not connected to temperature after all. In hindsight it's kind of obvious: if the 22 and 24 kernels always make the computer run hot, it will be hot when it crashes whether the two are connected or not.
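
The check described here, looking at the last temperature readings before a freeze and comparing them with the 88-degree limit, can be sketched as follows. The actual log file is not part of this report, so the format (one reading per line, temperature as the last whitespace-separated field), the sample values and the function names are all hypothetical.

```python
# Made-up readings in the assumed "<timestamp> <degrees>" format.
SAMPLE_LOG = [
    "2008-05-15T13:40:00 66.0",
    "2008-05-15T13:40:10 65.0",
    "2008-05-15T13:40:20 68.0",
    "2008-05-15T13:40:30 67.0",
]


def tail_temperatures(log_lines, window=30):
    """Return the last `window` temperature readings found in the log lines."""
    temps = []
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            temps.append(float(parts[-1]))
        except ValueError:
            continue  # not a reading, e.g. a comment or a truncated final line
    return temps[-window:]


def overheated_before_freeze(log_lines, limit=88.0, window=30):
    """Return (limit reached?, hottest reading) for the final `window` readings."""
    recent = tail_temperatures(log_lines, window)
    if not recent:
        raise ValueError("no temperature readings found")
    hottest = max(recent)
    return hottest >= limit, hottest


print(overheated_before_freeze(SAMPLE_LOG))
```

A result well below the limit, as with readings in the 65-68 degree range, supports treating the freeze and the heat as separate problems.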

Elod, my highscore on medium is about 7800 points. I've been doing a lot of.. um... testing to find this bug - yes, that's it, testing. That's all. Not addicted or anything; nope, not me.

I'm hesitant to run cpuburn (the warning is rather scary), but I have used this system for development and testing of simulations, which are pegging both cores at 100% whenever I test. This I have done for the entire Gutsy six-month cycle without ever crashing. And as I wrote above, heating seems to be a false lead in any case.

Jean-Baptiste Lallement (jibel) wrote :

Thanks for the update.

Can you attach the file /var/log/kern.log from just after a freeze occurs? It may have captured some evidence.

Janne Moren (jan-moren-gmail) wrote :

Here's my kern.log, from boot to freeze. It's the 2.6.25 kernel so my wifi hardware doesn't actually work; those messages are unrelated to this issue.

I was thinking, are we sure this is a kernel crash? There is no core dump, and nothing in the logs. Could it possibly be the scheduler messing up somehow and never running any userland processes? That would look very much like this, with everything freezing solid but the screen intact and no trace of anything terminating.

Janne Moren (jan-moren-gmail) wrote :

I just tried with the 2.6.24-17 from Hardy-proposed. Still freezes.

Is there a way to see if the kernel actually crashes? Or if the kernel is running but something else is happening?

Janne Moren (jan-moren-gmail) wrote :

Time to eat a bit of crow.

I was running with the 2.6.24-17 kernel and the machine froze again. For various reasons I delayed in rebooting it, and got really surprised when, after a few minutes, the screen sloowly, jerkily dimmed. That made me keep on waiting, and after about five minutes the machine suddenly stumbled alive again: the mouse jerkily reproduced the buffered movements from when I fruitlessly had tried to move it; firefox scrolled around a bit for the same reason. Soon the jerkiness attenuated then disappeared and the machine is alive again.

So, it seems that whatever is happening to me is very much not an irreversible crash. Something freezes the machine - still no idea what - but it is something more like a starvation issue that eventually disappears again (or at least, disappeared once). It is worth noting that after the freeze, the usb or intel graphics subsystem is generating tens of thousands of events again; not sure if it is connected.

Janne Moren (jan-moren-gmail) wrote :

It looks rather like some deadlock or starvation issue. The UI and all userland applications freeze, but events like screen dimming still happen on schedule as I mentioned above. While the screen (jerkily, slowly) dims, I can actually move the mouse and do stuff; once the screen is dimmed it all freezes solid again. Events like turning off the screen when I close the lid, and turning it on when I open it still happen as well.

Occasionally the machine will actually recover if I wait long enough. Usually, however, it doesn't - or it perhaps would if I waited arbitrarily long, but I do have work I want to do.

Anything I can do to help find the cause more precisely?

Jean-Baptiste Lallement (jibel) wrote :

Setting to confirmed, because the required information is provided and the bug is reproducible with 2.6.24 and 2.6.25 and doesn't occur with 2.6.22.

Changed in linux:
assignee: nobody → ubuntu-kernel-team
status: New → Confirmed
Janne Moren (jan-moren-gmail) wrote :

Another data point: It crashed with a YouTube clip running and the last audio sample kept repeating over and over, indicating again that the system isn't completely dead as much as userland processes are blocked.

Once again, is there anything I can do to help find this? It happens on every single kernel after the last Gutsy one, so it must be something they all have in common. And one of the main things changed is the scheduler, which to me would make it a prime suspect. Anything I can do at all?

If there really isn't, then I am ready to throw in the towel by now, and find a way to downgrade this machine to Gutsy again. Or, if downgrading while keeping my data is not an option, try Redhat or Suse instead; they do have their own kernel builds which may be without this issue.

As it is, this machine is almost unusable, since I have to be constantly prepared for another lockup, and since anything putting a load on the machine increases the chance. The Gutsy kernel doesn't lock up, but suspend doesn't work, and it suffers rather horribly from that excess of kernel events (much worse than when I was actually running Gutsy), which means I soon have to reboot to get rid of it. I'm happy to keep at this as long as I have any kind of expectation that it will help resolve the issue, but if it doesn't help at all I'd rather have a functional system and give up on Hardy for now.

Jean-Baptiste Lallement (jibel) wrote :

Hi Janne,

From your kern.log, you've got an issue with the wifi driver iwl3945. This is at least one difference between Gutsy and Hardy. Gutsy was using the driver ipw3945 and Hardy has moved to iwl3945. This may cause IRQ or ACPI problems. I've had similar problems with an Intel 4965 which were solved with a driver upgrade.

May 19 22:46:15 mocha kernel: [ 685.863964] ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 18 (level, low) -> IRQ 18
May 19 22:46:15 mocha kernel: [ 685.864316] PM: Writing back config space on device 0000:03:00.0 at offset 1 (was 100002, writing 100006)
May 19 22:46:15 mocha kernel: [ 685.864827] firmware: requesting iwlwifi-3945-1.ucode
May 19 22:46:15 mocha kernel: [ 685.880526] iwl3945: iwlwifi-3945-1.ucode firmware file req failed: Reason -2
May 19 22:46:15 mocha kernel: [ 685.880556] iwl3945: Could not read microcode: -2
May 19 22:46:15 mocha kernel: [ 685.880826] ACPI: PCI interrupt for device 0000:03:00.0 disabled
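
The "Reason -2" in these lines follows the kernel convention of reporting a negated errno value, and errno 2 is ENOENT ("No such file or directory"), which means the firmware file could not be found. A minimal sketch that decodes such lines; the regular expression is tailored to the message format shown above and the function name is an assumption.

```python
import errno
import os
import re

# Two of the kern.log lines quoted above.
KERN_LOG_EXCERPT = """\
May 19 22:46:15 mocha kernel: [ 685.864827] firmware: requesting iwlwifi-3945-1.ucode
May 19 22:46:15 mocha kernel: [ 685.880526] iwl3945: iwlwifi-3945-1.ucode firmware file req failed: Reason -2
"""

FAILURE_RE = re.compile(r"(\S+): (\S+) firmware file req failed: Reason (-?\d+)")


def explain_firmware_failures(log_text):
    """Return (driver, firmware file, errno name, errno message) per failure line."""
    findings = []
    for driver, filename, reason in FAILURE_RE.findall(log_text):
        code = abs(int(reason))  # the kernel logs the negated errno
        name = errno.errorcode.get(code, "unknown")
        findings.append((driver, filename, name, os.strerror(code)))
    return findings


for finding in explain_firmware_failures(KERN_LOG_EXCERPT):
    print(finding)
```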

I don't have that hardware, so it's difficult for me to tell you the procedure for loading the microcode.

Can you try disabling the wifi card (killswitch on and iwl3945 blacklisted)?

Just after boot, check that the driver is not loaded and watch the kernel log to see whether the error message is still there. Tell us if your system still locks up.
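
Both checks can be scripted. Blacklisting is conventionally done with a `blacklist iwl3945` line in a file under /etc/modprobe.d/, and the list of loaded modules can be read from /proc/modules, where the module name is the first field of each line. The sketch below works on text passed in by the caller; the function names and the sample module list are hypothetical.

```python
# Made-up /proc/modules content for illustration; sizes and addresses are invented.
SAMPLE_MODULES = """\
iwl3945 91876 0 - Live 0xf8b2c000
mac80211 165652 1 iwl3945, Live 0xf8a5e000
"""


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def driver_messages(driver, log_text):
    """Return the kernel log lines emitted with the given driver prefix."""
    return [line for line in log_text.splitlines() if f"{driver}:" in line]


def report(driver, proc_modules_text, log_text):
    """Summarize whether the driver is loaded and how many log lines mention it."""
    loaded = driver in loaded_modules(proc_modules_text)
    return loaded, len(driver_messages(driver, log_text))
```

On a live system the two inputs would be the contents of /proc/modules and /var/log/kern.log; after a successful blacklist, `report("iwl3945", ...)` should show the driver as not loaded and no new messages from it.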

Thanks.

Janne Moren (jan-moren-gmail) wrote :

As I wrote earlier, that message I _only_ get when I test the 2.6.25 kernels, no doubt because that's an experimental kernel intended for 8.10 and so the microcode isn't included there yet. When running 2.6.24 I do not get that error message and wireless works fine.

I have also run with the killswitch on previously; it seems the hardware is still detected though. I'll try explicitly disabling the wireless drivers.

Janne Moren (jan-moren-gmail) wrote :

Disabling wireless as above made no difference at all. As before, all I need to do is load the system and soon this thing hits.

Would it be feasible to take the 2.6.24 kernel, say, and build it with a different scheduler?

Janne Moren (jan-moren-gmail) wrote :

Tried 2.6.24-18. It is the worst kernel so far. With -16 I can usually work for most of a day if I'm a bit careful about what I run (nothing that will peg the CPU for an extended period). With -18 I have not yet managed to stay up for more than two hours, and usually much less.

I have by the way also tried shifting between the "intel" and the "i810" drivers. No effect.

This is not the best system upgrade I have experienced to date.

Janne Moren (jan-moren-gmail) wrote :

2.6.24-19, same result.

I dug up my four-year-old notebook from the same series as the one I have. It is much the same, with Intel graphics, a Pentium M and so on. One major difference is that it is single-core, unlike this dual-core machine. I did a fresh install of Hardy on it. And no matter how much I push it (and it's not a terribly hard machine to push) it stays up, with none at all of the "stuttering" and other secondary issues I've been having.

That implies two possibilities: something goes wrong with the upgrade to Hardy from Gutsy; some file, some setting is left over and somehow manages to cause this. Or, this is an issue specifically with multi-core machines.

When time permits I will do a clean reinstall on this machine as well (oh joy...). If that resolves the issue, good. If not, I'm afraid I will have to start looking elsewhere for an OS to run; I've had this machine only semi-functional for almost two months now with no hint of a resolution so the situation is becoming untenable.

Revision history for this message
beefcurry (jonzwong) wrote :

I have this from the recent kernel updates in 8.04. All I can do is ration the time I spend on anything that might bump my CPU usage up. I have turned off my WLAN, which seems to contribute around 4C of heat; Xorg itself running can contribute a fair 5-6C. I can't isolate where it's heating up, since the RAM, GPU and CPU are really close together on my Dell XPS M1330. I guess all I can do now is carry on using the CLI or go back to older kernel versions. Heat recorded with "watch acpi -V".

Revision history for this message
IanG (ian-usts) wrote :

I'm unable to do a recent kernel update without the system shutting down because "critical temp" has been reached. Had an automatic shutdown, lost screen resolution and had to run dpkg --configure -a to correct errors.

Jun 17 10:31:03 ubuntu kernel: [ 4342.918510] ACPI: Critical trip point
Jun 17 10:31:03 ubuntu kernel: [ 4342.922346] ACPI: Unable to turn cooling device [f7c4d420] 'on'
Jun 17 10:31:15 ubuntu kernel: [ 4355.032824] ip6_tables: (C) 2000-2006 Netfilter Core Team

Revision history for this message
IanG (ian-usts) wrote :

Re my last comment - Linux ubuntu 2.6.24-19-386 #1 Wed Jun 4 15:54:02 UTC 2008 i686 GNU/Linux

I have to agree with Janne, this is the worst upgrade yet and I somehow wish I'd stayed with Gutsy for the time being. However, I am glad this is LTS, so I am hoping for Dapper-type robustness eventually, although I am considering a clean install.

Revision history for this message
IanG (ian-usts) wrote :

The problem has got worse since updating this morning to the above. Now with just a few sites open, Kaffeine running and Thunderbird open but with the system generally idling I get the following prior to a forced automatic system shut down:

Linux ubuntu 2.6.24-19-386 #1 Wed Jun 4 15:54:02 UTC 2008 i686 GNU/Linux

Jun 17 13:57:21 ubuntu kernel: [ 910.955800] ACPI: Critical trip point
Jun 17 13:57:21 ubuntu kernel: [ 910.959632] ACPI: Unable to turn cooling device [f7c4d420] 'on'
Jun 17 13:57:28 ubuntu kernel: [ 917.819421] ip6_tables: (C) 2000-2006 Netfilter Core Team
Jun 17 13:57:30 ubuntu exiting on signal 15

Revision history for this message
beefcurry (jonzwong) wrote :

Going back to the oldest Gutsy Kernel did not help much, still suffered from the strange freezes. Janne might be making a slight point about how temperature is not the only issue here.

Revision history for this message
IanG (ian-usts) wrote :

<sigh> This is getting silly:

Jun 19 09:30:22 ubuntu kernel: [ 5074.970741] ACPI: Critical trip point
Jun 19 09:30:22 ubuntu kernel: [ 5074.974575] ACPI: Unable to turn cooling device [f7c4d420] 'on'
Jun 19 09:30:29 ubuntu kernel: [ 5082.114430] ip6_tables: (C) 2000-2006 Netfilter Core Team
Jun 19 09:30:31 ubuntu exiting on signal 15

I feel like going back to Dapper. The sudden shutdowns can't be doing my hard drive any good, I'd have thought, unless they are handled properly.

Revision history for this message
Janne Moren (jan-moren-gmail) wrote :

IanG: The problem I've reported is not temperature related. It's connected to the level of activity on the machine (the more activity the more likely to freeze); I can make it happen while the temperature is still well below any danger level.

Also, this is not a kernel crash. What I think I'm seeing is basically starvation of CPU resources to userland processes, either because the scheduler is misbehaving in some instances (this is a problem common to all kernels with the new scheduler while I do not see it on older kernels), or perhaps because some kernel module goes into a busy loop and never relinquishes control.

I believe you are seeing a different bug.

Revision history for this message
Janne Moren (jan-moren-gmail) wrote :

I am once again open to the possibility that this may be temperature related. Mea Culpa.

It's getting hot at work so I got a really small desk fan, and it is pointed at the machine and myself. While it is running, the machine runs cool and I have had no lockups at all. A bit of experimentation at home gives similar results: without a fan, it will lock up a couple of times per day. With one blowing straight at the machine, it runs fine.

This is confusing. As I wrote above, logging of temperature has shown lockups even when the temperature is still somewhere below the warning zone at 88 degrees and far below the critical at 130. This gives rise to two questions:

* It is not conceivable that the machine would jump from about 85 degrees to 130 within about 10-20 seconds when it is not heavily loaded. If it is temperature related, it must be the 88 degree level that is causing trouble, alternatively some other kernel-specific temperature level.

* Normally it seems the machine does throttle itself once it reaches 88 degrees. Is there some in-kernel temperature monitoring that might respond in a bad way at that level and freeze user processes? It is worth noting that during a freeze the machine is running hotter, not cooler, as if it's busy-looping or something.

My temperature trip_points look like this:

critical (S5): 130 C
passive: 88 C: tc1=0 tc2=3 tsp=40 devices=CPU0 CPU1

What do the tc and tsp parameters mean and would there be some point in tweaking this subsystem somehow?
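
On the tc and tsp question: in the ACPI specification these names correspond to the passive-cooling objects _TC1, _TC2 and _TSP. _TSP is the sampling period in tenths of a second, so tsp=40 would mean a check every 4 seconds. The specification gives the change in performance as dP[%] = _TC1 * (Tn - Tn-1) + _TC2 * (Tn - Tt), where Tt is the passive trip point. A minimal Python sketch of that arithmetic, assuming the two-line trip_points format pasted above; the function names are made up for illustration, and whether the 2.6.24 kernel applies the formula exactly this way is an assumption, not something this thread confirms.

```python
import re


def parse_trip_points(text):
    """Split trip_points text (format as pasted above) into a dict per trip type."""
    points = {}
    for line in text.splitlines():
        # e.g. "critical (S5): 130 C" or "passive: 88 C: tc1=0 tc2=3 tsp=40 ..."
        m = re.match(r"\s*(\w+)(?:\s*\(\w+\))?:\s*(\d+)\s*C(.*)", line)
        if not m:
            continue
        entry = {"temp_c": int(m.group(2))}
        # Pick up numeric key=value pairs such as tc1=0; "devices=CPU0" is skipped.
        for key, value in re.findall(r"(\w+)=(\d+)", m.group(3)):
            entry[key] = int(value)
        points[m.group(1)] = entry
    return points


def passive_delta_percent(tc1, tc2, temp_now, temp_prev, trip):
    """ACPI passive cooling formula: positive means 'reduce performance by this much'."""
    return tc1 * (temp_now - temp_prev) + tc2 * (temp_now - trip)


sample = """critical (S5): 130 C
passive: 88 C: tc1=0 tc2=3 tsp=40 devices=CPU0 CPU1"""

trips = parse_trip_points(sample)
passive = trips["passive"]
# Two degrees above the 88 C trip point, with tc1=0 and tc2=3.
print(passive_delta_percent(passive["tc1"], passive["tc2"], 90, 89, passive["temp_c"]))
```

With tc1=0 the rate of temperature change is ignored entirely, so under this reading only the distance above 88 degrees drives the throttling.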

Revision history for this message
Janne Moren (jan-moren-gmail) wrote :

On a whim I tried updating the CPU firmware from Intel. Didn't really help completely as such (though it did perhaps stabilize it some; I'll do some comparisons with and without).

This time I was running "watch acpi -V" in a shell, and deliberately hitting the machine with some heavy processing (running GreycStoration on a large image). The machine stabilized at about:

Battery 1: charged, 99%
Thermal 1: ok, 84.0 degrees C
Thermal 2: passive , 87.0 degrees C
AC Adapter 1: on-line

It finally froze, with these values (well within the safe zone). I quickly stuck a desk fan onto the machine and waited a couple of minutes, after which the machine sprang to life again as if nothing had happened, now with temperature in the low 60's.
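
Since the freeze also stops "watch", the last reading before a lockup is easy to lose. A minimal Python sketch for pulling the thermal lines out of "acpi -V" output so they can be written to a log, assuming the output format pasted above (other acpi versions may print it differently); the name parse_thermal is made up for illustration.

```python
import re

# Matches lines such as "Thermal 2: passive , 87.0 degrees C".
THERMAL_RE = re.compile(r"Thermal (\d+):\s*([\w ]+?)\s*,\s*([\d.]+) degrees C")


def parse_thermal(output):
    """Return (zone number, state, temperature in C) for each thermal line."""
    readings = []
    for line in output.splitlines():
        m = THERMAL_RE.match(line.strip())
        if m:
            readings.append((int(m.group(1)), m.group(2), float(m.group(3))))
    return readings


sample = """Battery 1: charged, 99%
Thermal 1: ok, 84.0 degrees C
Thermal 2: passive , 87.0 degrees C
AC Adapter 1: on-line"""

for zone, state, temp in parse_thermal(sample):
    print(zone, state, temp)
```

Appending these tuples with a timestamp to a file every second or so would leave the final pre-freeze temperature on disk.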

No logs show anything odd; the only event recorded at this point was in syslog:

Jul 1 23:00:01 mocha /USR/SBIN/CRON[25369]: (root) CMD (test -x /usr/lib/atsar/atsa1 && /usr/lib/atsar/atsa1)

I don't know what that job is for (what needs those statistics?) but it doesn't seem harmful; and of course there is preload, which again shouldn't cause freezes like this.

Revision history for this message
Janne Moren (jan-moren-gmail) wrote :

The -19 kernel was decent; it will lock up userland whenever the machine gets too hot as above but it didn't run very hot. With a desk fan pointed at the machine I can at least get some work done.

With the latest kernel update (2.6.24-20-generic) I'm back to the issue described at the beginning, with "wakeup from idle" of about 20000 to 40000 per second. This kernel causes the machine to run at over 85 degrees when userland is completely idle, with an external fan working. As far as this overheating and freezing issue goes it is as bad as any kernel we've had so far. Possibly the worst.

On the positive side, whatever changes there are between the -19 and -20 kernel must be the ones responsible for this regression, making that issue (huge number of interrupts) feasible to diagnose at least.
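
One way to narrow down where the interrupts come from is to compare two snapshots of /proc/interrupts taken a few seconds apart under each kernel. A minimal Python sketch under stated assumptions: /proc/interrupts is my choice of data source (the thread does not say which tool produced the "wakeup from idle" figure, and timer-driven wakeups will not all show up there), and the function names and sample numbers are made up for illustration.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description)."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Per-CPU counters come first; some rows (e.g. ERR) have fewer of them.
        counts = [int(f) for f in fields[:ncpu] if f.isdigit()]
        table[label.strip()] = (sum(counts), " ".join(fields[len(counts):]))
    return table


def busiest(before, after, seconds, top=3):
    """Return the IRQs with the highest rate (per second) between two snapshots."""
    rates = []
    for label, (count, desc) in after.items():
        prev = before.get(label, (0, ""))[0]
        rates.append(((count - prev) / seconds, label, desc))
    rates.sort(reverse=True)
    return rates[:top]


# Invented sample data, 10 seconds apart.
snap1 = """           CPU0       CPU1
  0:       1000          0   IO-APIC-edge      timer
 16:        500        200   IO-APIC-fasteoi   uhci_hcd:usb1
ERR:          0"""
snap2 = """           CPU0       CPU1
  0:       3000          0   IO-APIC-edge      timer
 16:     100500     100200   IO-APIC-fasteoi   uhci_hcd:usb1
ERR:          0"""

for rate, label, desc in busiest(parse_interrupts(snap1), parse_interrupts(snap2), 10):
    print(label, rate, desc)
```

Running the same comparison on the -19 and -20 kernels at idle should show which interrupt line, if any, accounts for the difference.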

I would really, really appreciate at least a ping on this bug comment thread, to know that trying to report this is not utterly useless. If nobody even reads this there is no point in me trying to report it.

Revision history for this message
Janne Moren (jan-moren-gmail) wrote :

Final post: booting up, leaving machine idle, and only opening a terminal to check the temperature:

With 2.6.24-19-generic SMP, the temperature is 64 degrees.
With 2.6.24-20-generic SMP, the temperature is 85 degrees.

The temp at which the CPU shuts down is 89 or so.

Changed in linux:
importance: Undecided → Medium
status: Confirmed → Triaged
Revision history for this message
IanG (ian-usts) wrote :

Well said Janne. This has been a terrible kernel for overheating issues which left me simply giving up and waiting for another dist-upgrade. The most disappointing thing has been the general lack of mediation.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Revision history for this message
Janne Moren (jan-moren-gmail) wrote :

"If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test."

Where do I get them? Should I grab it from the Intrepid tree? Will those packages work without a lot of installation breakage on Hardy?

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Janne,

You'll need to enable the Intrepid repository to get the 2.6.27 kernel. If you are not familiar with how to do this I'd suggest waiting for Alpha5 to test which is set to be released this Thursday (Sept 4). Implications of using the 2.6.27 kernel with Hardy are not completely known nor supported at the moment as 2.6.27 is intended for use with Intrepid. Thanks.

Revision history for this message
Elod VALKAI (elod) wrote :

I simply upgraded to Intrepid last week, since I've got some time, and bandwidth to spare.

2.6.27 is stable on my laptop, which was crashing constantly with 2.6.24. Intrepid is a release I'm looking forward to.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Janne,

Would you be able to confirm this is resolved for you as well? Thanks.

Changed in linux:
status: Triaged → Incomplete
Revision history for this message
Janne Moren (jan-moren-gmail) wrote :

I will install 8.10 once it's been released and out for a while. I don't want another six months like this so I'll be monitoring user reports on machines similar to mine for a while before changing anything. Especially so since there was a bad-looking bug on Wifi hardware that might affect me, and that seems to be worked around only, and not fixed. When I can be reasonably confident things will be better than Hardy I will delete and reinstall.

So no, I can't confirm. I still have it on Hardy. I do not know yet whether Intrepid fixes it.

Revision history for this message
Janne Moren (jan-moren-gmail) wrote :

I've upgraded to Intrepid and the issue seems largely resolved. The machine still tends to run hot, but not excessively so. More important, the power regulation now works properly so that the system slows down in an orderly manner whenever the temperature rises too high. I've tried to stress it quite severely and while the whole machine slows down to a crawl it never locks up or freezes.

As far as I am concerned this bug is resolved. And a big thank you to whoever managed to resolve these issues for Intrepid.

Revision history for this message
IanG (ian-usts) wrote :

I've not been so lucky. At least three times since upgrading to Intrepid I've had automatic shutdowns due to kernel overheating problems, although it has settled down of late. The jury's still out for me.

Revision history for this message
zeddock (zeddock) wrote :

My problem still exists on Dell Inspiron 8500 on 8.10, kernel 2.6.27

zeddock

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

@IanG and @zeddock, care to open new bug reports? It's likely this issue is hardware specific. Janne, the original bug reporter, has commented that this appears to be resolved for Intrepid. Thanks.

Changed in linux:
status: Incomplete → Fix Released
Revision history for this message
zeddock (zeddock) wrote : Re: [Bug 223081] Re: Hardy kernel causes overheating

It may have appeared to have been fixed in Intrepid... but it is not.

Thanx,

zeddock

Revision history for this message
Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.
