ata timeout exception with ahci libata driver (was with 2.6.28-11, but i confirmed it affects previous kernels too)

Bug #352197 reported by José Tomás Atria
This bug affects 8 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: High
Assigned to: Stefan Bader

Bug Description

Binary package hint: linux-image-2.6.28-11-generic

package: linux-image-2.6.28-11
release: Jaunty Beta
latest version tested: 2.6.28-11.38

Booting the new 2.6.28-11 kernel on Jaunty Beta produces many ata timeout errors (like the one copied below) that make the system unusable. This looks much more serious than https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343919, and different from it: as far as I am aware the filesystem is not ext4, the disk controller is not an nvidia one, there are many more than ~10 errors, and the problem is trivially easy to reproduce (it happens every time, from the first disk reads onward).

dmesg error message:
[ 75.816051] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 75.816135] ata1.00: cmd ca/00:10:50:27:1e/00:00:00:00:00/e9 tag 0 dma 8192 out
[ 75.816137] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 75.816275] ata1.00: status: { DRDY }
[ 75.816337] ata1: hard resetting link
[ 76.804036] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 76.806704] ata1.00: configured for UDMA/100
[ 76.806715] ata1: EH complete
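
To make the log above easier to read, here is a minimal Python sketch that pulls the command opcode out of a libata "cmd" line and maps it to a name. The opcode table is only a small hand-picked subset of the ATA command set, and the helper name decode_cmd is hypothetical (it is not part of any tool mentioned in this report).

```python
import re

# Small, hand-picked subset of ATA command opcodes (not a complete table).
ATA_OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0x60: "READ FPDMA QUEUED (NCQ)",
    0x61: "WRITE FPDMA QUEUED (NCQ)",
    0xC8: "READ DMA",
    0xCA: "WRITE DMA",
}

# Matches e.g. "ata1.00: cmd ca/00:10:50:27:1e/00:00:00:00:00/e9 tag 0 dma 8192 out"
CMD_RE = re.compile(
    r"(?P<dev>ata\d+\.\d+): cmd (?P<op>[0-9a-f]{2})/\S+ tag (?P<tag>\d+)"
    r"(?: (?P<proto>\w+))? (?P<bytes>\d+) (?P<dir>in|out)"
)


def decode_cmd(line):
    """Decode one libata 'cmd' line; return None for any other line."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    op = int(m.group("op"), 16)
    return {
        "device": m.group("dev"),
        "opcode": op,
        "name": ATA_OPCODES.get(op, "unknown"),
        "tag": int(m.group("tag")),
        "protocol": m.group("proto"),
        "bytes": int(m.group("bytes")),
        "direction": m.group("dir"),
    }


line = "[ 75.816135] ata1.00: cmd ca/00:10:50:27:1e/00:00:00:00:00/e9 tag 0 dma 8192 out"
print(decode_cmd(line))
```

For the line quoted in this report, the opcode 0xca decodes to WRITE DMA, i.e. a plain (non-NCQ) 8192-byte write that timed out.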

Note that booting 2.6.27 does not produce any disk error. smartctl reports no disk health problems, either.
[update] the bug is present in previous versions too, and it only appears when using the ahci driver for libata (see lspci -vv output below).

lspci -vv for IDE controllers:
00:1f.1 IDE interface: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) IDE Controller (rev 03) (prog-if 8a [Master SecP PriP])
        Subsystem: Sony Corporation Device 81b9
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 18
        Region 0: I/O ports at 01f0 [size=8]
        Region 1: I/O ports at 03f4 [size=1]
        Region 2: I/O ports at 0170 [size=8]
        Region 3: I/O ports at 0374 [size=1]
        Region 4: I/O ports at 1880 [size=16]
        Kernel driver in use: ata_piix
        Kernel modules: ata_piix

00:1f.2 IDE interface: Intel Corporation 82801FBM (ICH6M) SATA Controller (rev 03) (prog-if 8f [Master SecP SecO PriP PriO])
        Subsystem: Sony Corporation Device 81ba
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 18
        Region 0: I/O ports at 18c0 [size=8]
        Region 1: I/O ports at 18b8 [size=4]
        Region 2: I/O ports at 18b0 [size=8]
        Region 3: I/O ports at 1894 [size=4]
        Region 4: I/O ports at 18a0 [size=16]
        Region 5: Memory at 80004400 (32-bit, non-prefetchable) [size=1K]
        Capabilities: <access denied>
        Kernel driver in use: ata_piix
        Kernel modules: ahci, ata_piix

Leann Ogasawara (leannogasawara) wrote :

Hi Gorgonzola,

Can you comment on whether any of the previous 2.6.28 Jaunty kernels exhibited this behavior? Would you care to give the upstream kernel a quick test as well? The Ubuntu kernel team has started packaging the upstream kernels - https://wiki.ubuntu.com/KernelMainlineBuilds . Thanks.

Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
José Tomás Atria (jtatria) wrote : Re: [Bug 352197] Re: ata timeout exception with linux 2.6.28-11

Hello Leann:

Thanks for your reply. I can now confirm that this bug, on this
particular box, affects other kernels previous to 2.6.28. I am now
experiencing the same behaviour in an Intrepid install with kernel
2.6.27-14.

2.6.27-7 -> ok (intrepid)
2.6.27-11 -> ok (intrepid)
2.6.27-14 -> ata timeout. (intrepid)
2.6.27-11 -> ok (jaunty)
2.6.28-11 -> ata timeout (jaunty)

So it would seem that the bug appeared somewhere between 2.6.27-11 and 2.6.27-14

The computer is a VAIO VGN-s560p; it has a 120 GB Hitachi drive.

--
entia non sunt multiplicanda praeter necessitatem

tags: added: intrepid regression-release
José Tomás Atria (jtatria) wrote :

I've had the chance to further examine this bug on my box. I've found
some things that could even make the bug report not applicable, but
i'd like to ask for advice on further testing:

1.- the problem is not exclusive to the kernel versions mentioned
above, it just hadn't appeared before on the ones marked ok. I have
experienced the same problem with all tested kernels from 2.6.27-7 and
above.

2.- i have experienced the same problem with kubuntu hardy, intrepid
and jaunty beta install disks, as well as debian lenny.

3.- the problem is completely erratic. sometimes it appears at boot,
sometimes it appears on x startup, sometimes only after a few minutes
after the session login. It seems to be related to high disk IO, but i
could not be certain of this, ie i don't know if the problem appears
with the machine "idle".

4.- but most important of all, i have solved it, apparently:
blacklisting ahci in /etc/modprobe.d/blacklist makes the problem go
away, by forcing libata to use the ata_piix driver for all drives (I'm
sure there must be a more elegant way of doing this, but after
googling for days on "how to disable ahci" i gave up and just
blacklisted the module to test it).
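
The workaround in point 4 amounts to a single "blacklist ahci" line in the modprobe configuration. As a rough sketch, the following Python checks whether a given configuration text blacklists a module. The helper name blacklisted_modules is hypothetical, and only the plain "blacklist <module>" directive is handled (treating anything after a "#" as a comment is a simplifying assumption).

```python
def blacklisted_modules(config_text):
    """Collect module names from 'blacklist <module>' lines, ignoring comments."""
    modules = set()
    for raw in config_text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            modules.add(parts[1])
    return modules


# Sample content, as it might appear in /etc/modprobe.d/blacklist
sample = """
# force libata to fall back to ata_piix on this machine
blacklist ahci
"""
print("ahci" in blacklisted_modules(sample))
```

Note that this only has an effect when ahci is built as a loadable module; it does nothing for a driver that is compiled into the kernel.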

Now, the strange part is that this configuration should be properly
supported by ahci, as the ICH6M controller is thoroughly tested, and
the Hitachi disk that was causing problems is not particularly
esoteric... but i wouldn't know how to test this further, much less
change ahci config params to see if i can get this drive on this
controller to work (if there are any params that could make the disk
behave).

In any case, the bug (?) seems to be located in the AHCI module, and
it seems to affect this HW configuration only, so i don't know if it
makes any sense keeping it open.

I'm willing to try anything suggested if it helps to isolate the issue.

thanks for your time!


Stefan Bader (smb)
Changed in linux (Ubuntu):
assignee: nobody → stefan-bader-canonical
description: updated
summary: - ata timeout exception with linux 2.6.28-11
+ ata timeout exception with ahci libata driver (was with 2.6.28-11, but i
+ confirmed it affects previous kernels too)
José Tomás Atria (jtatria) wrote :

I updated the bug desc and title to reflect new findings. The issue is definitely located in the AHCI driver, as per my previous comments, and it affects all kernels tested (2.6.27-* and 2.6.28-* on intrepid, jaunty beta, jaunty release and debian lenny up to now).

The workaround i used works for 2.6.27 kernels, as ahci is loaded as a module and can be blacklisted in /etc/modprobe.d/blacklist.

Unfortunately, 2.6.28 kernels in ubuntu (haven't tested with debian again) no longer load ahci (or any libata driver, for that matter) as a module; the drivers are now built into the kernel, so my workaround is useless.

Any help on how to disable ahci when it's built in (ie, force the kernel to use ata_piix anyway) would be greatly appreciated. note that i do not have any intention of recompiling the kernel :)

If you need more information, please let me know.

Cedric Schieli (cschieli) wrote :

I can confirm this bug on my Dell Vostro 1710 running Jaunty (2.6.28-11-{generic,server}).
It occurs only once on every boot, when GDM starts (and thus does its prefetch), provoking a ~30s blank screen with heavy disk activity before the login screen.

[ 80.040094] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[ 80.040106] ata1.00: cmd 60/88:00:25:76:72/00:00:08:00:00/40 tag 0 ncq 69632 in
[ 80.040107] res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[ 80.040112] ata1.00: status: { DRDY }
[ 80.040118] ata1.00: cmd 60/08:08:7d:63:d4/00:00:07:00:00/40 tag 1 ncq 4096 in
[ 80.040120] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040124] ata1.00: status: { DRDY }
[ 80.040130] ata1.00: cmd 60/08:10:8d:cb:81/00:00:07:00:00/40 tag 2 ncq 4096 in
[ 80.040132] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040136] ata1.00: status: { DRDY }
[ 80.040142] ata1.00: cmd 61/08:18:b5:64:f8/00:00:07:00:00/40 tag 3 ncq 4096 out
[ 80.040144] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040147] ata1.00: status: { DRDY }
[ 80.040153] ata1.00: cmd 61/08:20:45:e1:f8/00:00:07:00:00/40 tag 4 ncq 4096 out
[ 80.040155] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040159] ata1.00: status: { DRDY }
[ 80.040165] ata1.00: cmd 61/08:28:ad:d4:f9/00:00:07:00:00/40 tag 5 ncq 4096 out
[ 80.040167] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040171] ata1.00: status: { DRDY }
[ 80.040177] ata1.00: cmd 61/08:30:c5:58:fc/00:00:07:00:00/40 tag 6 ncq 4096 out
[ 80.040178] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040188] ata1.00: status: { DRDY }
[ 80.040191] ata1.00: cmd 61/18:38:4d:e1:f8/00:00:07:00:00/40 tag 7 ncq 12288 out
[ 80.040192] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040193] ata1.00: status: { DRDY }
[ 80.040196] ata1.00: cmd 61/10:40:25:fa:f6/00:00:07:00:00/40 tag 8 ncq 8192 out
[ 80.040197] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040199] ata1.00: status: { DRDY }
[ 80.040201] ata1.00: cmd 61/10:48:45:fa:f6/00:00:07:00:00/40 tag 9 ncq 8192 out
[ 80.040202] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040204] ata1.00: status: { DRDY }
[ 80.040207] ata1.00: cmd 61/08:50:5d:fa:f6/00:00:07:00:00/40 tag 10 ncq 4096 out
[ 80.040208] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040210] ata1.00: status: { DRDY }
[ 80.040212] ata1.00: cmd 61/08:58:6d:fa:f6/00:00:07:00:00/40 tag 11 ncq 4096 out
[ 80.040213] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040215] ata1.00: status: { DRDY }
[ 80.040218] ata1.00: cmd 61/08:60:85:fa:f6/00:00:07:00:00/40 tag 12 ncq 4096 out
[ 80.040218] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.040220] ata1.00: status: { DRDY }
[ 80.040223] ata1.00: cmd 61/28:68:dd:20:f7/00:00:07:00:00/40 tag 13 ncq 20480 out
[ 80.040224] res 40/00:00:00:00:00/00:00:00:00:00/00 Emas...
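
The "SAct 0x7fffffff" field in the first line of this log is a bitmask with one bit per NCQ tag that was still outstanding when the port froze. A small Python sketch (the helper name active_tags is hypothetical) shows how to turn that mask into a list of tag numbers:

```python
def active_tags(sact):
    """Return the NCQ tag numbers whose bit is set in an SAct bitmask."""
    return [tag for tag in range(32) if sact & (1 << tag)]


tags = active_tags(0x7FFFFFFF)
print(len(tags), tags[0], tags[-1])  # prints "31 0 30"
```

So in this log 31 queued commands (tags 0 to 30) were in flight at the time of the timeout, whereas the original report shows "SAct 0x0", i.e. no NCQ commands at all.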


Mihai Tanasescu (mihai-duras) wrote :

Same problem here:

~$ uname -a
Linux Mihai 2.6.28-12-server #43-Ubuntu SMP Fri May 1 20:28:32 UTC 2009 i686 GNU/Linux

Luckily, if I boot the computer with a CD in the drive, everything works.

Otherwise no luck in reading anything.

ShinobiTeno (lct-mail) wrote :

Same problem:

Xubuntu 9.04, Asus A6Rp laptop, CoreDuoT2060, Hitachi HTS541080G9AT00

uname:
2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009 i686 GNU/Linux

Note that the Hitachi is a PATA drive. lspci -vv reports:
00:14.1 IDE interface: ATI Technologies Inc IXP SB400 IDE Controller (rev 80) (prog-if 8a [Master SecP PriP])
 Subsystem: ASUSTeK Computer Inc. Device 1397
 Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 64
 Interrupt: pin A routed to IRQ 16
 Region 0: I/O ports at 01f0 [size=8]
 Region 1: I/O ports at 03f4 [size=1]
 Region 2: I/O ports at 0170 [size=8]
 Region 3: I/O ports at 0374 [size=1]
 Region 4: I/O ports at ff00 [size=16]
 Capabilities: [70] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable-
  Address: 00000000 Data: 0000
 Kernel driver in use: pata_atiixp

Yet the sd(a) naming scheme is used.

I'm unable to switch ahci mode off in the BIOS (the option does not exist).

Symptoms match 100% what Gorgonzola wrote on 2009-04-06.
The system freezes randomly on boot or on poweroff, at random times, and up to several times in a row.
Sometimes I get dropped to busybox because of a root filesystem timeout.

Once X has started, the system is usable, but disk performance is extremely sluggish: dd if=/dev/sda2 of=/dev/null bs=1GB count=1 reports an average speed of 10 MB/s....

Can someone point to some kind of temporary fix for this, please? :/

ShinobiTeno (lct-mail) wrote :

Add: the disk has been under load for 48h and has completed a long SMART test. The SMART table shows no errors or bad/transferred sectors whatsoever.

lspci -vv: http://nopaste.info/0dc0f76223.html
dmesg : http://nopaste.info/3d69c55172.html
hdparm -iI /dev/sda: http://nopaste.info/2667d93861.html
smartctl --all /dev/sda: http://nopaste.info/1f78922cd0.html

qwerty (escalantea) wrote :

I had the same problem (SATA disk freezing), and I solved it by tuning the pdflush parameters ... https://bugs.launchpad.net/ubuntu/+bug/270794/comments/12

Joachim R. (jro) wrote :

I have the same problem on an IDE DVD writer. It occurs when inserting some writable medium (it doesn't occur with CD-ROM or DVD-ROM). After that, it is impossible to eject the disk, either with the "eject" command or with the device button.
It could be related, as the syslog output is the same.

Stefan Bader (smb) wrote :

There seem to be multiple issues mixed into one report. The initial report shows problems on generic write commands without any NCQ. The ICH6 controller used seems to use the same PCI ID for both piix and ahci modes (according to the Jaunty code).
The second one fails on DMA read/writes with NCQ involved (so might have a controller in ahci mode).

@José, could you try to compare the behavior against a recent mainline kernel
https://wiki.ubuntu.com/KernelMainlineBuilds. Also, please supply "sudo lspci -vvvnnxxx" and the full dmesg from the boot.

@Cedric, could you please open another bug report (with ubuntu-bug linux, so it gets automatic information attached). And if possible also compare the results with a recent mainline kernel. Maybe experiment with "libata.force=noncq"

@ShinobiTeno, this is also a different controller. Have you tried to limit DMA speed with "libata.force=udma/66" to see whether this changes things?

@Joachim, this definitely sounds like a different problem. Could you please open another report with "ubuntu-bug linux"? Thanks.
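
When trying the "libata.force=..." suggestions above, it helps to confirm that the option really reached the running kernel. The Python sketch below parses a kernel command line string and lists the libata.force entries it contains. It assumes the comma-separated "[ID:]VAL" form described in the kernel parameter documentation, and the helper name libata_force_options is hypothetical.

```python
def libata_force_options(cmdline):
    """Return (target, value) pairs from libata.force= on a kernel command line.

    target is None when the value applies to all ports and devices.
    """
    prefix = "libata.force="
    options = []
    for word in cmdline.split():
        if not word.startswith(prefix):
            continue
        for item in word[len(prefix):].split(","):
            if not item:
                continue  # tolerate a stray trailing comma
            target, sep, value = item.rpartition(":")
            options.append((target if sep else None, value))
    return options


print(libata_force_options("root=/dev/sda1 ro quiet libata.force=noncq"))
```

On a live system the input would typically come from /proc/cmdline, for example libata_force_options(open("/proc/cmdline").read()).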

José Tomás Atria (jtatria) wrote : Re: [Bug 352197] Re: ata timeout exception with ahci libata driver (was with 2.6.28-11, but i confirmed it affects previous kernels too)

@stefan, I have already tested the behaviour with the mainline kernels
closest to 2.6.27-14 and 2.6.28-11. I have not tested any version
after 2.6.28-11. In all these cases, using the ahci module produces the
timeouts, but ata_piix works fine. If only i could tell the 2.6.28
kernels to use ata_piix instead of ahci, i'd take the bug as solved
through a workaround.

If you tell me how i can recover the complete dmesg log from the boot
sequence, i'd be happy to provide one. actually, i'd love it if you
could explain how i can recover the complete boot messages that
typically go to tty8 after boot, as the ata timeouts are also printed
there and that would show at which point in the boot sequence the
disk starts to hang.

In any case, i'm losing all hope with this bug, and i am still not
sure if its not hardware related, but i have tested with smart tools
and in windows and the drive seems to behave.

thanks for your interest.


Stefan Bader (smb) wrote : Re: [Bug 352197] Re: ata timeout exception with ahci libata driver (was with 2.6.28-11, but i confirmed it affects previous kernels too)

> closest to 2.6.27-14 and 2.6.28-11. I have not tested any version
> after 2.6.28-11. In all this cases, using the ahci module produces the

A mainline kernel farther away (2.6.30) would be of interest as it could show
whether this is an issue upstream.

> timeouts, but ata_piix works fine. If only i could tell the 2.6.28
> kernels to use ata_piix instead of ahci, i'd take the bug as solved
> through a workaround.

It looks a bit strange as your lspci looks like ahci and ata_piix are loaded
but only ata_piix shows up as being the driver. Reading the code for the ahci
driver they have a comment in there that in both modes ahci and ide the
controller reports the same ids. So they do a test to see whether to use ahci
mode or not. I'd like to check with the lspci command mentioned above and the
dmesg what this decision might be. Actually, do you have the ability to change
the controller mode in the BIOS or is that hidden?

> If you tell me how i can recover the complete dmesg log from the boot
> sequence, i'd be happy to provide one. actually, i'd love it if you

I would check whether it is possible to save a dmesg to a USB stick. But that is
probably prevented by the problems with the disk and the system trying to log
the login. Depending on whether you have a second computer, netconsole might be
an option.

> sure if its not hardware related, but i have tested with smart tools
> and in windows and the drive seems to behave.

Windows normally would not use the ahci mode by default. And as you say, you
have no problems when blacklisting the ahci module. The question would be
whether the ahci driver for some reason believes the controller is in ahci mode
while it isn't.

Tormod Volden (tormodvolden) wrote :

Just want to add the link to https://wiki.ubuntu.com/KernelTeam/Netconsole which I will also try myself since I might have the same problem (reported in bug 397096).

Matthew Geier (matthew-acfr) wrote :

I'll add a 'me too' to this one.
Patched-up jaunty system used as a Mythtv box.

System was fine with a 250GB Seagate Bara which failed. I replaced it with a new 1TB seagate bara and now get these errors under heavy read load.
 2.6.30 doesn't help - I got several errors while the Nvidia installer was running.

 Just moving through the recordings menu in Mythtv will trigger it - really anything that generates high disk activity seems to.
disk - ST31000528AS

00:1f.2 IDE interface [0101]: Intel Corporation 82801FB/FW (ICH6/ICH6W) SATA Controller [8086:2651] (rev 03) (prog-if 8a [Master SecP PriP])
        Subsystem: Giga-byte Technology Device [1458:2651]
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin B routed to IRQ 19
        Region 0: I/O ports at 01f0 [size=8]
        Region 1: I/O ports at 03f4 [size=1]
        Region 2: I/O ports at 0170 [size=8]
        Region 3: I/O ports at 0374 [size=1]
        Region 4: I/O ports at f000 [size=16]
        Capabilities: [70] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: ata_piix

Rizlaw (rizlaw) wrote :

Another "me, too" for this bug.
Ubuntu 9.04 64bit - SATA AHCI mode - ext4 fs
WD Velociraptor (ext4fs: /, Swap, /Data)
WD Caviar Black 1TB (ext4fs: /home)
WD 640GB (ext4fs: /Backup)

I recently built a new system using an EVGA 760 Classified MB, i7 920cpu, 12GB triple ch DDR3 and 300GB Velociraptor. I installed the OS after setting the BIOS to SATA AHCI mode (needed hot swapping capabilities). Everything worked fine for about 1 month until last night. I was doing a very large (200GB+) grsync backup of my /home partition to my WD640GB backup drive. At some point in the backup the system froze. No combination of keyboard tricks (ctl+alt+backspace, or alt+print screen+REISUB) would restart X or reboot the OS. I had to resort to killing power.

On reboot, one of my BIOS screens reported "Port 00:" during what I will call the AHCI check. The Velociraptor system drive would not boot. When I went back into the BIOS setup, I saw that the BIOS did not even see the Velociraptor. I assumed the crash and hard power down had corrupted some critical system files. I removed the Velociraptor boot drive and replaced it with a WinXP 64 boot drive (also set for AHCI mode) and was able to boot up a Windows system with no problem. I then tested the Velociraptor with SpinRite 6 and it's a good drive.

I decided to try and reboot the Ubuntu system by changing the BIOS SATA setting to "IDE" mode. This worked. Ubuntu was able to start after it found a few errors on the / partition, corrected them, and finished booting. Ubuntu 9.04/64 now seems to be working fine in "IDE" mode.

After reading through this thread, it seems that I was bitten by this AHCI bug while doing the very large rsync file backup, although I have to add that this wasn't the first time I did such a large backup on this new system; the earlier ones ran without problems.

Massimo Bilvi (massimo-bilvi) wrote :

Today I have upgraded to 9.04 and I have the same errors by using the kernel 2.6.28-15-generic.
No error with kernel 2.6.24-24-generic.

stecklum (stecklum) wrote :

Folks, like many of you I was plagued by the timeout issue, mostly when booting or resuming from standby/hibernate. I am running 8.10 on a Dell XPS M1330. I recently updated the kernel from 2.6.27-12 to 2.6.28-15 and since then the freeze has occurred more often. I followed the recommendations in this thread, which did not cure the problem: I tried blacklisting ahci and using ata_piix in the initramfs, as well as switching from AHCI to ATA in the BIOS. So it seems that the cause is not a fault in the ahci driver.

By chance I came across a similar discussion in the Nvidia forum. There it was suggested that the problem might be due to having the nvidia and ahci modules on the same IRQ, see http://www.nvnews.net/vbulletin/showthread.php?t=123583 and http://www.nvnews.net/vbulletin/showthread.php?t=126152. I followed the advice by zkmyth to ensure that nvidia does not have to share an interrupt, and this seems to have solved the problem (see attached /proc/interrupts). No freezes since then.

stecklum (stecklum) wrote :

lspci-vv follows

stecklum (stecklum) wrote :

After two days without a freeze my laptop got hit by another one today. So the recipe was not the final solution, although, overall, it looks like an improvement.

Zgth (zygoth) wrote :

I tried to set up Ubuntu Karmic Beta, but faced a similar problem with my Asus P5W DH Deluxe motherboard. AHCI is turned on in the BIOS. I tried loading the libata module with ACPI both on and off, since many forums indicate that the "status: { DRDY }" errors may be due to a faulty ACPI implementation in libata, but it didn't help. The console was still being spammed with these messages.

The motherboard has 4 (sort of) SATA ports on the South Bridge, one of which is hard-wired to a Silicon Image 4723 (EZ-RAID) fakeraid controller with 2 ports plus 2 SATA (1 eSATA + 1 normal SATA) ports on the JMicron JMB362/JMB363. I have 4x320GB HDDs on a RAID-5 mdadm-array + 1x1TB HDD for backup and long-term storage.

No matter what HDD I plug into the right port of EZ-RAID controller (turned "off" using jumpers, so that it acts as a pass-through/splitter), I start getting messages like this

[ xx.xxxxxx] ata2.00: exception Emask <...> frozen
<...>
[ xx.xxxxxx] ata2.00: status: { DRDY }
[ xx.xxxxxx] ata2: hard resetting link
[ xx.xxxxxx] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ xx.xxxxxx] ata2.00: configured for UDMA/133
[ xx.xxxxxx] ata2: EH complete

every few minutes.

The problem is solved if I don't use the port provided by Sil 4723, then it just displays

[ 1.804020] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1.804071] ata2.00: unsupported device, disabling
[ 1.804074] ata2.00: disabled

in the beginning, but that way I get to use only 4 SATA ports (3 from ICH7R + 1 from JMicron) instead of 5.

I tried both server and desktop install CDs, but the problem persists until I stop using the EZ-RAID port.

tarabe_22 (tarabe22) wrote :

I get so many disk timeouts that the system is unusable. The problem occurs on a Sony laptop (lspci follows).

root@modugno:~# lspci
00:00.0 Host bridge: Intel Corporation Mobile 915GM/PM/GMS/910GML Express Processor to DRAM Controller (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
00:1b.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #3 (rev 03)
00:1d.3 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #4 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB2 EHCI Controller (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev d3)
00:1f.0 ISA bridge: Intel Corporation 82801FBM (ICH6M) LPC Interface Bridge (rev 03)
00:1f.1 IDE interface: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) IDE Controller (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801FBM (ICH6M) SATA Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) SMBus Controller (rev 03)
06:05.0 CardBus bridge: Texas Instruments PCI7420 CardBus Controller
06:05.2 FireWire (IEEE 1394): Texas Instruments PCI7x20 1394a-2000 OHCI Two-Port PHY/Link-Layer Controller
06:05.3 Mass storage controller: Texas Instruments PCI7420/7620 Combo CardBus, 1394a-2000 OHCI and SD/MS-Pro Controller
06:08.0 Ethernet controller: Intel Corporation 82562ET/EZ/GT/GZ - PRO/100 VE (LOM) Ethernet Controller Mobile (rev 03)
06:0b.0 Network controller: Intel Corporation PRO/Wireless 2200BG Network Connection (rev 05)

root@modugno:~# lspci -vvv -s 00:1f.1
00:1f.1 IDE interface: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) IDE Controller (rev 03) (prog-if 8a [Master SecP PriP])
 Subsystem: Sony Corporation Unknown device 81b9
 Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
 Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
 Latency: 0
 Interrupt: pin B routed to IRQ 21
 Region 0: I/O ports at 01f0 [size=8]
 Region 1: I/O ports at 03f4 [size=1]
 Region 2: I/O ports at 0170 [size=8]
 Region 3: I/O ports at 0374 [size=1]
 Region 4: I/O ports at 1810 [size=16]

root@modugno:~# lspci -vvv -s 00:1f.2
00:1f.2 IDE interface: Intel Corporation 82801FBM (ICH6M) SATA Controller (rev 03) (prog-if 8f [Master SecP SecO PriP PriO])
 Subsystem: Sony Corporation Unknown device 81ba
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
 Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
 Latency: 0
 Interrupt: pin B routed to IRQ 21
 Region 0: I/O ports at 18c8 [size=8]
 Region 1: I/O ports at 1...


Ivan Dorna (ivan-dorna) wrote :

Mainboard (and controllers):
http://en.gentoo-wiki.com/wiki/Asus_M3A32-MVP_Deluxe/WiFi-AP#ATI_Technologies_Inc_SB600_IDE

system:
Linux bvs003 2.6.28-16-server #55-Ubuntu SMP Tue Oct 20 20:37:10 UTC 2009 x86_64 GNU/Linux

problem still present.

solved some acpi problems (hw related) at boot with: irqfixup pci=routeirq acpi_irq_balance noapic iommu=force

The md device freezes continuously under load and performance is really poor.
In some tests with dd and iostat, dd works and reports 103-124 MB/s (read), while iostat at the same time shows a change from 0 MB/s (read) to only 0.06-0.09 MB/s (!!!)

something wrong in messaging?

ahci mode set from the bios (on 4 disks of 6, enabled on the #2 controller, recognised as the first controller by the os) -> same problem

system on 2 hd with ahci disabled (the #1 controller, recognised as the second by the os, has only raid and ide settings) -> same problem

on heavy copies of data (virtual machines) the system progressively freezes, and the system md device and the data destination md device corrupt the raid. The system crashes.

no ahci modules loaded, fresh ubuntu server 9.04 (jaunty) installed and upgraded to last packages, on 02/11/2009.

ANY SUGGESTIONS?

Scott Robinson (scott-ubuntu) wrote :

I can confirm this bug too. I can trigger it fairly reliably by increasing CPU load.

tags: added: jaunty karmic
stecklum (stecklum) wrote :

I recently switched to 2.6.31 and had the impression that the timeout behavior got worse. Anyway, I stumbled upon this post http://www.nvnews.net/vbulletin/showthread.php?t=149171 which suggests using "options nvidia NVreg_EnableMSI=1" in /etc/modprobe.d/options to cure freeze problems. That's what I did, and now the timeout issue does indeed seem to be gone. It looks like it is in fact caused by interrupt problems between the SATA and NVIDIA drivers, which go away with MSI. /proc/interrupts is listed below for completeness:

           CPU0       CPU1
  0:    1184963      71677   IO-APIC-edge      timer
  1:       3085         90   IO-APIC-edge      i8042
  8:          0          1   IO-APIC-edge      rtc0
  9:          8          2   IO-APIC-fasteoi   acpi
 12:      10773       3051   IO-APIC-edge      i8042
 14:      96639       5469   IO-APIC-edge      ata_piix
 15:          0          0   IO-APIC-edge      ata_piix
 18:          7          4   IO-APIC-fasteoi   mmc0
 19:        155          4   IO-APIC-fasteoi   firewire_ohci
 20:      56051       1330   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb5
 21:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4, uhci_hcd:usb6
 22:       1887         82   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb7
 29:      56631       3275   PCI-MSI-edge      ahci
 30:          0          0   PCI-MSI-edge      HDA Intel
 31:       5445       9636   PCI-MSI-edge      eth0
 32:         64         69   PCI-MSI-edge      iwlagn
 33:     186568       1057   PCI-MSI-edge      nvidia
NMI:          0          0   Non-maskable interrupts
LOC:     198888     518921   Local timer interrupts
SPU:          0          0   Spurious interrupts
CNT:          0          0   Performance counter interrupts
PND:          0          0   Performance pending work
RES:     146977     286978   Rescheduling interrupts
CAL:       5637      20157   Function call interrupts
TLB:        766       1110   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:         36         35   Machine check polls
ERR:          0
MIS:          0
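
To check a listing like the one above for interrupt sharing, a short Python sketch can pick out the IRQ lines that serve more than one device. It assumes the older /proc/interrupts layout shown in this comment (one count per CPU, then the interrupt type, then a comma-separated device list); the helper name shared_irqs and the abbreviated sample are only for illustration.

```python
def shared_irqs(text):
    """Map IRQ number -> device list for lines that serve more than one device."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header line: "CPU0 CPU1 ..."
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue  # skip NMI, LOC, ERR and similar summary rows
        # ncpu counters, one interrupt-type column, then the device names
        fields = rest.split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue
        devices = [d.strip() for d in fields[ncpu + 1].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


# Abbreviated copy of the listing in this comment.
sample = """
           CPU0       CPU1
 14:      96639       5469   IO-APIC-edge      ata_piix
 20:      56051       1330   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb5
 29:      56631       3275   PCI-MSI-edge      ahci
 30:          0          0   PCI-MSI-edge      HDA Intel
 33:     186568       1057   PCI-MSI-edge      nvidia
ERR:          0
"""
print(shared_irqs(sample))
```

In the listing above only the USB controllers share lines (IRQs 20, 21 and 22), while ahci (29) and nvidia (33) each have their own MSI vector.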

stecklum (stecklum) wrote :

Well, it would have been nice, but the disk freezes continued, especially when resuming from hibernation while on battery. Meanwhile I am using 2.6.32-21-generic with the same behavior. However, only today did I notice the "libata.force=noncq" hint, and it seems this kernel argument has finally solved the issue. My M1330 has a SAMSUNG HM500LI hard disk, so it might be better to blacklist it in libata-core.c. In case the disk times out again I'll drop a note, but I hope that won't be necessary. Too bad it took so long...

Fred (eldmannen+launchpad) wrote :

Try to update 'pm-utils-powersave-policy' from lucid-proposed repository.

Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Tim Gardner (timg-tpi)
tags: removed: intrepid jaunty karmic regression-release