HDD freezes caused by ata exception that results in soft resetting of link
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) | Expired | Undecided | Unassigned | | |
Bug Description
Under even moderately heavy disk writes, I am seeing exceptions like the ones below in my kern.log:
-------
Jun 13 13:33:03 cellar kernel: [66188.434868] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Jun 13 13:33:03 cellar kernel: [66188.434874] ata4.01: BMDMA stat 0x46
Jun 13 13:33:03 cellar kernel: [66188.434879] ata4.01: failed command: WRITE DMA EXT
Jun 13 13:33:03 cellar kernel: [66188.434886] ata4.01: cmd 35/00:00:
Jun 13 13:33:03 cellar kernel: [66188.434888] res 51/84:01:
Jun 13 13:33:03 cellar kernel: [66188.434892] ata4.01: status: { DRDY ERR }
Jun 13 13:33:03 cellar kernel: [66188.434895] ata4.01: error: { ICRC ABRT }
Jun 13 13:33:03 cellar kernel: [66188.434907] ata4: soft resetting link
Jun 13 13:33:03 cellar kernel: [66188.622000] ata4.01: configured for UDMA/100
Jun 13 13:33:03 cellar kernel: [66188.622013] ata4: EH complete
-------
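For anyone reading the log above: the first two hex bytes of the "res" field are the ATA status and error registers, and the "status:" and "error:" lines are the kernel's decoding of those bytes. The following is a minimal sketch of that decoding, not the kernel's actual code. The bit masks are assumed to match the standard ATA register layout, so please check them against your kernel's headers. The names `decode` and `decode_res` are made up for this illustration.

```python
# Bit masks for the ATA status and error registers.
# Assumption: these follow the standard ATA register layout and the
# names libata prints; verify against your kernel's headers.
STATUS_BITS = [(0x80, "BUSY"), (0x40, "DRDY"), (0x20, "DF"),
               (0x08, "DRQ"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"),
              (0x04, "ABRT"), (0x01, "AMNF")]


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def decode_res(res_field):
    """Decode the leading 'SS/EE' bytes of a 'res' field such as '51/84:01:'.

    Returns a (status_flags, error_flags) tuple of name lists.
    """
    status_hex, rest = res_field.split("/", 1)
    error_hex = rest.split(":", 1)[0]
    return (decode(int(status_hex, 16), STATUS_BITS),
            decode(int(error_hex, 16), ERROR_BITS))


if __name__ == "__main__":
    # The two 'res' values that appear in this report (both truncated in the logs).
    print(decode_res("51/84:01:"))
    print(decode_res("40/00:00:"))
```

With the values from this report, 51/84 decodes to DRDY ERR and ICRC ABRT, matching the "status:" and "error:" lines in the log. ICRC is the interface CRC error bit, so the drive is reporting that the data arrived corrupted over the link rather than a media error.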
This is with the latest stable lucid kernel (2.6.32-22-generic #36-Ubuntu).
I've also tried a mainline kernel (2.6.35-020635rc1) & still get the same errors except that there's an additional stack trace:
-------
Jun 14 18:55:40 cellar kernel: [ 152.874172] irq 19: nobody cared (try booting with the "irqpoll" option)
Jun 14 18:55:40 cellar kernel: [ 152.874182] Pid: 0, comm: swapper Tainted: P 2.6.35-
Jun 14 18:55:40 cellar kernel: [ 152.874185] Call Trace:
Jun 14 18:55:40 cellar kernel: [ 152.874198] [<c01a58cc>] __report_
Jun 14 18:55:40 cellar kernel: [ 152.874204] [<c016fee3>] ? sched_clock_
Jun 14 18:55:40 cellar kernel: [ 152.874209] [<c01a5a44>] note_interrupt+
Jun 14 18:55:40 cellar kernel: [ 152.874214] [<c0179da0>] ? tick_nohz_
Jun 14 18:55:40 cellar kernel: [ 152.874219] [<c01a6364>] handle_
Jun 14 18:55:40 cellar kernel: [ 152.874224] [<c0104abf>] handle_
Jun 14 18:55:40 cellar kernel: [ 152.874230] [<c05afefb>] do_IRQ+0x4b/0xc0
Jun 14 18:55:40 cellar kernel: [ 152.874234] [<c01032f0>] common_
Jun 14 18:55:40 cellar kernel: [ 152.874239] [<c010a3a7>] ? mwait_idle+
Jun 14 18:55:40 cellar kernel: [ 152.874243] [<c010189c>] cpu_idle+0x8c/0xc0
Jun 14 18:55:40 cellar kernel: [ 152.874249] [<c05a4337>] start_secondary
Jun 14 18:55:40 cellar kernel: [ 152.874252] handlers:
Jun 14 18:55:40 cellar kernel: [ 152.874254] [<c0431060>] (ata_bmdma_
Jun 14 18:55:40 cellar kernel: [ 152.874261] [<c044fb10>] (usb_hcd_
Jun 14 18:55:40 cellar kernel: [ 152.874268] Disabling IRQ #19
Jun 14 18:56:09 cellar kernel: [ 181.856015] ata4: lost interrupt (Status 0x51)
Jun 14 18:56:09 cellar kernel: [ 181.856034] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 14 18:56:09 cellar kernel: [ 181.856039] ata4.01: BMDMA stat 0x46, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0
Jun 14 18:56:09 cellar kernel: [ 181.856045] ata4.01: failed command: WRITE DMA EXT
Jun 14 18:56:09 cellar kernel: [ 181.856053] ata4.01: cmd 35/00:00:
Jun 14 18:56:09 cellar kernel: [ 181.856054] res 40/00:00:
Jun 14 18:56:09 cellar kernel: [ 181.856058] ata4.01: status: { DRDY }
Jun 14 18:56:09 cellar kernel: [ 181.856072] ata4: soft resetting link
Jun 14 18:56:09 cellar kernel: [ 182.160065] ata4.01: configured for UDMA/133
Jun 14 18:56:09 cellar kernel: [ 182.160072] ata4.01: device reported invalid CHS sector 0
Jun 14 18:56:09 cellar kernel: [ 182.160080] ata4: EH complete
-------
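The trace above lists two handlers on IRQ 19 (one ATA, one USB), which suggests the disk controller and a USB host controller share that interrupt line. A quick way to check for shared lines is to look at /proc/interrupts. The sketch below is only a heuristic parser: it assumes handlers sharing a line are shown as a comma-separated list at the end of the row, and the driver names in the sample text are hypothetical, not taken from this machine.

```python
def shared_irqs(interrupts_text):
    """Return {irq_number: [handler, ...]} for IRQ lines with several handlers.

    Heuristic: handlers sharing a line are assumed to appear as a
    comma-separated list at the end of each /proc/interrupts row.
    """
    shared = {}
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.partition(":")
        irq = irq.strip()
        if not sep or not irq.isdigit():
            continue  # header row, or named rows such as NMI
        parts = [p.strip() for p in rest.split(",")]
        first_tokens = parts[0].split()
        if len(parts) < 2 or not first_tokens:
            continue  # only one handler on this line
        # The first item still has the counters and chip name in front of it,
        # so keep only its last whitespace-separated token.
        parts[0] = first_tokens[-1]
        shared[int(irq)] = parts
    return shared


# Hypothetical sample; the driver names are placeholders for illustration.
SAMPLE = """\
           CPU0       CPU1
 16:        120          0   IO-APIC-fasteoi   HDA Intel
 19:      51234          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb3
NMI:          0          0   Non-maskable interrupts
"""

if __name__ == "__main__":
    print(shared_irqs(SAMPLE))
```

On a real system, pass in the contents of /proc/interrupts instead of the sample text, for example `shared_irqs(open("/proc/interrupts").read())`.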
I've tried booting with "libata.
I didn't see these errors in Jaunty. I think they started sometime in Karmic. I upgraded to Lucid hoping that the newer release had fixed it, but it made no difference.
I think I've ruled out HDD failure. I get these errors on 2 old (3+ years) Seagate 7200.10 disks as well as a brand new Seagate 7200.12 disk.
There are similar bug reports in launchpad but one difference that I noticed is that I consistently see the message "failed command: WRITE DMA EXT" while the other reports fail during a read or some other command.
I can very reliably reproduce the errors by running an rdiff-backup 'restore' operation from an external USB HDD.
== Steps to reproduce ==
1. Boot into Gnome & log in
2. Run 'tail -f /var/log/kern.log' in one terminal window
3. Run 'rdiff-backup --force -r now /media/
Within a few seconds, I can see the errors show up in the kernel logs.
Running a fast torrent download will do the trick too.
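Instead of watching the tail output by eye, the failures can be counted per port and per command, which also makes it easier to compare against the other reports that fail on reads. This is a rough sketch that assumes the log lines look like the ones quoted in this report. The function name `tally_failed_commands` is made up for this example.

```python
import re
from collections import Counter

# Matches e.g. "ata4.01: failed command: WRITE DMA EXT"
FAILED = re.compile(r"(ata\d+(?:\.\d+)?): failed command: (.+)$")


def tally_failed_commands(lines):
    """Count 'failed command' log lines, keyed by (port, command)."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line.rstrip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the logs in this report.
    sample = [
        "Jun 13 13:33:03 cellar kernel: [66188.434879] ata4.01: failed command: WRITE DMA EXT",
        "Jun 14 18:56:09 cellar kernel: [ 181.856045] ata4.01: failed command: WRITE DMA EXT",
    ]
    for (port, command), count in tally_failed_commands(sample).items():
        print(port, command, count)
```

To run it against the real log, pass an open file instead of the sample list, for example `tally_failed_commands(open("/var/log/kern.log"))`.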
Since I can reproduce the problem so easily, I'll be very willing to try any special kernel builds to help solve this one.
ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-
Regression: Yes
Reproducible: Yes
ProcVersionSign
Uname: Linux 2.6.32-22-generic i686
NonfreeKernelMo
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Architecture: i386
AudioDevicesInUse:
USER PID ACCESS COMMAND
/dev/snd/
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
Card hw:0 'Intel'/'HDA Intel at 0xf9ffc000 irq 16'
Mixer name : 'Realtek ALC662 rev1'
Components : 'HDA:10ec0662,
Controls : 36
Simple ctrls : 19
Date: Mon Jun 14 19:23:00 2010
HibernationDevice: RESUME=
IwConfig:
lo no wireless extensions.
eth0 no wireless extensions.
MachineType: BIOSTAR Group G31-M7 TE
ProcCmdLine: BOOT_IMAGE=
ProcEnviron:
PATH=(custom, user)
LANG=en_SG.utf8
SHELL=/bin/bash
RelatedPackageV
RfKill:
SourcePackage: linux
dmi.bios.date: 04/10/2009
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 080014
dmi.board.
dmi.board.name: G31-M7 TE
dmi.board.vendor: BIOSTAR Group
dmi.chassis.
dmi.chassis.type: 3
dmi.chassis.vendor: BIOSTAR Group
dmi.modalias: dmi:bvnAmerican
dmi.product.name: G31-M7 TE
dmi.sys.vendor: BIOSTAR Group
Changed in linux (Ubuntu): | |
status: | Incomplete → Triaged |
tags: | added: kernel-core kernel-needs-review |
tags: | added: kernel-candidate kernel-reviewed removed: kernel-needs-review |
tags: | removed: kernel-candidate |
tags: | added: kernel-candidate |
tags: | removed: kernel-candidate |
Hi Deepak,
If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.
Thanks in advance.
[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]