HDD freezes caused by ata exception that results in soft resetting of link

Bug #593635 reported by Deepak Sarda
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Under even moderately heavy disk writes, I am seeing exceptions like the following in my kern.log:
-----------------------------------------------
Jun 13 13:33:03 cellar kernel: [66188.434868] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Jun 13 13:33:03 cellar kernel: [66188.434874] ata4.01: BMDMA stat 0x46
Jun 13 13:33:03 cellar kernel: [66188.434879] ata4.01: failed command: WRITE DMA EXT
Jun 13 13:33:03 cellar kernel: [66188.434886] ata4.01: cmd 35/00:00:00:94:b2/00:04:13:00:00/f0 tag 0 dma 524288 out
Jun 13 13:33:03 cellar kernel: [66188.434888] res 51/84:01:ff:95:b2/84:02:13:00:00/f0 Emask 0x30 (host bus error)
Jun 13 13:33:03 cellar kernel: [66188.434892] ata4.01: status: { DRDY ERR }
Jun 13 13:33:03 cellar kernel: [66188.434895] ata4.01: error: { ICRC ABRT }
Jun 13 13:33:03 cellar kernel: [66188.434907] ata4: soft resetting link
Jun 13 13:33:03 cellar kernel: [66188.622000] ata4.01: configured for UDMA/100
Jun 13 13:33:03 cellar kernel: [66188.622013] ata4: EH complete
----------------------------------------------
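
As a reading aid for the "cmd" and "res" lines above, here is a minimal Python sketch that decodes them. It assumes the dump is laid out as command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, that the first two bytes of "res" are the status and error registers, and that sectors are 512 bytes. The function names are made up for illustration and are not part of any kernel tool.

```python
# Sketch only: decode libata "cmd"/"res" taskfile dumps.
# Assumed field order:
#   command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# For "res" lines the first byte is assumed to be status, the second error.

SECTOR_SIZE = 512  # bytes per logical sector (assumption)

# Subset of ATA status and error register bits, highest bit first.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def parse_taskfile(dump):
    """Split a dump such as '35/00:00:00:94:b2/00:04:13:00:00/f0' into fields."""
    first, low, high, device = dump.split("/")
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    return {
        "first": int(first, 16),      # command (cmd) or status (res)
        "feature": feature,           # feature (cmd) or error (res)
        "sectors": (hob_nsect << 8) | nsect,
        "lba": (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
               | (lbah << 16) | (lbam << 8) | lbal,
        "device": int(device, 16),
    }


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table.items() if value & mask]


cmd = parse_taskfile("35/00:00:00:94:b2/00:04:13:00:00/f0")
res = parse_taskfile("51/84:01:ff:95:b2/84:02:13:00:00/f0")
print(hex(cmd["first"]), cmd["sectors"], cmd["sectors"] * SECTOR_SIZE, hex(cmd["lba"]))
print(decode_bits(res["first"], STATUS_BITS), decode_bits(res["feature"], ERROR_BITS))
```

Under these assumptions the failed command is a 1024-sector write (1024 x 512 = 524288 bytes, matching "dma 524288 out"), and the result bytes 0x51/0x84 decode to the same DRDY ERR and ICRC ABRT flags the kernel prints.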

This is with the latest stable lucid kernel (2.6.32-22-generic #36-Ubuntu).

I've also tried a mainline kernel (2.6.35-020635rc1) and still get the same errors, except that there is an additional stack trace:

-----------------------------------------------

Jun 14 18:55:40 cellar kernel: [ 152.874172] irq 19: nobody cared (try booting with the "irqpoll" option)
Jun 14 18:55:40 cellar kernel: [ 152.874182] Pid: 0, comm: swapper Tainted: P 2.6.35-020635rc1-generic #020635rc1
Jun 14 18:55:40 cellar kernel: [ 152.874185] Call Trace:
Jun 14 18:55:40 cellar kernel: [ 152.874198] [<c01a58cc>] __report_bad_irq+0x2c/0x90
Jun 14 18:55:40 cellar kernel: [ 152.874204] [<c016fee3>] ? sched_clock_tick+0x73/0xa0
Jun 14 18:55:40 cellar kernel: [ 152.874209] [<c01a5a44>] note_interrupt+0xe4/0x120
Jun 14 18:55:40 cellar kernel: [ 152.874214] [<c0179da0>] ? tick_nohz_update_jiffies+0x60/0x70
Jun 14 18:55:40 cellar kernel: [ 152.874219] [<c01a6364>] handle_fasteoi_irq+0x84/0xe0
Jun 14 18:55:40 cellar kernel: [ 152.874224] [<c0104abf>] handle_irq+0x1f/0x30
Jun 14 18:55:40 cellar kernel: [ 152.874230] [<c05afefb>] do_IRQ+0x4b/0xc0
Jun 14 18:55:40 cellar kernel: [ 152.874234] [<c01032f0>] common_interrupt+0x30/0x40
Jun 14 18:55:40 cellar kernel: [ 152.874239] [<c010a3a7>] ? mwait_idle+0x57/0xa0
Jun 14 18:55:40 cellar kernel: [ 152.874243] [<c010189c>] cpu_idle+0x8c/0xc0
Jun 14 18:55:40 cellar kernel: [ 152.874249] [<c05a4337>] start_secondary+0xf7/0x130
Jun 14 18:55:40 cellar kernel: [ 152.874252] handlers:
Jun 14 18:55:40 cellar kernel: [ 152.874254] [<c0431060>] (ata_bmdma_interrupt+0x0/0x190)
Jun 14 18:55:40 cellar kernel: [ 152.874261] [<c044fb10>] (usb_hcd_irq+0x0/0x90)
Jun 14 18:55:40 cellar kernel: [ 152.874268] Disabling IRQ #19
Jun 14 18:56:09 cellar kernel: [ 181.856015] ata4: lost interrupt (Status 0x51)
Jun 14 18:56:09 cellar kernel: [ 181.856034] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 14 18:56:09 cellar kernel: [ 181.856039] ata4.01: BMDMA stat 0x46, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0
Jun 14 18:56:09 cellar kernel: [ 181.856045] ata4.01: failed command: WRITE DMA EXT
Jun 14 18:56:09 cellar kernel: [ 181.856053] ata4.01: cmd 35/00:00:00:84:08/00:04:3b:00:00/f0 tag 0 dma 524288 out
Jun 14 18:56:09 cellar kernel: [ 181.856054] res 40/00:00:00:4f:c2/00:00:00:00:00/50 Emask 0x24 (host bus error)
Jun 14 18:56:09 cellar kernel: [ 181.856058] ata4.01: status: { DRDY }
Jun 14 18:56:09 cellar kernel: [ 181.856072] ata4: soft resetting link
Jun 14 18:56:09 cellar kernel: [ 182.160065] ata4.01: configured for UDMA/133
Jun 14 18:56:09 cellar kernel: [ 182.160072] ata4.01: device reported invalid CHS sector 0
Jun 14 18:56:09 cellar kernel: [ 182.160080] ata4: EH complete
--------------------------------------------------------------------
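
The "handlers:" section of the trace above lists two handlers on IRQ 19, one for ATA BMDMA and one for the USB host controller. The following Python sketch pulls that information out of such a trace. It only parses the log text and does not establish whether the shared IRQ is the cause. The function name and regular expressions are illustrative assumptions about the log layout shown here.

```python
import re

# Patterns are assumptions based on the trace format quoted in this report.
IRQ_RE = re.compile(r"irq (\d+): nobody cared")
HANDLER_RE = re.compile(r"\[<[0-9a-f]+>\] \((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\)")


def shared_irq_handlers(lines):
    """Return (irq, [handler names]) from an 'irq N: nobody cared' trace."""
    irq = None
    handlers = []
    in_handlers = False
    for line in lines:
        match = IRQ_RE.search(line)
        if match:
            # A new report starts; forget anything collected so far.
            irq = int(match.group(1))
            handlers = []
            in_handlers = False
            continue
        if line.rstrip().endswith("handlers:"):
            in_handlers = True
            continue
        if in_handlers:
            match = HANDLER_RE.search(line)
            if match:
                handlers.append(match.group(1))
            else:
                # First non-handler line ends the handler list.
                in_handlers = False
    return irq, handlers


sample = [
    'kernel: [ 152.874172] irq 19: nobody cared (try booting with the "irqpoll" option)',
    "kernel: [ 152.874185] Call Trace:",
    "kernel: [ 152.874198] [<c01a58cc>] __report_bad_irq+0x2c/0x90",
    "kernel: [ 152.874252] handlers:",
    "kernel: [ 152.874254] [<c0431060>] (ata_bmdma_interrupt+0x0/0x190)",
    "kernel: [ 152.874261] [<c044fb10>] (usb_hcd_irq+0x0/0x90)",
    "kernel: [ 152.874268] Disabling IRQ #19",
]
print(shared_irq_handlers(sample))
```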

I've tried booting with "libata.force=noncq" on both kernels (lucid stable and 2.6.35 mainline), but it makes no difference.
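
When testing a boot option like this, it can help to confirm that the option actually reached the running kernel. Below is a small Python sketch that splits a kernel command line into parameters; on a live system the string would come from /proc/cmdline. The helper name is hypothetical, and the first example string is the ProcCmdLine from this report with the option appended for illustration.

```python
def kernel_params(cmdline):
    """Map each space-separated boot parameter to its value (None for bare flags)."""
    params = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "root=UUID=..." keeps its full value.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Command line from this report, with the option appended for illustration.
with_option = kernel_params(
    "BOOT_IMAGE=/vmlinuz-2.6.32-22-generic "
    "root=UUID=466535ad-0b59-4fd0-b18b-ba486150f91a ro quiet splash "
    "libata.force=noncq"
)
# Command line exactly as recorded in the report (option not present).
as_reported = kernel_params(
    "BOOT_IMAGE=/vmlinuz-2.6.32-22-generic "
    "root=UUID=466535ad-0b59-4fd0-b18b-ba486150f91a ro quiet splash"
)
print(with_option.get("libata.force"), "libata.force" in as_reported)
```

To check the running system instead, pass the contents of /proc/cmdline to `kernel_params`.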

I didn't see these errors in Jaunty. I think they started sometime in Karmic. I upgraded to Lucid in the hope that the newer release had fixed it, but it made no difference.

I think I've ruled out HDD failure. I get these errors on 2 old (3+ years) Seagate 7200.10 disks as well as a brand new Seagate 7200.12 disk.

There are similar bug reports in Launchpad, but one difference I noticed is that I consistently see the message "failed command: WRITE DMA EXT", while the other reports fail during a read or some other command.

I can very reliably reproduce the errors by running a rdiff-backup 'restore' operation from an external USB HDD.

== Steps to reproduce ==
1. Boot into GNOME and log in
2. Run 'tail -f /var/log/kern.log' in one terminal window
3. Run 'rdiff-backup --force -r now /media/freeagent/share /share/' in another terminal

Within a few seconds, I can see the errors show up in the kernel logs.
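
As an alternative to watching 'tail -f' by eye, the error-handling lines can be picked out and counted with a short script. This is a minimal Python sketch; the regular expression covers only the message types quoted in this report, and the function name is an illustrative assumption.

```python
import re
from collections import Counter

# Matches only the libata messages that appear in this report (assumption).
ATA_EH_RE = re.compile(
    r"(ata\d+(?:\.\d+)?): "
    r"(exception|failed command|soft resetting link|lost interrupt|EH complete)"
)


def ata_events(lines):
    """Yield (port, event) for each log line that reports an ATA error-handling step."""
    for line in lines:
        match = ATA_EH_RE.search(line)
        if match:
            yield match.group(1), match.group(2)


sample = [
    "kernel: [66188.434868] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6",
    "kernel: [66188.434874] ata4.01: BMDMA stat 0x46",
    "kernel: [66188.434879] ata4.01: failed command: WRITE DMA EXT",
    "kernel: [66188.434907] ata4: soft resetting link",
    "kernel: [66188.622000] ata4.01: configured for UDMA/100",
    "kernel: [66188.622013] ata4: EH complete",
]
counts = Counter(event for _port, event in ata_events(sample))
print(counts)
```

To scan a real log, open /var/log/kern.log and pass the file object to `ata_events` in place of `sample`.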

Running a fast torrent download will do the trick too.

Since I can reproduce the problem so easily, I'll be very willing to try any special kernel builds to help solve this one.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-22-generic 2.6.32-22.36
Regression: Yes
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-22.36-generic 2.6.32.11+drm33.2
Uname: Linux 2.6.32-22-generic i686
NonfreeKernelModules: nvidia
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Architecture: i386
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: antrix 1387 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf9ffc000 irq 16'
   Mixer name : 'Realtek ALC662 rev1'
   Components : 'HDA:10ec0662,15650000,00100101'
   Controls : 36
   Simple ctrls : 19
Date: Mon Jun 14 19:23:00 2010
HibernationDevice: RESUME=UUID=c6dab799-13a8-443e-b2a3-4b93f3bbb42e
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
MachineType: BIOSTAR Group G31-M7 TE
ProcCmdLine: BOOT_IMAGE=/vmlinuz-2.6.32-22-generic root=UUID=466535ad-0b59-4fd0-b18b-ba486150f91a ro quiet splash
ProcEnviron:
 PATH=(custom, user)
 LANG=en_SG.utf8
 SHELL=/bin/bash
RelatedPackageVersions: linux-firmware 1.34
RfKill:

SourcePackage: linux
dmi.bios.date: 04/10/2009
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 080014
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: G31-M7 TE
dmi.board.vendor: BIOSTAR Group
dmi.chassis.asset.tag: None
dmi.chassis.type: 3
dmi.chassis.vendor: BIOSTAR Group
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr080014:bd04/10/2009:svnBIOSTARGroup:pnG31-M7TE:pvr:rvnBIOSTARGroup:rnG31-M7TE:rvr:cvnBIOSTARGroup:ct3:cvr:
dmi.product.name: G31-M7 TE
dmi.sys.vendor: BIOSTAR Group

Jeremy Foshee (jeremyfoshee) wrote :

Hi Deepak,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Deepak Sarda (antrix) wrote :

Like I said above, I tested with 2.6.35-020635rc1 kernel downloaded from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.35-rc1-lucid/ and got the stack trace as reported above.

Let me know if I should test against some other mainline version.

tags: removed: needs-upstream-testing
Deepak Sarda (antrix) wrote :

I installed the 2.6.31-02063112-generic kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.31.12/ and the errors are gone!

So the regression was introduced after that kernel. What/How else can I test to isolate this bug?
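
One general way to narrow down a regression between a known-good and a known-bad kernel is to binary-search the builds in between, booting each candidate and running the reproduction steps. The Python sketch below shows only the search logic. The version list is illustrative (whether a prebuilt mainline package exists for each entry is an assumption), and `is_bad` stands in for the manual boot-and-test step.

```python
def first_bad(versions, is_bad):
    """Binary-search an ordered list whose first entry is good and last entry is bad.

    Returns the earliest bad version and the list of versions that were tested.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(versions[mid])
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi], tested


# Illustrative list only: endpoints follow this report, the rest are hypothetical.
candidates = [
    "2.6.31.12",   # reported good
    "2.6.32-rc1",
    "2.6.32-rc3",
    "2.6.32-rc5",
    "2.6.32-rc7",
    "2.6.32",      # lucid's base version, reported bad
]

# Stand-in for booting a kernel and running the reproduction steps by hand.
# Here we pretend the errors first appear in 2.6.32-rc3.
pretend_bad = set(candidates[2:])
culprit, tested = first_bad(candidates, lambda v: v in pretend_bad)
print(culprit, tested)
```

With six candidates, two test boots are enough to find the first bad build in this made-up scenario. The same idea applies at commit granularity with git bisect.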

Changed in linux (Ubuntu):
status: Incomplete → Triaged
tags: added: kernel-core kernel-needs-review
Andy Whitcroft (apw)
tags: added: kernel-candidate kernel-reviewed
removed: kernel-needs-review
tags: removed: kernel-candidate
tags: added: kernel-candidate
tags: removed: kernel-candidate
Andy Whitcroft (apw) wrote :

I have a feeling that there were patches related to errors being over-reported and leading to hangs and/or errors. The fixes for those hit mainline and maverick relatively recently. I would therefore suggest you test the latest maverick kernel, which is based on v2.6.35-rc6, and report back here. Thanks.

penalvch (penalvch) wrote :

Deepak Sarda, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily kernel folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.12-rc2

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

Changed in linux (Ubuntu):
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired