Comment 96 for bug 550559

Agesp Anonymous (agespam) wrote :

Same here:

- The previous, aged Ubuntu 9.10 (Karmic Koala) worked without trouble.
- Fresh install from CD: Ubuntu 11.10 (Oneiric Ocelot)

After copying my data back from backups (from several different media, of course), the same situation can be seen:
See http://skalaria.japo.fi/HDD-errors.txt

In my case the machine hung on logout. (OK, I should not repeat everything we already know... let's search for the real culprit.)

What do the error messages tell me?

failed command: READ FPDMA QUEUED
- I understand this is the command that was sent to the HD.
- DRDY is the "Device Ready" bit of the ATA status register (much like the "data ready" signals I remember from microcontrollers). It only says the drive was ready, so nothing to it...
- "cmd" is the command that was sent to the disk controller (shown in hex there), and "res" might refer to "response" or "result" (as I see it), which is also shown translated in the syslog: "(timeout)"
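
To make that reading concrete, here is a rough Python sketch that pulls the first byte out of the "cmd" and "res" lines and translates it. The sample text is only an illustration in the usual libata layout and is NOT copied from my HDD-errors.txt; the function name decode() and both lookup tables are my own and deliberately incomplete.

```python
import re

# ATA status register bits (only the well-known ones are listed).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}

# A few ATA command opcodes; this table is deliberately incomplete.
OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}

# Illustrative sample only (assumed typical layout, not the reporter's log).
SAMPLE = """\
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/08:00:3f:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
"""


def decode(log):
    """Decode the opcode byte of 'cmd' and the status byte of 'res'."""
    cmd = re.search(r"\bcmd ([0-9a-f]{2})/", log)
    res = re.search(r"\bres ([0-9a-f]{2})/.*?Emask (0x[0-9a-f]+) \((.*?)\)", log)
    if not cmd or not res:
        return None  # not a libata error report in the assumed layout
    opcode = int(cmd.group(1), 16)
    status = int(res.group(1), 16)
    return {
        "command": OPCODES.get(opcode, "unknown (0x%02x)" % opcode),
        "status_flags": [name for bit, name in STATUS_BITS.items() if status & bit],
        "emask": res.group(2),
        "reason": res.group(3),
    }


print(decode(SAMPLE))
```

With this sample the first "cmd" byte 0x60 maps to READ FPDMA QUEUED, and the first "res" byte 0x40 is just the DRDY bit, so the only real information is the "(timeout)" at the end.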

The final "timeout", combined with what others have reported about swapping hard drives and power supplies, sounds mostly like a CONTROLLER TIMING PROBLEM (or an accompanying POWER problem) to me. That would also explain why some other HDDs work: they have slightly different timing.

And no, it is not the hardware failing, because it affects people with very different hardware specs. How is the drive actually controlled? I don't know. But since the problem arrived with the kernel update/upgrade and not with a hardware change, I'm nearly 100% sure it is the kernel code that has changed. Nothing else. Possibly it is setting up or driving the controller wrongly, or overlapping data transfers, for example sending more while the HDD is already responding. What happens then? Bits fall.

What we can do is get the kernel developers to check the code changes from around the time the problem first appeared. I can confirm that a SATA drive is affected, and a PATA drive is not. This is critical because it corrupts data.
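
Finding "when the problem first appeared" is basically a binary search over kernel builds (the idea behind git bisect). A minimal sketch of the idea follows, assuming the builds are ordered oldest to newest, the oldest is good, the newest is bad, and a build stays bad once it is bad. The build labels and the is_bad() check are pure placeholders; in reality is_bad() would mean booting that kernel and running a large copy.

```python
def first_bad(builds, is_bad):
    """Return the first build for which is_bad(build) is true.

    Assumes builds[0] is good, builds[-1] is bad, and that every
    build after the first bad one is bad as well.
    """
    lo, hi = 0, len(builds) - 1  # builds[lo] known good, builds[hi] known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # the breakage happened at mid or earlier
        else:
            lo = mid  # the breakage happened after mid
    return builds[hi]


# Placeholder data: eight hypothetical builds, with build-5 made up as the
# first bad one purely to demonstrate the search.
builds = ["build-%d" % i for i in range(8)]
HYPOTHETICAL_FIRST_BAD = 5


def is_bad(build):
    # Stand-in for "boot this kernel, copy a lot of data, check for errors".
    return int(build.split("-")[1]) >= HYPOTHETICAL_FIRST_BAD


print(first_bad(builds, is_bad))
```

With eight builds only three need to be tested instead of all of them, which matters when every test means a reboot and a long copy.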

Anyway, I would roughly say this problem has nothing to do with:
- Power supply, PATA, SATA, or other HDD drives, or replacing them (several setups)
- Copying data or accessing disk drives in general (e.g. 2 simultaneous cp commands shouldn't matter)
- The type of filesystem (ext3/ext4)
- Static electricity, asteroids or cosmic rays (a cut connection or open pins would do it, but right after an Ubuntu upgrade?)

Rather something like:
- HDD drivers (failing kernel code, modules), overlapping HDD handling code?
- The method by which the disk is accessed
- Bus or code lock-ups/hangs caused by wrong settings or power control(?)
- Intermittent extra access to the HDD during a (large) transfer
- Timing of signals, simultaneous attempts by the kernel to use the disk, or a delay added to or removed from the code

Please note this is my assumption, and it is only a ROUGH GUESS at what's going on. But we shouldn't be changing hardware because of it.

In short: one bit has changed in the kernel. Some code needs to be examined and fixed.