Repetitive massive filesystem corruption

Bug #528981 reported by Scott Testerman
This bug affects 14 people
Affects         Status        Importance  Assigned to     Milestone
linux (Ubuntu)  Fix Released  High        Andy Whitcroft
Lucid           Won't Fix     High        Unassigned

Bug Description

This problem has been ongoing since Kubuntu 9.10 but I was unable to take the time to properly diagnose the error during that product cycle (and I hoped a newer kernel would solve the problem). I can also report that the problem is reproducible in openSUSE 11.2, so this is possibly even a mainline kernel problem.

In short, I experience massive filesystem corruption on ext4. In 9.10 I could also reproduce the corruption using ext3, but I have not yet had a chance to test ext3 under 10.04. I was also able to reproduce the problem on 9.10 by mounting from Live CD and creating a filesystem on an internal hard drive. The symptoms on 10.04 are so far identical, but I haven't yet been able to perform all the testing I did under 9.10. I have attached an example fsck log from current 10.04.

Corruption is sometimes detected when booting; other times the filesystem switches to read-only while the system is running. The latter is what happened on the first boot of my current installation of 10.04. A subsequent boot left me able to perform all system updates. Then after rebooting the system the filesystem switched to read-only within a few minutes of logging in. The only cure for the read-only filesystem is to boot from CD and run e2fsck manually.

Since October, I have performed some hardware troubleshooting (many, many times):
1) Hitachi Drive Fitness Test, which always reports Disposition Code 00: no errors;
2) badblocks: always reports no bad blocks found, regardless of the blocksize, etc.

SMART reports the drive status as good, with no serious errors reported. Memtest 86+ says the RAM is good. The machine runs flawlessly under 9.04 or earlier using ext3, and NTFS produces no similar corruption under either Windows XP or Windows 7.

As I enter this bug report, I've been running with no errors for the last 30 minutes, and there's no way to predict if/when corruption will reappear. As a further troubleshooting step I have turned write caching off using the Hitachi Feature Tool; I previously did this using hdparm.conf, but IMHO write caching shouldn't be blamed for any corruption during normal system operation, but only after an unclean shutdown.
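
To check whether the drive's write cache is actually off, one option is to read the WriteCache field that "hdparm -i" prints. The sketch below is only an illustration: the helper name write_cache_state is made up, and it assumes the field appears in the form "WriteCache=enabled" or "WriteCache=disabled".

```shell
# Hypothetical helper: read `hdparm -i` output on stdin and print the
# WriteCache state ("enabled", "disabled"), or "unknown" if the field is absent.
write_cache_state() {
    state=$(sed -n 's/.*WriteCache=\([a-z]*\).*/\1/p' | head -n 1)
    echo "${state:-unknown}"
}

# Example use (needs root and a real drive, so it is left commented out):
#   sudo hdparm -i /dev/sda | write_cache_state
#   sudo hdparm -W0 /dev/sda    # ask the drive to turn its write cache off
```

Replace /dev/sda with the drive in question.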

I should add that I've seen bug reports with quite similar symptoms on 64-bit systems, but I should emphasize that I'm running 32-bit on this system, with an ICH4-M controller. The corruption problem hasn't surfaced on any of my other systems, which include both Intel and VIA controllers.

ProblemType: Bug
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Architecture: i386
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: scott 1372 F.... knotify4
                      scott 1401 F.... kmix
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'I82801DBICH4'/'Intel 82801DB-ICH4 with STAC9752,53 at irq 17'
   Mixer name : 'SigmaTel STAC9752,53'
   Components : 'AC97a:83847652'
   Controls : 38
   Simple ctrls : 24
Date: Sat Feb 27 07:45:15 2010
DistroRelease: Ubuntu 10.04
Frequency: This has only happened once.
HibernationDevice: RESUME=UUID=37bfb722-ecb9-4971-8ba0-59a0ee38cadf
InstallationMedia: Kubuntu 10.04 "Lucid Lynx" - Alpha i386 (20100225)
Lsusb:
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 002: ID 07cc:0301 Carry Computer Eng., Co., Ltd 6-in-1 Card Reader
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Gateway Gateway M350WVN
Package: linux-image-2.6.32-14-generic (not installed)
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-14-generic root=UUID=85bd25c3-6f72-4760-b47e-23436238910e ro quiet splash
ProcEnviron:
 LANGUAGE=
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.32-14.20-generic
Regression: No
RelatedPackageVersions: linux-firmware N/A
Reproducible: No
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
TestedUpstream: No
Uname: Linux 2.6.32-14-generic i686
dmi.bios.date: 04/23/2004
dmi.bios.vendor: Gateway
dmi.bios.version: 34.01.00
dmi.board.name: Gateway M350WVN
dmi.board.vendor: Gateway
dmi.board.version: Rev 1.0
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: Gateway
dmi.chassis.version: QCIPRC
dmi.modalias: dmi:bvnGateway:bvr34.01.00:bd04/23/2004:svnGateway:pnGatewayM350WVN:pvrRev1:rvnGateway:rnGatewayM350WVN:rvrRev1.0:cvnGateway:ct10:cvrQCIPRC:
dmi.product.name: Gateway M350WVN
dmi.product.version: Rev 1
dmi.sys.vendor: Gateway

Revision history for this message
Scott Testerman (scott-testerman) wrote :

Once again, the filesystem switched to read-only while the system was in use. The resulting fsck log is attached. This is using the standard 2.6.32-14-generic kernel immediately after a fresh install. Total uptime before the problem was discovered was under five minutes.

Scott Testerman (scott-testerman) wrote :

Using mainline kernel 2.6.32.8 results in the same sort of corruption, but it appears to be somewhat less pronounced. Unfortunately, the corruption also blew up the kernel itself, so the system could no longer boot. Since further blind experimentation seems pointless, I've reinstalled Kubuntu 9.04 until I receive some further suggestions regarding steps to try next.

Stuart (stuartneilson) wrote :

I have very similar symptoms following a fresh install of Ubuntu 9.10 on a Dell Inspiron 1501, which I have installed once in a single ext3 partition and subsequently in two (/ and /home) ext3 partitions, with the same symptoms both times. I have records of an orphan node cleanup in logs (similar to "[ 6.974972] EXT3-fs: INFO: recovery required on readonly filesystem." described by another user at http://ubuntuforums.org/archive/index.php/t-1180159.html). There is no record of the events leading to the filesystem becoming readonly (obviously).

Possibly related is the bug report "ext4 journal error, remounted read-only after resume", https://bugs.launchpad.net/ubuntu/+source/linux/+bug/438379.

Scott Testerman (scott-testerman) wrote :

Bug 438379 does look quite similar to my problem, but I should note that I've never suspended nor resumed before experiencing fs corruption, so suspend/resume issues may or may not be relevant.

Scott Testerman (scott-testerman) wrote :

As an additional troubleshooting step, I decided to run memtest86+ to verify that my RAM is OK, and the result was that everything checks out just fine.

Scott Testerman (scott-testerman) wrote :

I have now tested mainline kernel 2.6.33-999.201003071003 (i386) and have experienced absolutely NO filesystem corruption under it.

Rebooting from 2.6.33 to 2.6.32-16.24 resulted in the following error:

EXT4-fs error (device sda1): ext4_lookup: deleted inode referenced: 3409181
Aborting journal on device sda1-8.
EXT4-fs error (device sda1): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (sda1): Remounting filesystem read-only
EXT4-fs (sda1): Remounting filesystem read-only

This last error has occurred only once, and followed a clean shutdown. Using the Alternate CD recovery mode to run e2fsck allowed journal replay, and no further errors have been detected under 2.6.32-16.24.

Scott Testerman (scott-testerman) wrote :

More errors related to 2.6.32-16.24, but they happened after an unexplained system lockup. Unfortunately, the error also destroyed Firefox, which was the open application when the lockup occurred. Fortunately, enough of the system was still functioning that error messages were logged at the end of dmesg:

[ 262.957715] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 572: 155 blocks in bitmap, 280 in gd
[ 262.957734] Aborting journal on device sda1-8.
[ 262.960225] EXT4-fs error (device sda1): ext4_journal_start_sb: Detected aborted journal
[ 262.960240] EXT4-fs (sda1): Remounting filesystem read-only
[ 262.985668] EXT4-fs (sda1): Remounting filesystem read-only
[ 263.028132] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 576: 169 blocks in bitmap, 320 in gd
[ 263.065760] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 633: 242 blocks in bitmap, 479 in gd
[ 263.066814] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 638: 100 blocks in bitmap, 318 in gd
[ 263.077529] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 644: 276 blocks in bitmap, 503 in gd
[ 263.078601] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 649: 317 blocks in bitmap, 601 in gd
[ 263.079040] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 651: 313 blocks in bitmap, 619 in gd
[ 263.079679] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 654: 390 blocks in bitmap, 691 in gd
[ 263.079906] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 655: 347 blocks in bitmap, 624 in gd
[ 263.089608] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 661: 427 blocks in bitmap, 839 in gd
[ 263.090097] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 665: 471 blocks in bitmap, 824 in gd
[ 263.090327] EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 666: 447 blocks in bitmap, 877 in gd
[ 263.090681] EXT4-fs (sda1): delayed block allocation failed for inode 4588580 at logical offset 0 with max blocks 3534 with error -30
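
The read-only remounts in a log like the one above can be picked out mechanically. This is a rough sketch, assuming EXT4 messages formatted exactly as quoted in this report; the function name readonly_devices is invented for illustration, and EXT3 messages are not covered because they do not name the device on the remount line.

```shell
# Hypothetical helper: read dmesg text on stdin and print each device that
# EXT4 reported as remounted read-only, one per line, without duplicates.
readonly_devices() {
    sed -n 's/.*EXT4-fs (\([^)]*\)): Remounting filesystem read-only.*/\1/p' | sort -u
}

# Example use:
#   dmesg | readonly_devices
```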

Scott Testerman (scott-testerman) wrote :

And again on EXT3, following a fresh install, with all packages updated to latest available. This is running the 2.6.32-16 kernel.

[ 329.714992] EXT3-fs error (device sda1): htree_dirblock_to_tree: bad entry in directory #7192587: rec_len % 4 != 0 - offset=0, inode=539167267, rec_len=28483, name_len=112
[ 329.715006] Aborting journal on device sda1.
[ 329.716290] ext3_abort called.
[ 329.716298] EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
[ 329.716303] Remounting filesystem read-only
[ 329.826604] Remounting filesystem read-only

tags: added: karmic
Scott Testerman (scott-testerman) wrote :

Still happens under 2.6.32-17-generic #26-Ubuntu. It's now much less pronounced, but the system more frequently hard locks when the filesystem goes read-only, and so requires a power cycle to restart. This situation also leads to the requirement to run e2fsck from a CD.

Scott Testerman (scott-testerman) wrote :

Have now been running kernel 2.6.33-020633-generic for over 24 hours with no read-only events. Even better, after the system locks up due to various i855 xorg problems, on reboot the machine is experiencing no data loss at all. Seems clear the data loss issue has been resolved upstream.

Scott Testerman (scott-testerman) wrote :

Kernel 2.6.32-18-generic no longer has spontaneous corruption at all. When the system hard locks due to other problems (an xserver lockup, for instance), the system still requires running e2fsck from a CD, and can still experience significant corruption. This is distinctly different from 2.6.33, which has only very minor, understandable, errors under the same conditions.

Scott Testerman (scott-testerman) wrote :

Debian Squeeze kernel 2.6.32-3-686 and kernel 2.6.32-trunk-686 do not exhibit any corruption at all. This leads me to believe the problem is a configuration problem, and not a problem inherent to the 2.6.32 kernel itself.

Scott Testerman (scott-testerman) wrote :

Still running Debian Squeeze with kernel 2.6.32-3-686, and still no corruption. The Debian configuration appears to be using CONFIG_IDE instead of CONFIG_ATA, since the hard drive is reported as hda rather than sda.

I've found an upstream bug report / flame war that may or may not provide any useful information, indicating that some users may have experienced this problem as far back as kernel 2.6.28 (although that Ubuntu kernel still works with no corruption on my system). Read at risk of your own sanity: https://bugzilla.kernel.org/show_bug.cgi?id=13365

tags: removed: needs-upstream-testing
Jeremy Foshee (jeremyfoshee) wrote :

Hi Scott,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Scott Testerman (scott-testerman) wrote :

Thanks Jeremy!

By "latest upstream," are you referring to 2.6.33.2-lucid (the series I've already tested with success) or 2.6.34-rc5-lucid (which I haven't tested at all)?

Please allow a few days for me to perform the testing and I'll report back here.

Stuart (stuartneilson) wrote :

I am still having this issue in a fresh install of the release candidate of Ubuntu 10.04, using the default kernel. Both the root partition and separate home partition are remounted readonly, so pretty much everything is disabled (including the ability to save any log data).

Scott Testerman (scott-testerman) wrote :

Thanks for your update, Stuart. In trying to save log files when the filesystem goes read-only, I've found it helpful to use either a USB key or SD Card, which can be mounted even when the boot filesystem has gone read-only. In most cases when testing for this problem, I even mount an SD Card as soon as my system boots so I can immediately capture the data when the (inevitable) corruption happens. You can then do something like "dmesg > /mnt/dmesg.txt" (where /mnt is wherever the card or key is mounted, not the /dev/sdc1 device node itself) to capture data about the event, unmount the card or key, reboot and submit an update to this bug report.
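
A minimal sketch of that capture step, assuming the card or key is already mounted somewhere writable such as /mnt. The helper name save_log and the timestamped file name are illustrative choices, not anything Ubuntu provides.

```shell
# Hypothetical helper: write whatever arrives on stdin to a timestamped file
# in DIR (the directory where the card or key is mounted), flush it to the
# media, and print the path of the file that was written.
save_log() {
    dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    out="$dir/dmesg-$(date +%Y%m%d-%H%M%S).txt"
    cat > "$out" && sync && echo "$out"
}

# Example use:
#   dmesg | save_log /mnt
```

The sync is there so the data reaches the removable media before it is unmounted or the machine is power cycled.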

You mentioned the RC; I'm guessing you're using kernel 2.6.32-21. We've been asked to also try using the "latest" upstream kernel. I've already tried the 2.6.33 series, which seems to solve the problem, but 2.6.34 is also available. Can you please help by getting an upstream kernel (either 2.6.33.3-lucid or 2.6.34-rc5-lucid) from here: http://kernel.ubuntu.com/~kernel-ppa/mainline . Full instructions and complete information about the process are at https://wiki.ubuntu.com/KernelMainlineBuilds .

You have my personal thanks for continuing to help on this problem, since this corruption problem is extremely frustrating!

Stuart (stuartneilson) wrote :

I have tried inserting a USB memory stick, but it seems that the system is unable to mount it once the root partition has been remounted readonly. Is there any command to remount the filesystem rw, at my own risk?

I also collected some possibly duplicate launchpad bugs here http://www.iol.ie/~stuartneilson/Bootup_fsck.html

and I will try the upstream kernel later on today.

Scott Testerman (scott-testerman) wrote :

I would strongly advise against trying to force your primary filesystem to remount writeable, especially since it may introduce additional corruption that will make this bug more difficult to troubleshoot. Filesystem bugs like this one are quite difficult to troubleshoot because, for instance, this bug is appearing on a very common hard drive controller and chipset, but it appears on a very small subset of the total number of devices with that chipset.

You can manually mount your USB stick using something like "sudo mount -t vfat /dev/sdc1 /mnt" (replacing /dev/sdc1 with the correct location for your particular setup). You can find out where your system sees your USB filesystem by first using "sudo fdisk -l". There's a more comprehensive wiki page available in Ubuntu Community Help: https://help.ubuntu.com/community/Mount/USB .

Also, thanks for the work on the possible duplicates. That list may help the developers, although I note there are problems reported with Ubuntu 9.04, which never caused problems for me.

Ben Regenspan (bregenspan) wrote :

I'm experiencing what appears to be the same issue, on a Dell XPS M1330. So far the problem appears to be fixed running http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.34-rc6-lucid/

Stuart (stuartneilson) wrote :

This message clearly does not apply directly to all subscribers to this bug, because some have NVidia graphics. However I noticed that I had an error in dmesg "[drm:rs400_gart_adjust_size] *ERROR* Forcing to 32M GART size (because of ASIC bug ?)" and that this is solved in Bug #562843 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/562843

The solution to Bug #562843 is to add "radeon.modeset=0" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then run sudo update-grub2.
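
For anyone who prefers not to edit the file by hand, the change could be scripted roughly as follows. This is a sketch only: the function name add_kernel_param is made up, it relies on GNU sed's -i option, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is a single double-quoted string and that the parameter contains no "/" or "&" characters.

```shell
# Hypothetical helper: append PARAM to the GRUB_CMDLINE_LINUX_DEFAULT line
# in FILE unless it is already there. Leaves the file alone if the line is missing.
add_kernel_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=\".*$param" "$file"; then
        return 0
    fi
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"/" "$file"
}

# Example use (as root, after backing up the file):
#   add_kernel_param /etc/default/grub radeon.modeset=0
#   update-grub2
```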

I have repeatedly suspended and resumed, by closing the lid and from the menu, without a single failure to resume since applying this boot option. (That may be just luck - I will post when my system freezes or fails to resume).

Scott Testerman (scott-testerman) wrote :

Running kernel 2.6.34-020634rc6-generic (mainline 2.6.34-rc6-lucid) I am experiencing exactly the same behavior as I previously reported in comment #7. It seems clear that the problem was fixed in 2.6.33.

tags: removed: needs-upstream-testing
Stuart (stuartneilson) wrote :

The grub option "radeon.modeset=0" does not prevent X freezes. My system appears to last much longer between freezes / failures to resume, but still does so.

I can repeatably cause the machine to freeze by mounting cifs locations - the computer will freeze shortly into use after the next resume following the mount operation.

High system load, e.g. generating md5 checksums on a CD image while copying large files, will also cause an X freeze.

In all cases the keyboard and mouse appear unresponsive and Caps Lock does not alter the LED state, but Alt+SysRq REISUB will reboot.

Stuart (stuartneilson) wrote :

Bugzilla bug #14543 describes a very similar issue involving ata flushing which is RESOLVED and available in patch "libata: retry failed FLUSH if device didn't fail it" which is applied from kernel 2.6.33-rc1
http://mirror.celinuxforum.org/gitstat//commit-detail.php?commit=6013efd8860bf15c1f86f365332642cfe557152f

The bug report https://bugzilla.kernel.org/show_bug.cgi?id=14543 also has a script that repeatably creates the read-only effect, but I cannot get this script to create the effect here. Can anyone create a script that reliably replicates this fault?

Comment #7 From Andrey Vihrov 2009-11-05 09:43:18 -------

Created an attachment (id=23658) [details]
dmesg with ext4 failure

Yes. I was able to trigger it by running two scripts in parallel:

while true; do
    echo > test &&
    sync &&
    rm test &&
    sync ||
    break
done

and

while true; do
    for FILE in /sys/class/scsi_host/*/link_power_management_policy; do
        echo "min_power" > "${FILE}" && echo "max_performance" > "${FILE}"
    done &&
    sleep 1
done

Scott Testerman (scott-testerman) wrote :

No further patches are needed regarding the reported bug. Kernel 2.6.33 has been released and does not experience the bug described in this report. Kernel 2.6.34 is also released (and queued for inclusion in Maverick) and does not experience the bug described in this report.

You can use the mainline 2.6.33 kernel (which is now in its fourth point-revision) by downloading the appropriate files here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.33.4-lucid/

You can use the mainline 2.6.34 kernel (which has had no point-revisions) by downloading the appropriate files here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.34-lucid/

For instructions on how to determine which files you need and how to install them, visit:
https://wiki.ubuntu.com/KernelTeam/MainlineBuilds

These are current as of the time I'm entering this comment, but you may wish to check the Mainline Kernel PPA for newer point-revisions so you always have the latest mainline kernel.
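
To confirm which side of the 2.6.33 line a machine is booted into, a small version check can help. The sketch below is illustrative only: the name kernel_at_least is invented, it depends on "sort -V" from GNU coreutils, and a version number by itself does not prove the fix is present or absent in a distribution kernel (the Debian 2.6.32 kernels mentioned earlier showed no corruption).

```shell
# Hypothetical helper: succeed if kernel release $2 (default: the running
# kernel from `uname -r`) is at least version $1.
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}
    have=${have%%-*}          # drop a suffix such as "-14-generic"
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Example use:
#   if kernel_at_least 2.6.33; then
#       echo "running 2.6.33 or later"
#   else
#       echo "running an older kernel: $(uname -r)"
#   fi
```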

chastell (chastell) wrote :

Thanks a lot, Scott, for your thorough summary.

Is there a good way to install Lucid with the mainline kernels without ever booting with kernels affected by this issue, or would this require a custom install image (with one of the mainline kernels on the installation medium, and one being able to install the mainline kernel before rebooting into the installed system)?

Scott Testerman (scott-testerman) wrote :

Everyone is welcome to the fruits of my pain with this problem. I was hoping to get the problem solved before Lucid so I could use a completely standard Lucid install, but my hopes were dashed and I have to target Maverick instead.

The current stock Lucid kernel is just stable enough on my system that I am able to perform an installation as follows:

1) Download the mainline kernel files (you will need 3 for your system) and put them on some kind of media that you can access without a graphical system.

2) Use the Alternate Install CD to install a Command Line Only system.

3) Reboot after installation, log in to your new system, and IMMEDIATELY mount the media with your mainline kernel. The easiest way to do this is something like "sudo mount /dev/sdc1 /mnt" (and replace sdc1 with the location of your media).

4) Install the mainline kernel: "cd /mnt" and then "sudo dpkg -i *.deb"

5) Reboot the system.

5a) If you know how to do it, edit your /etc/apt/sources.list to enable the CD-ROM (or use apt-cdrom). Otherwise, you need a decently fast Internet connection from here on.

6) Now install the full system, using whatever method you prefer. Tasksel has been bombing out on me, so I would recommend either "sudo aptitude install kubuntu-desktop" or "sudo apt-get install kubuntu-desktop". Replace kubuntu-desktop with the variant of Ubuntu you prefer.

7) The mainline kernel will automatically be the first kernel that GRUB sees, so you will always boot into mainline unless you hold down the Shift key at boot time and manually select another kernel.
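
Steps 3 and 4 above could be wrapped in a small guard that refuses to install unless the expected packages are actually on the media. This is a sketch under assumptions: the function name install_mainline_debs and the DRY_RUN variable are made up for illustration, and the threshold of three files simply mirrors step 1.

```shell
# Hypothetical helper: install every .deb found directly in DIR, refusing to
# run if fewer than three are present. With DRY_RUN=1 it only prints the command.
install_mainline_debs() {
    dir=$1
    count=$(( $(find "$dir" -maxdepth 1 -name '*.deb' | wc -l) ))
    if [ "$count" -lt 3 ]; then
        echo "expected at least 3 .deb files in $dir, found $count" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "dpkg -i $dir/*.deb"
    else
        dpkg -i "$dir"/*.deb
    fi
}

# Example use (as root, with the media mounted at /mnt):
#   install_mainline_debs /mnt
```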

This method is not perfect, but it DOES give me a working Lucid system. I had to skip Karmic completely because the mainline kernels at the time didn't work for me. The biggest drawback is that this method pretty much requires a fast Internet connection, so if you don't have one then this is not the method for you.

A piece of advice: if you get the read-only filesystem indicating corruption at any point before you boot into the mainline kernel, you should probably go ahead and reinstall rather than wasting time by trying to fix the broken installation.

You should be aware that the mainline kernel still has issues with Intel 852/855 video, so if you have this chipset you can still expect frequent hard lockups, but fortunately your entire filesystem will not be corrupted any more when that happens. Don't blame Ubuntu for these problems though, because Ubuntu appears to have about the only reasonably functioning solution of any major distro at the moment. The Intel video problem is an upstream headache, and they are beating their heads against brick walls trying to solve it. More information is in Bug 541511.

You should also be aware that the Lucid version of ndiswrapper does not work with 2.6.33 and later kernels, but this has been solved for Maverick. If you need ndiswrapper for any reason, then the mainline kernel may cause heartache for you. More information is in Bug 582555.

Dr. Burnett (cortezb3) wrote :

I am having the same (read-only file system) problem after wake from sleep triggered by opening the lid on the laptop. I have included all the useful information I gathered from logs. I will also, as previously described, attempt to capture more relevant data if/when the phenomenon occurs again.

##mount ->
/dev/sda5 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
none on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
none on /var/lib/ureadahead/debugfs type debugfs (rw,relatime)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noexec,nosuid,nodev)
/home/burnett/.Private on /home/burnett type ecryptfs (ecryptfs_sig=848fa9a5a2c8207f,ecryptfs_fnek_sig=28ea3f06089bd9f2,ecryptfs_cipher=aes,ecryptfs_key_bytes=16)
gvfs-fuse-daemon on /home/burnett/.gvfs type fuse.gvfs-fuse-daemon (rw,nosuid,nodev,user=burnett)

2.6.32-22-generic (64-bit)
Ubuntu 10.04 LTS

##hdparm -i /dev/sda ->
 Model=Hitachi, FwRev=PB4OC60F, SerialNo=091117PB4406Q7CRZLUL
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7208kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=976773168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=yes: mode=0xFE (254) WriteCache=enabled
 Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7

 * signifies the current active mode

Dr. Burnett (cortezb3) wrote :

Just saw Scott's post and I am applying it now. Will report any further problems.
Thanks.

Scott Testerman (scott-testerman) wrote :

Dr. Burnett:
Please note that this bug report does not involve a wake from sleep trigger. The corruption events reported here are spontaneous, with the system up and running.

If your filesystem is being corrupted on wake from sleep, please file a new bug report with the circumstances of the different bug you have found.

Dr. Burnett (cortezb3) wrote :

Thanks Scott. I will check to see if a bug report has already been submitted on this issue. It appears that the kernel upgrade was successful in stabilizing my system, so thanks anyway.

Scott Testerman (scott-testerman) wrote :

I would like to report that I've been using the Ubuntu kernel 2.6.34-52.1 from Brian Rogers's PPA located at

https://launchpad.net/~brian-rogers/+archive/graphics-fixes

with considerable success. For those who use an i852/855 chipset and are having problems with the Intel graphics, in addition to the filesystem corruption problem, this is worth a try.

Although the updated kernel is not 100% stable, the increase in stability over even the mainline 2.6.34 kernel is quite welcome. Now not only are video-related crashes much less frequent for me, but when the system does crash, I don't lose the entire filesystem.

Instructions for installing are at the above link, with the usual caveat that this method is completely and utterly unsupported by Ubuntu, Canonical, and probably even Brian Rogers, so beware that using the PPA kernel could result in your entire system turning into a block of semi-sentient Swiss cheese and walking away to take a Carnival cruise.

tags: added: kernel-fs kernel-needs-review
Changed in linux (Ubuntu):
importance: Undecided → High
status: Incomplete → Triaged
Andy Whitcroft (apw)
tags: added: kernel-candidate kernel-reviewed
removed: kernel-needs-review
Andy Whitcroft (apw) wrote :

Ok I have pulled back the patch which was suggested in comment #25, and built some test kernels. If anyone is able to test these and confirm whether this fix is indeed the solution that would be helpful. Of course there is significant risk it is not, so don't do it with your favourite data. If you could test the kernels at the URL below and report back here that would be helpful:

    http://people.canonical.com/~apw/lp528981-lucid/

Thanks!

Changed in linux (Ubuntu):
assignee: nobody → Andy Whitcroft (apw)
status: Triaged → Incomplete
Changed in linux (Ubuntu Lucid):
status: New → Incomplete
importance: Undecided → High
assignee: nobody → Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: Incomplete → Fix Released
Andy Whitcroft (apw) wrote :

As this issue is not seen in 2.6.34 I am marking this Fix Released for Maverick. Lucid remains open.

Stuart (stuartneilson) wrote :

Andy Whitcroft, if I understood message #34 correctly, then the kernel 2.6.32-22 contains the libata patch. I am running with this kernel now in Lucid on a laptop previously affected by this bug. I will let you know after 24 hours if it is still running.

I was running kernel 2.6.34 previously and that has been stable since May 19th, without a single freeze, resume failure, corrupted file or fsck on reboot. A version of 2.6.33 has been running the same duration on an identical laptop with Karmic, also without any failure.

Stuart (stuartneilson) wrote :

Just 50 minutes in, I suspended (to go to bed). When I then tested resume, various parts of the desktop failed (Network Manager for one) and the filesystem was read-only. Attempting to open a terminal locked it completely: the Caps Lock light did not function, but Alt+SysRq REISUB rebooted, providing the same old fsck message (with no files in /lost+found):

[ 3.133565] EXT4-fs (sda4): INFO: recovery required on readonly filesystem
[ 3.133573] EXT4-fs (sda4): write access will be enabled during recovery
[ 3.909630] EXT4-fs (sda4): orphan cleanup on readonly fs
...
[ 3.909731] EXT4-fs (sda4): 6 orphan inodes deleted
[ 3.909735] EXT4-fs (sda4): recovery complete
[ 4.515465] EXT4-fs (sda4): mounted filesystem with ordered data mode

Stuart (stuartneilson) wrote :

... and the updated kernel 2.6.32-22-generic pushed out by update today caused a freeze, after which I had to manually correct the filesystem (twice). The first run reported empty inodes in /tmp/stuart-orbit/ and now, for the first time, I have some recovered inodes:

ls -al /lost+found/
total 20
drwx------ 2 root root 16384 2010-04-29 23:25 .
drwxr-xr-x 22 root root 4096 2010-05-20 07:50 ..
srwxr-xr-x 1 stuart stuart 0 2010-06-04 12:00 #519190
srwxr-xr-x 1 stuart stuart 0 2010-06-04 12:00 #519199
srwxr-xr-x 1 stuart stuart 0 2010-06-04 12:00 #519211

So no, neither the kernel posted here last night nor the 2.6.32-22 pushed by Update functioned correctly on my system. Kernel 2.6.34 (Lucid) and 2.6.33 (Karmic) have both run from 19 May to 3 June without an error.

Revision history for this message
Grant Likely (glikely) wrote :

Hi Andy,

I'm also seeing this issue on my MacBook 2,1 with an Intel SSD. Typically it will go for long periods of time (days) without a problem before I see the failure. The failure often occurs when I switch to a different commit in my Linux-2.6 git tree. I've now switched to Brian Rodgers' linux-2.6.34-v9patch-generic kernel to see if that helps things.

I would like to run the test case shown in comment #25, but the /sys/class/scsi_host/*/link_power_management_policy control file is not present on my system. Does anyone know if there are any other known test cases to reproduce this problem?

g.
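
A quick way to see whether a given machine exposes that control file at all is to glob for it. This is a minimal sketch, assuming only the path pattern quoted above; `read_lpm_policies` is a hypothetical helper name, and an empty result just means no matching file was found, not that the bug cannot occur.

```python
import glob

# Path pattern taken from the comment above; some kernels and controllers
# may not expose it.
DEFAULT_PATTERN = "/sys/class/scsi_host/*/link_power_management_policy"


def read_lpm_policies(pattern=DEFAULT_PATTERN):
    """Map each matching control file to its current contents.

    An empty dict means no such control file exists on this system.
    """
    policies = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                policies[path] = handle.read().strip()
        except OSError as error:
            # Keep going so one unreadable host does not hide the others.
            policies[path] = "unreadable: %s" % error
    return policies


if __name__ == "__main__":
    found = read_lpm_policies()
    if not found:
        print("no link_power_management_policy files found")
    for path, value in found.items():
        print(path, "=", value)
```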

Revision history for this message
Stuart (stuartneilson) wrote :

I hope that this is helpful. My computer is running well with kernel 2.6.34 using Ubuntu Lucid 10.04. My computer freezes if I boot using kernel 2.6.32-22.

I am able to keep my computer functioning after the events causing a freeze by mounting my root filesystem using the option data=writeback instead of the default data=ordered. To mount the root filesystem with non-default options, it is necessary to modify /etc/fstab and /etc/default/grub and to run tune2fs (according to http://www.goitexpert.com/general/ubuntuguide/).
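
To confirm which journalling mode is actually in effect after making those changes, one option is to read the options column of /proc/mounts. This is a minimal sketch; `data_mode_for` is a hypothetical helper name, and depending on the kernel version the default mode may not be listed at all, in which case the function returns None.

```python
import os


def data_mode_for(mount_point, mounts_text):
    """Return the data= journalling mode listed for mount_point, or None.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        for option in fields[3].split(","):
            if option.startswith("data="):
                return option[len("data="):]
    # No entry for mount_point carried an explicit data= option.
    return None


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as handle:
            print("root data mode:", data_mode_for("/", handle.read()))
```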

I am currently posting from a computer that appears to be "half-frozen" so I can post any further information that is of interest. Thunderbird, several terminals and some inodes are locked.

Once I boot using data=writeback, a process freeze produces the following messages, and I suspect that without data=writeback the system would be frozen:

[ 672.286366] EXT4-fs warning (device sda4): dx_probe: dx entry: limit != root limit
[ 672.286375] EXT4-fs warning (device sda4): dx_probe: Corrupt dir inode 308015, running e2fsck is recommended.
[ 672.286405] BUG: unable to handle kernel paging request at f76f5006
[ 672.286411] IP: [<c029113d>] ext4_find_entry+0x14d/0x410
[ 672.286424] *pde = 00007067 *pte = 77520002
[ 672.286429] Oops: 0000 [#1] SMP
[ 672.286433] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0A:00/power_supply/BAT1/charge_full
[ 672.286439] Modules linked in: aes_i586 aes_generic binfmt_misc ppdev snd_hda_codec_idt joydev fbcon tileblit font bitblit softcursor vga16fb vgastate snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss arc4 snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer radeon snd_seq_device b43 ttm mac80211 drm_kms_helper snd cfg80211 drm i2c_algo_bit dell_wmi ati_agp soundcore sdhci_pci sdhci dell_laptop dcdbas led_class ricoh_mmc i2c_piix4 snd_page_alloc shpchp agpgart k8temp psmouse serio_raw video output lp parport b44 mii ssb pata_atiixp ahci
[ 672.286499]
[ 672.286504] Pid: 2781, comm: thunderbird-bin Not tainted (2.6.32-22-generic #36-Ubuntu) Inspiron 1501
[ 672.286509] EIP: 0060:[<c029113d>] EFLAGS: 00010216 CPU: 0
[ 672.286515] EIP is at ext4_find_entry+0x14d/0x410
[ 672.286519] EAX: f76f6000 EBX: f76f5000 ECX: 0000000c EDX: 00003000
[ 672.286523] ESI: f76f500d EDI: f6f0d580 EBP: f1a15dc4 ESP: f1a15d3c
[ 672.286527] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 672.286532] Process thunderbird-bin (pid: 2781, ti=f1a14000 task=f2ede680 task.ti=f1a14000)
[ 672.286536] Stack:
[ 672.286538] 00000000 f1a15db0 00000202 c01cc223 00000000 00003000 f6f70be0 f1a15ddc
[ 672.286546] <0> f23b78dc f68e5800 f23b78a0 00000000 00000003 f6f0d340 00000004 f6f70c68
[ 672.286554] <0> 00001000 00000005 00000005 00000003 0000000d f478f080 f6f0d580 f6f0d340
[ 672.286563] Call Trace:
[ 672.286570] [<c01cc223>] ? mempool_free_slab+0x13/0x20
[ 672.286577] [<c0291445>] ? ext4_lookup+0x45/0x100
[ 672.286582] [<c058b6fd>] ? _spin_lock+0xd/0x10
[ 672.286588] [<c021aeeb>] ? d_alloc+0x13b/0x190
[ 672.286594] [<c0210817>] ? real_lookup+0xb7/0x110
[ 672.286599] [<c0212265>] ? do_lookup+0x95/0xc0
[ 672.286605] [<c016ed2...


Andy Whitcroft (apw)
Changed in linux (Ubuntu Lucid):
assignee: Andy Whitcroft (apw) → nobody
Revision history for this message
Scott Testerman (scott-testerman) wrote :

This problem is no longer present in the default Maverick Meerkat kernel (currently 2.6.35-9-generic #14-Ubuntu SMP).

Revision history for this message
raketenman (sesselastronaut) wrote :

i have this problem with the 2.6.31-11-rt kernel:
EXT4-fs error (device sda3): ext4_lookup: deleted inode referenced

Revision history for this message
bitinerant (bitinerant) wrote :

It was noted above that kernel 2.6.33 was, in one case, 'cured' of this bug, but 2.6.33-02063305-generic on my MacBook2,1 still had file corruption (albeit after much more time than 2.6.32-23). Now that 2.6.32-24 is the default for Ubuntu 10.04, I am testing it and it has gone longer than usual (15 days now) without any problems.

Eugene San (eugenesan)
Changed in linux (Ubuntu Lucid):
status: Incomplete → Confirmed
Revision history for this message
Mark Deneen (mdeneen) wrote :

I just had this happen with 2.6.32-27-server on 10.04 server. The filesystem was a 4TB ext4 volume, and I was copying millions of small files to it.

I think that I will be using a different filesystem for this. =/

Revision history for this message
chastell (chastell) wrote :

> I think that I will be using a different filesystem for this.

Simply install a vanilla (mainline) kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline

Revision history for this message
Mark Deneen (mdeneen) wrote :

Shot:

Do I have to use one with "lucid" in the name? The most recent kernel version like that is somewhat old.

Revision history for this message
chastell (chastell) wrote :

I’m not sure; I’m using 2.6.34-020634-generic (so yeah, that old, not-RC one) and it works ok. You can also try looking for Lucid backports of newer mainline (vanilla) kernels.

Revision history for this message
Mark Deneen (mdeneen) wrote :

I'll give that a shot. I also had a similar issue when attempting the same copy with ext3.

Let's hope that the mainline kernel works for me. It would probably be a good idea to identify what changed between 2.6.32 and 2.6.34 with respect to ext4 and ext3 file systems. It can't be good to have this bug on pretty much every ubuntu-server LTS.

Revision history for this message
Mark Deneen (mdeneen) wrote :

The 2.6.34 mainline kernel didn't help. This appears to be the bug I am running into:

https://bugzilla.redhat.com/show_bug.cgi?id=626684

I'm having trouble searching the ubuntu bug database right now, so I can't tell if there is a bug report for this or not. I am building the 10.10 qemu-kvm package on my 10.04 host, so we'll see how that goes.

Revision history for this message
Mark Deneen (mdeneen) wrote :

I just wanted to note that it appears that the update to qemu 0.12.5 has fixed my problem. I will submit a bug report for this.

Revision history for this message
Tony Groff (s-launchpad-groffweb-com) wrote :

Hello, I've spent the past two days troubleshooting this issue with my Ubuntu 10.04 VirtualBox machine. I first upgraded the kernel to 2.6.35 and still had the problems. Long story short: what fixed this issue for me (running Ubuntu 10.04 (ext4) as a guest on a Windows 7 x64 host) was the following post:

Virtual disk corruption on ext4 file systems
http://tenbulls.co.uk/2010/09/06/virtualb-disk-corruption-on-ext4-file-systems/

As soon as I changed my SATA drive to "use host I/O cache" all my problems were solved and performance increased dramatically. Hope this helps someone else someday.

-Tony

Revision history for this message
Jorge Machado (machadofisher) wrote :

Dear all, I'm having the same problem in 10.04 using kernel 2.6.32-34.

I will try 2.6.34, but I would like to have a script to confirm that the problem has really disappeared: a script to crash the system.

Regarding post 25, I can't execute the second script. My Ubuntu does not have those power management files and did not let me create them.

Thanks

Revision history for this message
Daniel Lago (daniellago85) wrote :

I'm using Quantal, kernel 3.5.0-17-generic. The problem still occurs for me, using ext4. I've replaced the hard disk five times. It occurs with my 3 Western Digital Caviar Blacks (1 TB, 64 MB cache), with my Caviar Green (1.5 TB, 64 MB cache), and also (at least until now) with my old Seagate Barracuda (where it appears to occur less frequently).

badblocks reports no errors, and neither does mkfs.ext4 -cc; but fsck -vcck sometimes reports lists of bad blocks (I have no idea why). I've noticed that when I create two partitions on the same disk and use them intensively at the same time, the problem occurs within a few minutes (otherwise it is random).
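
The two-partition pattern described above can be approximated with a small write-and-verify loop. This is only a rough sketch under stated assumptions, not a known reproducer: `write_and_verify` and `stress_directories` are hypothetical names, the mount points in the usage example are placeholders, and because the read-back may be served from the page cache, corruption on disk might only show up after a remount or reboot.

```python
import hashlib
import os
import threading


def write_and_verify(directory, file_count=100, size=1 << 20):
    """Write file_count files of random data, then re-read and compare hashes.

    Returns the list of paths whose contents did not match what was written.
    """
    expected = {}
    for index in range(file_count):
        data = os.urandom(size)
        path = os.path.join(directory, "stress_%05d.bin" % index)
        with open(path, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the data towards the disk
        expected[path] = hashlib.sha256(data).hexdigest()

    mismatches = []
    for path, digest in expected.items():
        with open(path, "rb") as handle:
            if hashlib.sha256(handle.read()).hexdigest() != digest:
                mismatches.append(path)
    return mismatches


def stress_directories(directories, **kwargs):
    """Run write_and_verify on every directory at the same time."""
    results = {}

    def worker(directory):
        results[directory] = write_and_verify(directory, **kwargs)

    threads = [threading.Thread(target=worker, args=(d,)) for d in directories]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Placeholder paths: point these at directories on the two partitions.
    targets = [p for p in ("/mnt/part1", "/mnt/part2") if os.path.isdir(p)]
    for directory, bad in stress_directories(targets, file_count=10).items():
        print(directory, "mismatches:", len(bad))
```

The test files are left in place so they can be checked again after a reboot; delete them by hand when done.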

Revision history for this message
Daniel Lago (daniellago85) wrote :

It occurs more frequently using RAID0 with mdadm, and also (less frequently) using dmraid.

This problem does NOT occur using a Virtual Machine under Windows.

Revision history for this message
Rolf Leggewie (r0lf) wrote :

Lucid has reached the end of its life and is no longer receiving any updates. Marking the Lucid task for this ticket as "Won't Fix".

Changed in linux (Ubuntu Lucid):
status: Confirmed → Won't Fix