task jbd2/cciss!c0d0:258 blocked for more than 120 seconds

Bug #666828 reported by Ralf Hildebrandt
This bug affects 10 people
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  Medium      Unassigned
Lucid           Fix Released  Medium      Unassigned
Maverick        Fix Released  Medium      Unassigned
Natty           Fix Released  Medium      Unassigned

Bug Description

The machine became unreachable via ssh. Logging in via the console switch, we saw:
task jbd2/cciss!c0d0:258 blocked for more than 120 seconds

The machine was completely unresponsive, all I could do was scroll up and down. ctrl-alt-del wouldn't work.

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: linux-image-server 2.6.35.22.23
Regression: Yes
Reproducible: No
ProcVersionSignature: Ubuntu 2.6.35-22.35-server 2.6.35.4
Uname: Linux 2.6.35-22-server x86_64
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
Date: Tue Oct 26 17:03:00 2010
Frequency: Once every few days.
InstallationMedia: Ubuntu-Server 10.04.1 LTS "Lucid Lynx" - Release amd64 (20100816.2)
Lsusb:
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: HP ProLiant DL360 G4
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.35-22-server root=UUID=4cc50c45-22ab-445d-a812-bd88e377aa86 ro noquiet panic=30
ProcEnviron:
 PATH=(custom, no user)
 LANG=C
 SHELL=/bin/tcsh
SourcePackage: linux
dmi.bios.date: 04/14/2005
dmi.bios.vendor: HP
dmi.bios.version: P52
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP52:bd04/14/2005:svnHP:pnProLiantDL360G4:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL360 G4
dmi.sys.vendor: HP

Ralf Hildebrandt (ralf-hildebrandt) wrote :
John M. Carlin (thejohnny) wrote :

I'm experiencing this issue on 10.10, 2.6.35-24-server on the same hardware (HP ProLiant DL360 G4).

HansLambermont (hans-lambermont) wrote :

Same issue here: "task jbd2/cciss!c0d0 blocked for more than 120 seconds" on an HP ProLiant ML350 G3 (32-bit Xeons).
Cannot log in via console. Ubuntu 10.10.

kraucer (kraucer-gmail) wrote :

This bug also affects me:
Ubuntu Server 10.10 - kernel 2.6.35-22

Nick (lpd738) wrote :

Ubuntu Natty 2.6.38-7-server AMD64 - same issue, same hardware. Tried updating to the Alpha build just to see, but no difference.

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Ben Baumer (ben-baumer) wrote :

I think I am having the same problem. I have Ubuntu Server 10.10 (kernel 2.6.35-28) running on an HP ProLiant DL380 G4. I posted some more specifics here:

http://ubuntuforums.org/showthread.php?p=10722640#post10722640

tags: removed: regression-potential
Bernd Zeimetz (bzed) wrote :

Is anybody using arrayprobe, the hp-health tools or similar programs to monitor the raid arrays health?

Ralf Hildebrandt (ralf-hildebrandt) wrote :

This is a patch which supposedly fixes the problem.

Stefan Bader (smb) wrote :

The patch made it upstream in 2.6.39-rc1, was picked up by stable, and is included in:
- 2.6.32-32.62 (10.04 Lucid)
- 2.6.35-29.51 (10.10 Maverick) [this is currently in proposed only]
- 2.6.38-9.43 (11.04 Natty) [this is currently in proposed only]
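
To check a given kernel package against the versions listed above, a minimal Python sketch follows. The `FIXED_IN` table and the `version_key` and `has_fix` helpers are hypothetical names introduced for illustration. The comparison treats a version such as `2.6.35-29.51` as a tuple of integers, which is a simplification of dpkg's real version ordering and assumes the plain `upstream-abi.upload` form.

```python
import re

# Fixed package versions quoted in the comment above, keyed by kernel series.
FIXED_IN = {
    "2.6.32": "2.6.32-32.62",  # 10.04 Lucid
    "2.6.35": "2.6.35-29.51",  # 10.10 Maverick
    "2.6.38": "2.6.38-9.43",   # 11.04 Natty
}


def version_key(version):
    """Split a version such as '2.6.35-29.51' into a tuple of integers."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def has_fix(package_version):
    """Return True or False for a listed series, or None for an unlisted one."""
    series = package_version.split("-", 1)[0]
    fixed = FIXED_IN.get(series)
    if fixed is None:
        return None  # series not covered by the list above
    return version_key(package_version) >= version_key(fixed)
```

For the kernel in the original report (`2.6.35-22.35`), `has_fix` returns `False`. Series not in the list, such as the later 3.x kernels mentioned in this thread, return `None` rather than a guess.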

Changed in linux (Ubuntu Lucid):
importance: Undecided → Medium
status: New → Fix Released
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Fix Released
Changed in linux (Ubuntu Maverick):
importance: Undecided → Medium
status: New → Fix Committed
Changed in linux (Ubuntu Natty):
importance: Undecided → Medium
status: New → Fix Committed
Julian Wiedmann (jwiedmann) wrote :

For Maverick, this should be fixed now with 2.6.35-30.54.
Please reopen the Maverick task if you still experience this issue.

Changed in linux (Ubuntu Maverick):
status: Fix Committed → Fix Released
Ben Baumer (ben-baumer) wrote :

For me, upgrading the kernel to 2.6.38-10 did not solve the problem, but upgrading the SmartArray firmware as suggested in the Red Hat forum listed above did work.

Joshua Ebarvia (jemate18) wrote :

I'm also having the same issue with this.

Why is the importance set to "medium"? I think this should be a priority. Server becomes unresponsive on a random basis. Sometimes daily, weekly, or anytime it likes to.

I'm using
- Ubuntu 11.10 Server 64bit
- Linux MyServer 3.0.0-12-server #20-Ubuntu SMP Fri Oct 7 16:36:30 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
- HP Proliant ML 370 G3 X3.

How come kernel 3.0 is still affected?

Please confirm patch or fix release for Oneiric Servers.

Thank you very much

Stuart Longland (redhatter) wrote :

Not sure what the official status is, but I'm getting this problem with an HP ProLiant box here running Ubuntu 12.04 LTS and linux-image-3.5.0-25-generic as well as the 3.2-series kernels.

For me the machine will run fine for about a day, then everything locks up. If I'm lucky, I can SSH in, but then `dmesg` hangs. `fold -w 80 /dev/vcs1` dumps the following output:

 message.
[158400.224328] INFO: task oned:3106 blocked for more than 120 seconds.
[158400.224337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
 message.
[158400.224491] INFO: task mm_sched:3116 blocked for more than 120 seconds.
[158400.224501] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
 message.
[158520.224030] INFO: task rs:main Q:Reg:888 blocked for more than 120 seconds.
[158520.224046] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
 message.
[158520.224327] INFO: task oned:3106 blocked for more than 120 seconds.
[158520.224336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
 message.
[158520.224503] INFO: task mm_sched:3116 blocked for more than 120 seconds.
[158520.224524] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
 message.
[158640.224028] INFO: task rs:main Q:Reg:888 blocked for more than 120 seconds.
[158640.224056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
 message.
[158640.224362] INFO: task oned:3106 blocked for more than 120 seconds.
[158640.224382] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
 message.
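
A console dump like the one above can be summarized per task. The following Python sketch is an illustration and not something taken from the report: `summarize_hung_tasks` and the `HUNG_TASK` pattern are hypothetical names, and the pattern assumes the `INFO: task NAME:PID blocked for more than N seconds` wording shown in the dump.

```python
import re
from collections import Counter

# Task names may themselves contain colons (e.g. "rs:main Q:Reg"), so the
# name group is greedy and the PID is the last colon-separated number
# before " blocked".
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(text):
    """Count hung-task reports per (task name, PID) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            counts[(match.group("name"), int(match.group("pid")))] += 1
    return counts
```

Tasks that show up in every 120-second window are the ones stuck for the whole period, which helps separate the first blocked task from those that piled up behind it.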

The box previously ran reliably under VMware ESXi. The link to the Red Hat bug database seems to require a login to view the bug, so at the time of writing it is inaccessible to me and I'm unable to follow any advice there. (I get the error message "You are not authorized to access bug #615543.")

I'm trying a complete reload (I install using a PXE boot image) to see if a newer kernel has been released that fixes the issue -- also to clean up the last vestiges of an OpenNebula installation that's no longer in use.

Stuart Longland (redhatter) wrote :

Just re-loaded, lspci reports the affected controller as being the following:

07:01.0 RAID bus controller: Compaq Computer Corporation Smart Array 64xx (rev 01)
        Subsystem: Compaq Computer Corporation Smart Array 642
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 18
        Memory at fdff0000 (64-bit, non-prefetchable) [size=8K]
        I/O ports at 5000 [size=256]
        Memory at fdf80000 (64-bit, non-prefetchable) [size=256K]
        [virtual] Expansion ROM at 80300000 [disabled] [size=256K]
        Capabilities: [d0] Power Management version 2
        Capabilities: [dc] PCI-X non-bridge device
        Capabilities: [f0] Vital Product Data
        Kernel driver in use: cciss
        Kernel modules: cciss

During boot, I see in `dmesg`:
[ 1.097409] HP CISS Driver (v 3.6.26)
[ 1.098444] cciss 0000:07:01.0: PCI IRQ 50 -> rerouted to legacy IRQ 18
[ 1.098524] cciss 0000:07:01.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[ 1.098694] cciss 0000:07:01.0: Controller reports max supported commands of 0, an obvious lie. Using 16. Ensure that firmware is up to date.
[ 1.216111] cciss 0000:07:01.0: cciss0: <0x46> at PCI 0000:07:01.0 IRQ 18 using DAC
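
The "max supported commands of 0" line is the driver's own hint that the controller firmware may be out of date. As a rough sketch, assuming the message wording matches the cciss output quoted above, such lines can be picked out of saved `dmesg` text. `find_firmware_warnings` and `MAX_CMDS` are hypothetical names used only for this example.

```python
import re

# Matches the cciss warning quoted above and captures the PCI address,
# the command count the controller reported, and the fallback the driver used.
MAX_CMDS = re.compile(
    r"cciss (?P<pci>[0-9a-f:.]+): Controller reports max supported commands of "
    r"(?P<reported>\d+), an obvious lie\. Using (?P<used>\d+)\."
)


def find_firmware_warnings(dmesg_text):
    """Return (pci_address, reported, used) for each matching warning line."""
    return [
        (m.group("pci"), int(m.group("reported")), int(m.group("used")))
        for m in MAX_CMDS.finditer(dmesg_text)
    ]
```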

I'm not sure what the firmware update procedure is on these things.

Ralf Hildebrandt (ralf-hildebrandt) wrote : Re: [Bug 666828] Re: task jbd2/cciss!c0d0:258 blocked for more than 120 seconds

* Stuart Longland <email address hidden>:

> [ 1.097409] HP CISS Driver (v 3.6.26)
> [ 1.098444] cciss 0000:07:01.0: PCI IRQ 50 -> rerouted to legacy IRQ 18
> [ 1.098524] cciss 0000:07:01.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
> [ 1.098694] cciss 0000:07:01.0: Controller reports max supported commands of 0, an obvious lie. Using 16. Ensure that firmware is up to date.
> [ 1.216111] cciss 0000:07:01.0: cciss0: <0x46> at PCI 0000:07:01.0 IRQ 18 using DAC
>
> I'm not sure what the firmware update procedure is on these things.

That's easy: You boot into a HP firmware update CD, which is basically a
linux boot disk.

The filename's FW1010.2012_0530.49.iso
I got a copy if you can't find the download...


Stuart Longland (redhatter) wrote :

Fair enough, I think I found some related files on the HP site, been reading through those.

What I don't understand is this: why is it suddenly a firmware issue now on Ubuntu 12.04, when it wasn't an issue on VMware ESX? The hardware has not changed. If there's a firmware bug, fine, but why was it handled before and not now?

I managed to get an oops out of the current running kernel (3.2.0-40-generic):

[69240.280022] INFO: task rs:main Q:Reg:405 blocked for more than 120 seconds.
[69240.280037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[69240.280050] rs:main Q:Reg D 0000000000000001 0 405 1 0x00000000
[69240.280057] ffff880036fbb9d8 0000000000000082 0000000000000001 0000001000000000
[69240.280065] ffff880036fbbfd8 ffff880036fbbfd8 ffff880036fbbfd8 00000000000137c0
[69240.280073] ffff8800785a1700 ffff88007a904500 ffff880036fbb9b8 ffff88007fc54080
[69240.280081] Call Trace:
[69240.280093] [<ffffffff81118630>] ? __lock_page+0x70/0x70
[69240.280102] [<ffffffff8165c58f>] schedule+0x3f/0x60
[69240.280107] [<ffffffff8165c63f>] io_schedule+0x8f/0xd0
[69240.280113] [<ffffffff8111863e>] sleep_on_page+0xe/0x20
[69240.280118] [<ffffffff8165ce5f>] __wait_on_bit+0x5f/0x90
[69240.280122] [<ffffffff811187a8>] wait_on_page_bit+0x78/0x80
[69240.280129] [<ffffffff8108bf00>] ? autoremove_wake_function+0x40/0x40
[69240.280134] [<ffffffff811194b2>] grab_cache_page_write_begin+0x92/0xe0
[69240.280140] [<ffffffff81078098>] ? lock_timer_base.isra.29+0x38/0x70
[69240.280176] [<ffffffffa00b81e0>] ? xfs_get_blocks_direct+0x20/0x20 [xfs]
[69240.280183] [<ffffffff811ad4a8>] block_write_begin+0x38/0xa0
[69240.280201] [<ffffffffa00b82a3>] xfs_vm_write_begin+0x43/0x70 [xfs]
[69240.280207] [<ffffffff81118a7a>] generic_perform_write+0xca/0x210
[69240.280213] [<ffffffff81118c1d>] generic_file_buffered_write+0x5d/0x90
[69240.280232] [<ffffffffa00be19c>] xfs_file_buffered_aio_write+0xfc/0x1c0 [xfs]
[69240.280238] [<ffffffff8165d736>] ? down_write+0x16/0x40
[69240.280257] [<ffffffffa00be3dc>] xfs_file_aio_write+0x17c/0x2a0 [xfs]
[69240.280263] [<ffffffff8109ead2>] ? unqueue_me+0x52/0x80
[69240.280267] [<ffffffff8109fc48>] ? futex_wait+0x108/0x210
[69240.280273] [<ffffffff8117900a>] do_sync_write+0xda/0x120
[69240.280279] [<ffffffff812d9c28>] ? apparmor_file_permission+0x18/0x20
[69240.280285] [<ffffffff8129f3bc>] ? security_file_permission+0x2c/0xb0
[69240.280289] [<ffffffff811795b1>] ? rw_verify_area+0x61/0xf0
[69240.280294] [<ffffffff81179913>] vfs_write+0xb3/0x180
[69240.280298] [<ffffffff81179c3a>] sys_write+0x4a/0x90
[69240.280303] [<ffffffff81666a82>] system_call_fastpath+0x16/0x1b
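
In this trace the task is sleeping in `io_schedule` via `wait_on_page_bit` during an XFS buffered write, i.e. it is waiting for page I/O to complete. A small Python sketch for pulling the frames out of such a trace is given here. `parse_call_trace` and `FRAME` are hypothetical names, and the pattern assumes the `[<address>] symbol+0xoff/0xlen [module]` layout shown above. Frames prefixed with `?` are addresses found on the stack that the kernel could not confirm as part of the call chain, so they are flagged as unreliable.

```python
import re

# One stack frame: optional "? " marker, symbol name (which may contain dots,
# as in "lock_timer_base.isra.29"), offset/length, and an optional module.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unreliable>\? )?(?P<symbol>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def parse_call_trace(text):
    """Return a list of (symbol, module_or_None, reliable) per frame line."""
    frames = []
    for line in text.splitlines():
        match = FRAME.search(line)
        if match:
            reliable = match.group("unreliable") is None
            frames.append((match.group("symbol"), match.group("module"), reliable))
    return frames
```

Filtering on the `reliable` flag leaves the actual call chain, from `schedule` at the top down to `system_call_fastpath`.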

dino99 (9d9)
Changed in linux (Ubuntu Natty):
status: Fix Committed → Fix Released
Altin Ukshini (altin) wrote :

Ok, seems like this issue is still present...

I'm running Ubuntu 14.04.2 LTS
Kernel: 3.16.0-41-generic x86_64
Hardware: HP Proliant DL380

...but instead I get:
task Not tainted 3.16.0-41-generic #55~14.04.1-Ubuntu
task bash:17087 blocked for more than 120 seconds
task jbd2/cciss!c0d0:156 blocked for more than 120 seconds
task rs:main Q:Reg:633 blocked for more than 120 seconds
...

Any fixes for Trusty?

AlainKnaff (kubuntu-misc) wrote :

Just saw this one as well today, on a 12.04.5 with a 3.13 kernel

SSH and the serial console were unresponsive.

Console displayed message about jbd2 being blocked for more than 120 seconds, and then loads of other processes (apache2) blocked for 120 seconds as well (but jbd2 was the first such message chronologically)

Scrolling up and down virtual console worked.

Changing virtual consoles worked as well, and login prompts were displayed on the other virtual consoles, but after entering a user name and pressing return, the login would just hang (no output apart from the user name itself).

After reboot (by killing the kvm process handling this virtual machine), logfiles showed a bunch of NUL characters (^@) before the reboot messages, and last message before these NULs was approx 2 hours before the crash.
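
To locate such a gap in a log file, the NUL runs can be found by scanning the raw bytes. This is a minimal sketch, not a tool used in the report: `find_nul_runs` and its `min_length` threshold are hypothetical, and the file is assumed to be small enough to read into memory.

```python
def find_nul_runs(data, min_length=16):
    """Return (offset, length) for each run of at least min_length NUL bytes."""
    runs = []
    start = None
    for offset, byte in enumerate(data):
        if byte == 0:
            if start is None:
                start = offset  # a new run of NULs begins here
        elif start is not None:
            if offset - start >= min_length:
                runs.append((start, offset - start))
            start = None
    # Handle a run that extends to the end of the data.
    if start is not None and len(data) - start >= min_length:
        runs.append((start, len(data) - start))
    return runs
```

It could be applied as `find_nul_runs(open("/var/log/syslog", "rb").read())`, where the path is only an example. The last readable line before the first reported offset is the last message that reached the disk.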

We've been running this virtual machine for a while, and this is the first time this happened.
