task blocked for more than 120 seconds on server kernel

Bug #652812 reported by Lars
This bug affects 9 people

Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

Hi,

This is about an Ubuntu Server installation.
The server consists mainly of fast HDDs and two externally attached LTO-3 tape drives in a changer.
Its purpose is to sync with other servers and then write everything onto both tape drives in parallel overnight.

The following is our main problem:
[ 1081.590063] INFO: task mbuffer1:2589 blocked for more than 120 seconds.
[ 1081.590577] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.591151] mbuffer1 D 0000000000000000 0 2589 2560 0x00000000
[ 1081.591162] ffff88080cee9c18 0000000000000082 0000000000015bc0 0000000000015bc0
[ 1081.591173] ffff8803f87ac890 ffff88080cee9fd8 0000000000015bc0 ffff8803f87ac4d0
[ 1081.591181] 0000000000015bc0 ffff88080cee9fd8 0000000000015bc0 ffff8803f87ac890
[ 1081.591189] Call Trace:
[ 1081.591208] [<ffffffff815583ad>] schedule_timeout+0x22d/0x300
[ 1081.591220] [<ffffffff812b4567>] ? kobject_put+0x27/0x60
[ 1081.591228] [<ffffffff81559f45>] ? _spin_lock_irq+0x15/0x20
[ 1081.591238] [<ffffffff8138a90a>] ? scsi_request_fn+0xda/0x5e0
[ 1081.591246] [<ffffffff81557656>] wait_for_common+0xd6/0x180
[ 1081.591256] [<ffffffff8129de33>] ? __generic_unplug_device+0x33/0x40
[ 1081.591266] [<ffffffff8105a350>] ? default_wake_function+0x0/0x20
[ 1081.591286] [<ffffffffa015c4d8>] ? T.945+0x158/0x170 [st]
[ 1081.591294] [<ffffffff815577bd>] wait_for_completion+0x1d/0x20
[ 1081.591305] [<ffffffffa015c637>] T.944+0x127/0x270 [st]
[ 1081.591315] [<ffffffffa0162092>] st_write+0x5a2/0xc70 [st]
[ 1081.591324] [<ffffffff8105a380>] ? wake_up_state+0x10/0x20
[ 1081.591334] [<ffffffff81143aa8>] vfs_write+0xb8/0x1a0
[ 1081.591342] [<ffffffff81144311>] sys_write+0x51/0x80
[ 1081.591351] [<ffffffff810121b2>] system_call_fastpath+0x16/0x1b
[ 1081.591358] INFO: task mbuffer2:2608 blocked for more than 120 seconds.
[ 1081.591800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.592374] mbuffer2 D 0000000000000000 0 2608 2591 0x00000000
[ 1081.592383] ffff8800df895c18 0000000000000082 0000000000015bc0 0000000000015bc0
[ 1081.592392] ffff8803f87a9ab0 ffff8800df895fd8 0000000000015bc0 ffff8803f87a96f0
[ 1081.592400] 0000000000015bc0 ffff8800df895fd8 0000000000015bc0 ffff8803f87a9ab0
[ 1081.592408] Call Trace:
[ 1081.592417] [<ffffffff815583ad>] schedule_timeout+0x22d/0x300
[ 1081.592425] [<ffffffff812b4567>] ? kobject_put+0x27/0x60
[ 1081.592432] [<ffffffff81559f45>] ? _spin_lock_irq+0x15/0x20
[ 1081.592439] [<ffffffff8138a90a>] ? scsi_request_fn+0xda/0x5e0
[ 1081.592448] [<ffffffff81557656>] wait_for_common+0xd6/0x180
[ 1081.592456] [<ffffffff8129de33>] ? __generic_unplug_device+0x33/0x40
[ 1081.592464] [<ffffffff8105a350>] ? default_wake_function+0x0/0x20
[ 1081.592474] [<ffffffffa015c4d8>] ? T.945+0x158/0x170 [st]
[ 1081.592482] [<ffffffff815577bd>] wait_for_completion+0x1d/0x20
[ 1081.592492] [<ffffffffa015c637>] T.944+0x127/0x270 [st]
[ 1081.592502] [<ffffffffa0162092>] st_write+0x5a2/0xc70 [st]
[ 1081.592510] [<ffffffff8105a380>] ? wake_up_state+0x10/0x20
[ 1081.592518] [<ffffffff81143aa8>] vfs_write+0xb8/0x1a0
[ 1081.592525] [<ffffffff81144311>] sys_write+0x51/0x80
[ 1081.592533] [<ffffffff810121b2>] system_call_fastpath+0x16/0x1b
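
For anyone comparing logs from several nights, the watchdog reports above can be pulled out of a dmesg capture with a short script. This is only a minimal sketch: the regular expression is modelled on the two "blocked for more than" lines shown above, and the function name find_hung_tasks is made up for illustration.

```python
import re

# Pattern modelled on the hung-task watchdog lines in the log above.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return (timestamp, task name, pid, timeout) for each watchdog report."""
    reports = []
    for line in dmesg_text.splitlines():
        m = HUNG_RE.search(line)
        if m:
            reports.append(
                (float(m["ts"]), m["comm"], int(m["pid"]), int(m["secs"]))
            )
    return reports


# Three lines copied from the report; only two are watchdog messages.
sample = """\
[ 1081.590063] INFO: task mbuffer1:2589 blocked for more than 120 seconds.
[ 1081.591189] Call Trace:
[ 1081.591358] INFO: task mbuffer2:2608 blocked for more than 120 seconds.
"""

for report in find_hung_tasks(sample):
    print(report)
```

Both mbuffer processes are reported within about a millisecond of each other, which fits the observation that the two tape writes stall together.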

After the fifth 120-second delay, the following aborts the backup:
[ 1818.980059] mptscsih: ioc1: attempting task abort! (sc=ffff880057bb7000)
[ 1818.980067] st 6:0:4:0: CDB: Write(6): 0a 00 04 00 00 00
[ 1829.300042] mptscsih: ioc1: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!!
[ 1831.280030] mptscsih: ioc1: task abort: SUCCESS (sc=ffff880057bb7000)
[ 1831.282296] mptscsih: ioc1: attempting task abort! (sc=ffff880057bb6a00)
[ 1831.282302] st 6:0:5:0: CDB: Write(6): 0a 00 04 00 00 00
[ 1831.282321] mptscsih: ioc1: task abort: SUCCESS (sc=ffff880057bb6a00)
[ 1831.284945] st0: Error 80000 (driver bt 0x0, host bt 0x8).
[ 1831.285106] st1: Error 80000 (driver bt 0x0, host bt 0x8).
[ 1831.490044] scsi target6:0:4: Beginning Domain Validation
[ 1831.637097] scsi target6:0:4: Ending Domain Validation
[ 1831.637208] scsi target6:0:4: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 64)
[ 1834.150032] scsi target6:0:5: Beginning Domain Validation
[ 1834.297533] scsi target6:0:5: Ending Domain Validation
[ 1834.297649] scsi target6:0:5: FAST-160 WIDE SCSI 320.0 MB/s DT IU RTI PCOMP (6.25 ns, offset 64)
[ 1910.340056] scsi target6:0:5: Beginning Domain Validation
[ 1910.729074] scsi target6:0:5: Ending Domain Validation
[ 1910.729194] scsi target6:0:5: FAST-160 WIDE SCSI 320.0 MB/s DT IU RTI PCOMP (6.25 ns, offset 64)
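
As a reading aid for the abort messages above, here is a minimal Python sketch that decodes the logged CDB and the st error value. It assumes the standard WRITE(6) layout for sequential-access devices (bit 0 of byte 1 is the FIXED flag, bytes 2-4 are the transfer length) and the Linux SCSI result layout with the host byte in bits 16-23 and the driver byte in bits 24-31. It also assumes that "Error 80000" is printed in hex, which matches the "host bt 0x8" in the same line. The function names are illustrative only.

```python
def decode_write6(cdb_hex):
    """Decode a WRITE(6) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 6 or cdb[0] != 0x0A:
        raise ValueError("not a WRITE(6) CDB")
    fixed = bool(cdb[1] & 0x01)               # FIXED bit: length counts blocks
    length = int.from_bytes(cdb[2:5], "big")  # 24-bit transfer length
    return fixed, length


# Subset of the Linux host byte codes (DID_* values).
HOST_BYTES = {
    0x00: "DID_OK",
    0x03: "DID_TIME_OUT",
    0x05: "DID_ABORT",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def decode_result(result):
    """Split a SCSI result word into (driver byte, host byte, host name)."""
    host = (result >> 16) & 0xFF
    driver = (result >> 24) & 0xFF
    return driver, host, HOST_BYTES.get(host, "unknown")


print(decode_write6("0a 00 04 00 00 00"))
print(decode_result(0x80000))
```

Under these assumptions the aborted command is a variable-block write of 262144 bytes (256 KiB), and host byte 0x08 corresponds to DID_RESET, which is consistent with the reset issued by mptscsih just before the st errors.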

This is with the SAS-LSI driver manually updated to version:
# cat /sys/module/mptbase/version
4.24.00.00

I updated it because the driver supplied with the kernel loses connections to SATA drives (observed with 2.6.32-23).
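
The version check above can also be scripted, for example to confirm after a kernel update that the manually installed driver is still the one being loaded. A minimal sketch, assuming the usual /sys/module/<name>/version layout; the helper name module_version and the list of modules to check are just illustrative (the names are taken from the modified .ko files listed in the package information).

```python
from pathlib import Path


def module_version(name, sys_root="/sys"):
    """Return the version string of a loaded module, or None if unavailable."""
    path = Path(sys_root) / "module" / name / "version"
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, or it does not export a version attribute.
        return None


for name in ("mptbase", "mptscsih", "mptspi", "mptsas"):
    print(name, module_version(name))
```

On a machine without the fusion driver loaded this simply prints None for each module.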

This is a really serious bug for this server! It prevents it from doing backups.
Please also read Bug 494476

regards
Lars

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-25-server 2.6.32-25.44 [modified: lib/modules/2.6.32-25-server/kernel/drivers/message/fusion/mptbase.ko lib/modules/2.6.32-25-server/kernel/drivers/message/fusion/mptctl.ko lib/modules/2.6.32-25-server/kernel/drivers/message/fusion/mptfc.ko lib/modules/2.6.32-25-server/kernel/drivers/message/fusion/mptlan.ko lib/modules/2.6.32-25-server/kernel/drivers/message/fusion/mptsas.ko lib/modules/2.6.32-25-server/kernel/drivers/message/fusion/mptscsih.ko lib/modules/2.6.32-25-server/kernel/drivers/message/fusion/mptspi.ko]
Regression: No
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-25.44-server 2.6.32.21+drm33.7
Uname: Linux 2.6.32-25-server x86_64
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
CurrentDmesg:

Date: Fri Oct 1 10:20:57 2010
MachineType: Supermicro H8DI3+
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-25-server root=LABEL=WURZEL ro elevator=noop quiet splash
ProcEnviron:
 LANG=de_DE.UTF-8
 SHELL=/bin/bash
SourcePackage: linux
dmi.bios.date: 12/07/2009
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1.0b
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: H8DI3+
dmi.board.vendor: Supermicro
dmi.board.version: 1234567890
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 1234567890
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1.0b:bd12/07/2009:svnSupermicro:pnH8DI3+:pvr1234567890:rvnSupermicro:rnH8DI3+:rvr1234567890:cvnSupermicro:ct3:cvr1234567890:
dmi.product.name: H8DI3+
dmi.product.version: 1234567890
dmi.sys.vendor: Supermicro

Lars (lars-taeuber) wrote :
Jeremy Foshee (jeremyfoshee) wrote :

Hi Lars,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Lars (lars-taeuber) wrote :

Hi Jeremy,

I tried the following kernel:
2.6.34-020634_amd64
but couldn't boot.

The result is attached as jpeg.

Regards
Lars

Matthias Henze (matthias-mhcsoftware) wrote :

The same problem occurs here on an HP ML350G6 with a SC11Xe SCSI controller. The problem occurs randomly as soon as a SCSI tape drive is attached to the controller. Without a tape drive, but with the controller installed, there is no problem at all. Seen on:

Distributor ID: Ubuntu
Description: Ubuntu 10.04.1 LTS
Release: 10.04

Linux basis01 2.6.32-25-server #44-Ubuntu SMP Fri Sep 17 21:13:39 UTC 2010 x86_64 GNU/Linux

Charlie Kravetz (cjkgeek) wrote :

 Thanks for reporting this bug and any supporting documentation. Since this bug has enough information provided for a developer to begin work, I'm going to mark it as confirmed and let them handle it from here.

@Matthias: Please file a new bug using "ubuntu-bug linux" for your hardware. Since very small hardware changes can require a different fix, a new bug report brings your individual hardware to the attention of the developers. You can add a comment to your bug referencing bug 652812 as very similar.

Thanks for taking the time to make Ubuntu better!


Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Confirmed
Lars (lars-taeuber) wrote :

Hi,

The probability of the hang is significantly higher when writing to both tape drives in parallel.
When writing to only one drive, there is a good chance that the backup completes correctly.

By the way, I use mbuffer with a lot of memory for the backup process.

Normally both pipelines run in parallel:
RAID6 -> XFS -> tar -> mbuffer1 (12GB RAM) -> tape drive1 (LTO-3)
RAID6 -> XFS -> tar -> mbuffer2 (12GB RAM) -> tape drive2 (LTO-3)
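
To make the two pipelines concrete, here is a rough Python sketch that builds the tar and mbuffer command lines for one tape drive. The exact options used on the server are not given in this report, so everything here is an assumption: the source directory /srv/data and the device nodes /dev/nst0 and /dev/nst1 are hypothetical, and only the 12 GB buffer size comes from the description above. It relies on mbuffer's -m (buffer size) and -o (output file) options.

```python
import shlex


def backup_pipeline(source_dir, tape_device, buffer_size="12G"):
    """Build the argv lists for one tar -> mbuffer -> tape pipeline."""
    tar_cmd = ["tar", "-c", "-f", "-", source_dir]      # archive to stdout
    mbuffer_cmd = ["mbuffer", "-m", buffer_size, "-o", tape_device]
    return tar_cmd, mbuffer_cmd


def as_shell(tar_cmd, mbuffer_cmd):
    """Render the two commands as a single shell pipeline string."""
    return " | ".join(shlex.join(cmd) for cmd in (tar_cmd, mbuffer_cmd))


# Hypothetical paths and devices, one pipeline per tape drive.
for device in ("/dev/nst0", "/dev/nst1"):
    print(as_shell(*backup_pipeline("/srv/data", device)))
```

The sketch only prints the command lines and does not start anything. Whether the two pipelines are then launched together or one after the other is exactly the difference described above.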

regards
Lars

Matthias Henze (matthias-mhcsoftware) wrote :

@Charlie

With a new kernel from https://wiki.ubuntu.com/KernelMainlineBuilds my problem was fixed. For my tests I used 2.6.36-999-generic and had no more SCSI issues. However, I have to keep using the old kernel, where SCSI is not usable, because I'm unable to build the VMware modules for the newer kernel.

Lars (lars-taeuber) wrote :

Hi Matthias,

Did you use the kernel from this directory (a Maverick kernel on a Lucid system)?

http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2010-10-10-maverick/

Thanks
Lars

Matthias Henze (matthias-mhcsoftware) wrote :
Lars (lars-taeuber) wrote :

Hi there!

I couldn't boot the kernel Matthias had success with. It ended up with a screen similar to the one from the kernel I tried before:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/652812/comments/3

Regards
Lars

penalvch (penalvch) wrote :

Lars, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.11-rc5

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

tags: added: latest-bios-1.0b regression-potential
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired