"Smbd","kjournald2" and "rsync" blocked for more than 120 seconds while using ext4.

Bug #494476 reported by Okkulter
This bug affects 15 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Won't Fix  Medium      Unassigned

Bug Description

Binary package hint: gnome-utils

Hi there,
here is some additional information on this bug:

While creating a .bin file (with ImgBurn) from my MS Windows Vista client, the application times out and loses its connection to the Samba server for a while.
The Ubuntu 9.10 Samba server then logs these messages in syslog:

Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430686] INFO: task smbd:15032 blocked for more than 120 seconds.
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430693] smbd D c08145c0 0 15032 2688 0x00000004
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430697] f4f9dba8 00000086 f4f9db98 c08145c0 f603e718 c08145c0 5860a799 0000058c
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430703] c08145c0 c08145c0 f603e718 c08145c0 585ff473 0000058c c08145c0 f66dfc00
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430707] f603e480 f4f9dbdc f603e480 f4e71748 f4f9dbd4 c0570b55 fffeffff f1858e40
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430712] Call Trace:
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430719] [<c0570b55>] rwsem_down_failed_common+0x75/0x1a0
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430723] [<c04c5c68>] ? ip_local_out+0x18/0x20
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430726] [<c0570ccd>] rwsem_down_read_failed+0x1d/0x30
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430729] [<c0570d27>] call_rwsem_down_read_failed+0x7/0x10
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430731] [<c05702e7>] ? down_read+0x17/0x20
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430735] [<c026577a>] ext4_get_blocks+0x3a/0x200
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430738] [<c02085c0>] ? alloc_buffer_head+0x10/0x40
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430740] [<c02659ca>] ext4_da_get_block_prep+0x8a/0x120
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430743] [<c020b86a>] __block_prepare_write+0x14a/0x3a0
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430747] [<c01b3114>] ? add_to_page_cache_lru+0x44/0x70
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430750] [<c020bc6e>] block_write_begin+0x4e/0xe0
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430752] [<c0265940>] ? ext4_da_get_block_prep+0x0/0x120
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430755] [<c0268c18>] ext4_da_write_begin+0x168/0x300
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430757] [<c0265940>] ? ext4_da_get_block_prep+0x0/0x120
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430760] [<c01b26c9>] generic_perform_write+0xa9/0x190
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430763] [<c01b33fd>] generic_file_buffered_write+0x5d/0x150
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430766] [<c01b4da3>] __generic_file_aio_write_nolock+0x1f3/0x510
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430770] [<c03b918a>] ? scsi_next_command+0x3a/0x50
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430772] [<c01b51d9>] generic_file_aio_write+0x59/0xd0
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430776] [<c025f6fc>] ext4_file_write+0x4c/0x180
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430780] [<c01e765c>] do_sync_write+0xbc/0x100
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430783] [<c015c280>] ? autoremove_wake_function+0x0/0x40
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430787] [<c014b405>] ? __do_softirq+0xe5/0x1a0
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430790] [<c018f8fc>] ? handle_IRQ_event+0x4c/0x140
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430794] [<c02c7fbf>] ? security_file_permission+0xf/0x20
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430797] [<c01e77ff>] ? rw_verify_area+0x5f/0xe0
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430799] [<c048e176>] ? sys_recv+0x36/0x40
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430802] [<c01e791a>] vfs_write+0x9a/0x190
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430804] [<c01e75a0>] ? do_sync_write+0x0/0x100
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430807] [<c0104f10>] ? do_IRQ+0x50/0xc0
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430810] [<c01e84f3>] sys_pwrite64+0x63/0x80
Dec 9 11:38:59 MeinFileServer kernel: [ 6240.430812] [<c010336c>] syscall_call+0x7/0xb

... and once again....

Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428543] INFO: task smbd:15032 blocked for more than 120 seconds.
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428549] smbd D c08145c0 0 15032 2688 0x00000004
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428553] f4f9dba8 00000086 f4f9db98 c08145c0 f603e718 c08145c0 5860a799 0000058c
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428558] c08145c0 c08145c0 f603e718 c08145c0 585ff473 0000058c c08145c0 f66dfc00
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428563] f603e480 f4f9dbdc f603e480 f4e71748 f4f9dbd4 c0570b55 fffeffff f1858e40
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428567] Call Trace:
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428575] [<c0570b55>] rwsem_down_failed_common+0x75/0x1a0
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428579] [<c04c5c68>] ? ip_local_out+0x18/0x20
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428582] [<c0570ccd>] rwsem_down_read_failed+0x1d/0x30
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428584] [<c0570d27>] call_rwsem_down_read_failed+0x7/0x10
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428587] [<c05702e7>] ? down_read+0x17/0x20
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428590] [<c026577a>] ext4_get_blocks+0x3a/0x200
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428593] [<c02085c0>] ? alloc_buffer_head+0x10/0x40
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428596] [<c02659ca>] ext4_da_get_block_prep+0x8a/0x120
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428599] [<c020b86a>] __block_prepare_write+0x14a/0x3a0
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428602] [<c01b3114>] ? add_to_page_cache_lru+0x44/0x70
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428605] [<c020bc6e>] block_write_begin+0x4e/0xe0
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428608] [<c0265940>] ? ext4_da_get_block_prep+0x0/0x120
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428610] [<c0268c18>] ext4_da_write_begin+0x168/0x300
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428613] [<c0265940>] ? ext4_da_get_block_prep+0x0/0x120
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428615] [<c01b26c9>] generic_perform_write+0xa9/0x190
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428618] [<c01b33fd>] generic_file_buffered_write+0x5d/0x150
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428621] [<c01b4da3>] __generic_file_aio_write_nolock+0x1f3/0x510
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428625] [<c03b918a>] ? scsi_next_command+0x3a/0x50
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428628] [<c01b51d9>] generic_file_aio_write+0x59/0xd0
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428632] [<c025f6fc>] ext4_file_write+0x4c/0x180
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428635] [<c01e765c>] do_sync_write+0xbc/0x100
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428638] [<c015c280>] ? autoremove_wake_function+0x0/0x40
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428642] [<c014b405>] ? __do_softirq+0xe5/0x1a0
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428645] [<c018f8fc>] ? handle_IRQ_event+0x4c/0x140
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428649] [<c02c7fbf>] ? security_file_permission+0xf/0x20
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428651] [<c01e77ff>] ? rw_verify_area+0x5f/0xe0
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428654] [<c048e176>] ? sys_recv+0x36/0x40
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428656] [<c01e791a>] vfs_write+0x9a/0x190
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428659] [<c01e75a0>] ? do_sync_write+0x0/0x100
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428662] [<c0104f10>] ? do_IRQ+0x50/0xc0
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428664] [<c01e84f3>] sys_pwrite64+0x63/0x80
Dec 9 11:40:59 MeinFileServer kernel: [ 6360.428667] [<c010336c>] syscall_call+0x7/0xb

i hope this may help you

good luck

ProblemType: Bug
Architecture: i386
Date: Wed Dec 9 13:21:37 2009
DistroRelease: Ubuntu 9.10
ExecutablePath: /usr/bin/gnome-system-log
NonfreeKernelModules: nvidia
Package: gnome-utils 2.28.1-0ubuntu1
ProcEnviron:
 LANG=de_DE.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.31-16.52-generic
SourcePackage: gnome-utils
Uname: Linux 2.6.31-16-generic i686
XsessionErrors:
 (gnome-settings-daemon:2874): GLib-CRITICAL **: g_propagate_error: assertion `src != NULL' failed
 (gnome-settings-daemon:2874): GLib-CRITICAL **: g_propagate_error: assertion `src != NULL' failed
 (polkit-gnome-authentication-agent-1:3158): GLib-CRITICAL **: g_once_init_leave: assertion `initialization_value != 0' failed
 (nautilus:3113): Eel-CRITICAL **: eel_preferences_get_boolean: assertion `preferences_is_initialized ()' failed
 (gnome-panel:2953): Gdk-WARNING **: /build/buildd/gtk+2.0-2.18.3/gdk/x11/gdkdrawable-x11.c:952 drawable is not a pixmap or window

Okkulter (hans1-oelmann) wrote :
affects: gnome-utils (Ubuntu) → samba (Ubuntu)
Okkulter (hans1-oelmann) wrote :

hi there,

Here are two more tasks blocked for more than 120 seconds, but this time the affected tasks are "kjournald2" followed by "rsync". This happened when I tried to copy from one ext4 volume to another with rsync.
So it is not Samba-specific as I assumed at first; it seems to be a problem with ext4.

I have had these problems since I converted two of my volumes from ext3 to ext4 last week.
So I have to go back to ext3 to avoid losing data.

Dec 10 22:48:36 MeinFileServer kernel: [36480.412032] INFO: task kjournald2:1064 blocked for more than 120 seconds.
Dec 10 22:48:36 MeinFileServer kernel: [36480.412036] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 10 22:48:36 MeinFileServer kernel: [36480.412038] kjournald2 D c08145c0 0 1064 2 0x00000000
Dec 10 22:48:36 MeinFileServer kernel: [36480.412042] e5983ec0 00000046 5b337856 c08145c0 f69f2848 c08145c0 c477cdf8 000020fc
Dec 10 22:48:36 MeinFileServer kernel: [36480.412047] c08145c0 c08145c0 f69f2848 c08145c0 c477c209 000020fc c08145c0 e42b9180
Dec 10 22:48:36 MeinFileServer kernel: [36480.412052] f69f25b0 e5983f50 f646c478 e5983f44 e5983f74 c029a551 e5983ee4 c03154d4
Dec 10 22:48:36 MeinFileServer kernel: [36480.412057] Call Trace:
Dec 10 22:48:36 MeinFileServer kernel: [36480.412064] [<c029a551>] jbd2_journal_commit_transaction+0x161/0xe80
Dec 10 22:48:36 MeinFileServer kernel: [36480.412069] [<c03154d4>] ? rb_erase+0xb4/0x120
Dec 10 22:48:36 MeinFileServer kernel: [36480.412074] [<c01332b1>] ? __dequeue_entity+0x21/0x40
Dec 10 22:48:36 MeinFileServer kernel: [36480.412077] [<c0150417>] ? lock_timer_base+0x27/0x50
Dec 10 22:48:36 MeinFileServer kernel: [36480.412080] [<c015c280>] ? autoremove_wake_function+0x0/0x40
Dec 10 22:48:36 MeinFileServer kernel: [36480.412082] [<c0150485>] ? try_to_del_timer_sync+0x45/0x50
Dec 10 22:48:36 MeinFileServer kernel: [36480.412086] [<c029f9ee>] kjournald2+0xce/0x200
Dec 10 22:48:36 MeinFileServer kernel: [36480.412088] [<c015c280>] ? autoremove_wake_function+0x0/0x40
Dec 10 22:48:36 MeinFileServer kernel: [36480.412091] [<c029f920>] ? kjournald2+0x0/0x200
Dec 10 22:48:36 MeinFileServer kernel: [36480.412094] [<c015bf8c>] kthread+0x7c/0x90
Dec 10 22:48:36 MeinFileServer kernel: [36480.412096] [<c015bf10>] ? kthread+0x0/0x90
Dec 10 22:48:36 MeinFileServer kernel: [36480.412099] [<c0104007>] kernel_thread_helper+0x7/0x10
Dec 10 22:48:36 MeinFileServer kernel: [36480.412117] INFO: task rsync:14130 blocked for more than 120 seconds.
Dec 10 22:48:36 MeinFileServer kernel: [36480.412118] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 10 22:48:36 MeinFileServer kernel: [36480.412120] rsync D c08145c0 0 14130 14129 0x00000000
Dec 10 22:48:36 MeinFileServer kernel: [36480.412123] dd297ba0 00200086 dd2d2000 c08145c0 f6abc168 c08145c0 61295f12 000020fc
Dec 10 22:48:36 MeinFileServer kernel: [36480.412128] c08145c0 c08145c0 f6abc168 c08145c0 00000000 000020fc c08145c0 f641ddc0
Dec 10 22:48:36 MeinFileServer kernel: [36480.412133] f6abbed0 dd297bd4 f6abbed0 c87cc508 dd297bcc c0570b75 fffeffff 00000304
Dec 10 22:48:36 MeinFileServer kernel: [364...


Chuck Short (zulcss)
affects: samba (Ubuntu) → linux (Ubuntu)
Okkulter (hans1-oelmann)
summary: - Samba blocked for more than 120 seconds.
+ "Smb","kjournald2" and "rsync" blocked for more than 120 seconds.
summary: - "Smb","kjournald2" and "rsync" blocked for more than 120 seconds.
+ "Smbd","kjournald2" and "rsync" blocked for more than 120 seconds.
summary: - "Smbd","kjournald2" and "rsync" blocked for more than 120 seconds.
+ "Smbd","kjournald2" and "rsync" blocked for more than 120 seconds
+ while using ext4.
Andy Whitcroft (apw)
tags: added: karmic
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Surbhi Palande (csurbhi) wrote :

@Okkulter, can you run apport-collect to attach log files to this bug? A complete output of dmesg from when you hit the bug would be helpful. Can you also try the latest kernel at: https://launchpad.net/~stefan-bader-canonical/+archive/karmic/+packages to see if it helps?

Surbhi Palande (csurbhi)
Changed in linux (Ubuntu):
assignee: nobody → Surbhi Palande (csurbhi)
Alvin (alvind) wrote :

Bug 276476 states that we should file separate bugs for the blocked processes (if I interpret the comments correctly). Can we instead dump everything here?

The easiest way to reproduce this is just running rsync or (s)cp with a big file. I see mainly blocked kvm, pdflush and kjournald in kern.log before the server goes down and I have to find a reset button. (doesn't always go down, but I disabled the rsync backup for now.)

tom (thomas-gutzler) wrote :

Hi,
Is this still alive? Should I post here instead: https://bugs.launchpad.net/bugs/276476 ?

I've been getting the same errors (INFO: task xyz blocked for more than 120 seconds) since I upgraded to ext4 on my 4TB /home partition. It gets a lot of I/O including rsync backups, smbd and nfs fileserver for ~20 people. I have attached a dmesg.

Unfortunately, apport-collect doesn't work for me because the machine doesn't have X and the launchpad logon page doesn't have a selectable [Continue] button in w3m; only cancel works.

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 9.10
Release: 9.10
Codename: karmic
# uname -a
Linux io 2.6.31-20-server #58-Ubuntu SMP Fri Mar 12 05:40:05 UTC 2010 x86_64 GNU/Linux

Okkulter (hans1-oelmann) wrote :

hi there,

As a result of this error (behavior?) I switched back from ext4 to ext3 on my 14 TB softraid volume, and there are no more blocking messages.
==> definitely an error with ext4

Greetings to all who are waiting for a fix.

I will keep an eye on this.

thx

Neil Broomfield (neil-broomfield) wrote :

I think it could be related to this ext4 bug on kernel.org. Hopefully an ext4 dev can shed more light on the issue?

https://bugzilla.kernel.org/show_bug.cgi?id=14830

Nimlar (nicolas-toromanoff) wrote :

I have had (nearly) the same issue twice in 2 days (on karmic).

But I am not using ext4 (only ext3 and jfs).

In the first crash, trackerd was the first blocked task:
(from kern.log)
Apr 19 11:28:43 localhost kernel: [156600.910099] INFO: task trackerd:3486 blocked for more than 120 seconds.
(full kern.log on demand)

The second time it was cron (but I didn't check whether the script it ran was I/O-intensive or not)
(from kern.log)
Apr 20 14:20:59 localhost kernel: [68160.710147] INFO: task cron:28148 blocked for more than 120 seconds.
(the log of Apr 20 seems to be polluted by the SysRq I misused)

No "strange" updates in the last few days, but I enabled a new django site (in debug mode) the day before the first crash => it may be hogging memory.

For my specific case (which may differ from the one reported here) I suspect 2 causes:
   * hardware failure.
   * or bad behavior of something under low memory plus heavy I/O activity.
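
For anyone comparing logs across the reports here, this is a minimal Python sketch that pulls the task name, PID and threshold out of the "blocked for more than N seconds" lines in a kern.log. The function names are made up for illustration, and the regular expression assumes only the message format quoted in this report; other kernels may word the message differently.

```python
import re
from collections import Counter

# Matches the hung-task warning as quoted in this bug report, e.g.
# "INFO: task trackerd:3486 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def blocked_tasks(lines):
    """Yield (task name, pid, threshold in seconds) per hung-task warning."""
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            yield (
                match.group("name"),
                int(match.group("pid")),
                int(match.group("secs")),
            )


def count_by_task(lines):
    """Count how often each task name shows up as blocked."""
    return Counter(name for name, _pid, _secs in blocked_tasks(lines))


if __name__ == "__main__":
    # The path is an assumption; adjust it to wherever your kernel log lives.
    with open("/var/log/kern.log", errors="replace") as log:
        for name, hits in count_by_task(log).most_common():
            print(name, hits)
```

Counting by task name rather than PID makes it easier to see whether the same kind of process (smbd, rsync, kjournald2, ...) keeps getting stuck across reboots.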

Nimlar (nicolas-toromanoff) wrote :

After some more investigation (due to a third crash, and my server's unavailability) I found something new in the log:

At the same time as each of the crashes, my apache2 access.log shows a lot of concurrent accesses from the same IP (I think from a badly configured corporate proxy). This may be the root cause.

tom (thomas-gutzler) wrote :

As suggested by Brian Rogers in bug #276476 I'm posting an update here. Hope to attract some attention!

I'm running a file server that shares home directories and other directories (~4TB, ext4) to 4 linux clients via nfs and about 20 windows and mac clients via samba. Rsync runs several times per day to mirror all files onto a second server. This seems to be the main trigger for the freezes to happen but I've seen it being caused by other I/O intensive operations, too.
Mostly, the file server recovers 'quickly' and the freezes only last for 10 to 60 seconds. Such interruptions occur at least once per hour.
When working on a linux client, everything stops (possibly due to the shared ~) until the file server has recovered. On windows clients the program accessing the share freezes. During that time, the load of the server increases rapidly as more and more processes are queuing up for a read/write.
Rebooting the server helps but the first 120 second block normally happens within several hours and rebooting every time is not an option.
Blocked tasks include pdflush, nfsd and kjournald2.
OS is Ubuntu karmic 9.10 on 2.6.31-21-server #59-Ubuntu SMP Wed Mar 24 08:26:06 UTC 2010 x86_64
All packages updated

Andy Whitcroft (apw)
tags: added: kernel-candidate kernel-reviewed
tags: removed: kernel-candidate
Brad Figg (brad-figg) wrote :

@Okkulter,

I understand given your setup you may not be able to do this however, if it were possible for you to test with the latest development kernel that would be helpful.

ISO CD images are available from http://cdimage.ubuntu.com/daily/current/ .

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds .

Changed in linux (Ubuntu):
status: Triaged → Incomplete
Lars (lars-taeuber) wrote :

Hi,

It seems I have the same problem with an mbuffer process reading from a software RAID 6 of 24 SATA drives with XFS on it.
It's our backup system that writes everything (5.9TB) to tapes.

In the end the mbuffer stops with this error:

mbuffer: warning: error during output to /dev/st0: Input/output error

It seems to me the scheduler is the root of the problem.

I'd like to test with upstream kernels but there are no server kernels. I'm running ubuntu 10.04 LTS amd64 server.
Which kernel should I choose?

Lars

Lars (lars-taeuber) wrote :

Hi,

I have had success setting the block I/O scheduler to something other than »deadline« for each block device in our RAID arrays.
This is with kernel: 2.6.32-24-server

I tried »noop« and it worked fine for me. I have not reproduced this yet, because each backup takes more than 12 hours. Today I'll test cfq and report.

You can read or set the scheduler for each block device through /sys/block/sd*/queue/scheduler.
Alternatively, you can set it globally with the boot option: »elevator=cfq« (or »noop« or »anticipatory«)
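
To check which scheduler each disk is currently using, here is a small Python sketch that reads those sysfs files. It relies on the kernel listing the available schedulers on one line with the active one in square brackets (for example "noop anticipatory [deadline] cfq"). The function names are only illustrative, and the "sd*" pattern assumes SCSI/SATA-style device names.

```python
from pathlib import Path


def parse_scheduler(text):
    """Return (active, available) from the contents of a queue/scheduler file.

    The active scheduler is the entry wrapped in square brackets.
    active is None if no entry is bracketed.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_schedulers(sys_block="/sys/block"):
    """Map device name (e.g. 'sda') to (active, available) schedulers."""
    result = {}
    for path in sorted(Path(sys_block).glob("sd*/queue/scheduler")):
        device = path.parent.parent.name
        result[device] = parse_scheduler(path.read_text())
    return result


if __name__ == "__main__":
    for device, (active, available) in read_schedulers().items():
        print(device, "active:", active, "available:", ", ".join(available))
```

Switching a device is done by writing one of the listed names into the same file as root; the sketch only reads, so it is safe to run on a production box.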

Good luck
Lars

tom (thomas-gutzler) wrote :

Lars,
by "have success setting" you mean "avoid the freezes"?
I'm also running 2.6.32-24-server (but with an adaptec hardware raid) and I haven't had a "task blocked for more than 120 seconds." message in my kern.log since Jul 8.
However, I think the freezes are still happening just not for 120 seconds, so no log is written. Will keep an eye open and try noop.
Cheers

Lars (lars-taeuber) wrote :

Hi Tom,

Yes, I haven't had these »freezes« with noop.

With our setup it was reproducible. After some (3-4) hours of heavy I/O load from the HDDs, our process gets blocked for more than 120s and then stops itself because it can no longer write to the tape drive.

With noop the backup (actually two of them in parallel) has run for 13 hours without any strange log entry.

Lars

Lars (lars-taeuber) wrote :

Hi everybody.

The test with cfq was successful too.
In our setup and for our purpose, noop was a bit faster (8:19 hours instead of 8:49 hours).

So with a server kernel, don't use deadline if you get blocking processes. This seems to be the solution.

Good luck.
Lars

Alvin (alvind) wrote :

In my case removing all LVM snapshots prevented the errors (and the enormous load). I left the scheduler at deadline.

Lars (lars-taeuber) wrote :

Hi!

Last night the backup was interrupted again.
Our backup process timed out while waiting to be scheduled:

[71881.750043] INFO: task mbuffer:8637 blocked for more than 120 seconds.
[71881.750531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[71881.751090] mbuffer1 D 0000000000000000 0 8637 4348 0x00000000
[71881.751094] ffff8801ef339c18 0000000000000086 0000000000015bc0 0000000000015bc0
[71881.751098] ffff880405c231a0 ffff8801ef339fd8 0000000000015bc0 ffff880405c22de0
[71881.751101] 0000000000015bc0 ffff8801ef339fd8 0000000000015bc0 ffff880405c231a0
[71881.751104] Call Trace:
[71881.751115] [<ffffffff81558e2d>] schedule_timeout+0x22d/0x300
[71881.751120] [<ffffffff812b5377>] ? kobject_put+0x27/0x60
[71881.751123] [<ffffffff8155a985>] ? _spin_lock_irq+0x15/0x20
[71881.751128] [<ffffffff8138b63a>] ? scsi_request_fn+0xda/0x5e0
[71881.751131] [<ffffffff815580d6>] wait_for_common+0xd6/0x180
[...]

This was with »cfq«. It happened when the backup was about 80% done.
I will switch back to »noop« and report again.

Lars

PS: We have no snapshots. There is no lvm device at all on this system. (only SW-RAID6)

Lars (lars-taeuber) wrote :

Hi,

»noop« worked for the last two days now. It seems stable.

Lars

Lars (lars-taeuber) wrote :

Hi,

Bad news.
The backup stopped due to a timeout.
No task was blocked for 120s, but there is still some blocking.
The probability of it happening seems reduced when using noop as the I/O scheduler.

Is there someone tracking this down? I'd like to help to get this debugged and resolved.

Best regards
Lars

Jim Tarvid (tarvid) wrote :

Happening to me too. Several times on the 11th and 12th. May be triggered by load from outside. Looks like a synflood after an incident yesterday. Added mod_evasive, blocked a few bad bots, Reduced maxclients to 50 (prefork). No events since last night. https://www.bijk.com/p/2199b5ea shows yesterday's outage which will roll off tomorrow. I can grant access to someone who wants a closer look.

In my case, I suspect the events are associated with external web server load and are accompanied by dysfunction. This is my only server which gets significant traffic. The others are operating peacefully.

LAMP is my bread and butter. This is the "proof of the blood".

Lars (lars-taeuber) wrote :

I suspect the probability of it occurring depends on the CPU load.
On my server there is (nearly) no network traffic except two mostly idle ssh sessions.

Lars

Lars (lars-taeuber) wrote :

Hi!

The medium importance might be correct for desktops. Desktops commonly start once per day.

But for servers this is most important, because they don't work as expected.
Is there someone working on this?

Where is the focus for ubuntu: servers or desktops?

I'd like to help. But someone has to give instructions. My servers are updated daily.

Lars

Lars (lars-taeuber) wrote :

Hi,

The kernel update from last week has made our troubles worse.
I have not managed to complete a single backup without the 120s blocking since the update,
although I didn't change anything else. I am still using noop as the I/O scheduler.

I'll open a new bug report if no one responds here.
Sorry, but this is most important to us.

I don't think there is anything special with our hardware or setup:

- dual Quad-Core AMD Opteron(tm) Processor 2378 (8 cores)
- 32GB Ram
- SAS: 08:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
           09:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
- SCSI: 02:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev c1)

- each SAS serves 24 SATA HDDs: WDC WD3000HLFS-01G6U1 (Raptor)
- SCSI serves an HP tape library with two LTO-3 Ultrium 3-SCSI drives

best regards
Lars

Leon (leonbo) wrote :

Any news on this issue? I'm getting the same error messages when using tar. Two hours ago I was on 8.04 with no problems.

Lars (lars-taeuber) wrote :

Hi there!

News!

We have had no failures for 20 days!
This seems to be because we reduced the filesystem usage from 99% down to 70%. This is an XFS filesystem:

From:
Filesystem Size Used Avail Use% Mounted on
/dev/md3 6.1T 6.0T 65G 99% /backup1
/dev/md2 6.1T 5.8T 288G 96% /backup2

To:
Filesystem Size Used Avail Use% Mounted on
/dev/md3 6.1T 4.2T 1.9T 70% /backup1
/dev/md2 6.1T 1.9T 4.2T 32% /backup2
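
If others want to check whether fill level plays a role on their systems, this is a rough Python sketch that flags nearly full filesystems from pasted `df -h` output like the listings above. The 90% threshold and the function name are arbitrary assumptions for illustration, not a known limit for XFS or any other filesystem.

```python
def nearly_full(df_output, threshold=90):
    """Return [(mount point, use %)] for rows at or above threshold.

    Expects the usual df columns:
    Filesystem Size Used Avail Use% Mounted on
    """
    hits = []
    for line in df_output.strip().splitlines():
        fields = line.split()
        # Skip the header and any line that is not a data row.
        if len(fields) < 6:
            continue
        use = fields[4]
        if not (use.endswith("%") and use[:-1].isdigit()):
            continue
        percent = int(use[:-1])
        if percent >= threshold:
            # Re-join the tail in case the mount point contains spaces.
            hits.append((" ".join(fields[5:]), percent))
    return hits
```

Feeding it the two listings from this comment should flag both backup volumes before the cleanup and none afterwards.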

What about the others? It looks XFS related.

Regards.
Lars

Charlie Kravetz (cjkgeek) wrote :

If there are no server kernels to test, there should be enough information for the developers to work towards resolving this issue. For all reporters, it would be good to file individual bugs for each system with issues. The bugs can be filed against "linux" as the package, using "ubuntu-bug linux". Please add this bug 494476 to the comments for reference. The individual bugs will give the developers the needed log files and allow them to decide if all of the fixes will be the same.

Changed in linux (Ubuntu):
assignee: Surbhi Palande (csurbhi) → nobody
status: Incomplete → Triaged
Lars (lars-taeuber) wrote :

Bad news.

After the last update it happened again.

I filed a new bug report: bug 652812

Lars

Leon (leonbo) wrote :

My problem was a broken second disk in a raid 1 array. No problems anymore!

Andrew Schulman (andrex) wrote :

This is a very serious problem. It's causing unpredictable lockups on my server every 2-3 days, requiring a force-reboot. There are many related reports from other users: e.g. #588046, #667656, #628530, and my particular one, #684654.

This bug is not:

* Just on server kernels. My kernel is the latest -generic.
* File-system specific. Here it's on ext4, mine is on reiserfs, others have reported it on xfs.
* A hardware problem, comment #29 notwithstanding. My hardware shows no signs of failure, and too many people are reporting it for it to be caused by simultaneous hardware problems.

I hope that someone is going to work on and fix this very soon.
Andrew.

Lars (lars-taeuber) wrote :

Hi!

I vote for a much higher priority too! It's server-related and therefore really important.

This problem seems to be caused by high I/O load. It may be related only to filesystem I/O, or to IRQs/sec, or to pure SCSI I/O.

I'd be happy to test patches to debug and track this down.

Lars

Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix