INFO: task dpkg:23317 blocked for more than 120 seconds.

Bug #624877 reported by Thomas
This bug affects 91 people

Affects          Status        Importance  Assigned to  Milestone
Linux            Expired       Medium
dpkg (Debian)    Fix Released  Unknown
dpkg (Ubuntu)    Fix Released  Undecided   Unassigned
  Lucid          Fix Released  Undecided   Unassigned
linux (Ubuntu)   Invalid       Undecided   Unassigned

Bug Description

[Impact]

[Fix]

[Test Case]

[Regression Potential]

[Original Report]
I tried today to update my system with "aptitude update && aptitude dist-upgrade -y".

Every time, it gets stuck on:

Preparing to replace language-pack-en-base 1:10.04+20100422 (using .../language-pack-en-base_1%3a10.04+20100714_all.deb) ...
Unpacking replacement language-pack-en-base ...

When I try to kill the task with "kill -9 9440", I still have no success.

        ├─screen(22470)─┬─bash(22471)───aptitude(9407)─┬─dpkg(9440)
        │               │                              └─{aptitude}(9408)
        │               └─bash(22500)───pstree(9460)

Only when I kill 9408 can I interrupt the command.

My dmesg is full of curious messages (see the file I attached).

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-generic 2.6.32.24.25
Regression: Yes
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-24.39-generic 2.6.32.15+drm33.5
Uname: Linux 2.6.32-24-generic x86_64
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
Date: Thu Aug 26 20:54:17 2010
MachineType: MSI MS-7522
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-24-generic root=UUID=a2eace54-399d-4efe-bbf2-c76c44d2b6ea ro iommu=soft vga=0x317 nomce quiet splash
ProcEnviron:
 LANG=de_DE.UTF-8
 SHELL=/bin/bash
SourcePackage: linux
dmi.bios.date: 01/07/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: V8.8
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: MSI X58 Pro-E (MS-7522)
dmi.board.vendor: MSI
dmi.board.version: 3.0
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: MICRO-STAR INTERNATIONAL CO.,LTD
dmi.chassis.version: 3.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrV8.8:bd01/07/2010:svnMSI:pnMS-7522:pvr3.0:rvnMSI:rnMSIX58Pro-E(MS-7522):rvr3.0:cvnMICRO-STARINTERNATIONALCO.,LTD:ct3:cvr3.0:
dmi.product.name: MS-7522
dmi.product.version: 3.0
dmi.sys.vendor: MSI

Thomas (t.c) wrote :

Any update on this?

I can't install or upgrade my system!

[ 1321.499551] INFO: task dpkg:3764 blocked for more than 120 seconds.
[ 1321.499822] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1321.526755] dpkg D 00000000ffffffff 0 3764 3719 0x00000000
[ 1321.526761] ffff8802084fbdb8 0000000000000082 0000000000015bc0 0000000000015bc0
[ 1321.526766] ffff88020860b1a0 ffff8802084fbfd8 0000000000015bc0 ffff88020860ade0
[ 1321.526770] 0000000000015bc0 ffff8802084fbfd8 0000000000015bc0 ffff88020860b1a0
[ 1321.526780] Call Trace:
[ 1321.526791] [<ffffffff811664b0>] ? bdi_sched_wait+0x0/0x20
[ 1321.526799] [<ffffffff811664be>] bdi_sched_wait+0xe/0x20
[ 1321.526809] [<ffffffff815429ef>] __wait_on_bit+0x5f/0x90
[ 1321.526816] [<ffffffff811664b0>] ? bdi_sched_wait+0x0/0x20
[ 1321.526824] [<ffffffff81542a98>] out_of_line_wait_on_bit+0x78/0x90
[ 1321.526834] [<ffffffff81085470>] ? wake_bit_function+0x0/0x40
[ 1321.526842] [<ffffffff81166474>] ? bdi_queue_work+0xa4/0xe0
[ 1321.526849] [<ffffffff8116782f>] bdi_sync_writeback+0x6f/0x80
[ 1321.526857] [<ffffffff81167860>] sync_inodes_sb+0x20/0x30
[ 1321.526865] [<ffffffff8116b332>] __sync_filesystem+0x82/0x90
[ 1321.526873] [<ffffffff8116b419>] sync_filesystems+0xd9/0x130
[ 1321.526881] [<ffffffff8116b4d1>] sys_sync+0x21/0x40
[ 1321.526890] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

That shows up in my log file every time!
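The first line of each report comes from the kernel's hung-task detector, which flags a task that has been stuck in uninterruptible (D) sleep without being scheduled for longer than hung_task_timeout_secs (120 s here). When a log holds many such reports, a minimal sketch like the one below can pull out which tasks and PIDs were reported; the regular expression and the helper names `parse_hung_task` and `hung_tasks` are illustrative assumptions, not part of any existing tool.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Matches the hung-task summary line, e.g.
# "[ 1321.499551] INFO: task dpkg:3764 blocked for more than 120 seconds."
# re.search tolerates syslog prefixes such as "Nov 12 16:48:16 evo1 kernel:".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


class HungTask(NamedTuple):
    comm: str     # task (command) name as printed by the kernel
    pid: int      # process ID of the blocked task
    seconds: int  # hung_task_timeout_secs value at the time of the report


def parse_hung_task(line: str) -> Optional[HungTask]:
    """Return name, PID and timeout from a hung-task line, or None if it is not one."""
    m = HUNG_TASK_RE.search(line)
    if m is None:
        return None
    return HungTask(m.group("comm"), int(m.group("pid")), int(m.group("secs")))


def hung_tasks(lines) -> List[Tuple[str, int]]:
    """Collect each distinct (name, PID) pair reported as hung, in order of first report."""
    seen: List[Tuple[str, int]] = []
    for line in lines:
        task = parse_hung_task(line)
        if task is not None and (task.comm, task.pid) not in seen:
            seen.append((task.comm, task.pid))
    return seen
```

For example, `hung_tasks(open("/var/log/kern.log"))` lists each stuck task once, even though the kernel repeats the report every 120 seconds.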

Thomas (t.c) wrote :

The only way to "kill" the dpkg process is a reboot.
But afterwards, the behaviour is the same...

Reading package lists... Done
Building dependency tree
Reading state information... Done
Reading extended state information
Initializing package states... Done
Writing extended state information... Done
The following NEW packages will be installed:
  bc{a}
The following packages will be upgraded:
  binutils collectd collectd-core ifupdown landscape-common libfreetype6 libfreetype6-dev libldap-2.4-2 libudev0 libvirt0 libwww-perl linux-image-2.6.32-24-generic linux-libc-dev mountall
  python-apt python-lazr.restfulclient sudo tzdata udev update-manager-core update-manager-text upstart w3m wget
24 packages upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 0B/41.4MB of archives. After unpacking 356kB will be used.
Do you want to continue? [Y/n/?] y
Writing extended state information... Done
Preconfiguring packages ...
(Reading database ... 61712 files and directories currently installed.)
Preparing to replace linux-image-2.6.32-24-generic 2.6.32-24.39 (using .../linux-image-2.6.32-24-generic_2.6.32-24.42_amd64.deb) ...
Done.
Unpacking replacement linux-image-2.6.32-24-generic ...

And there it hangs!

Abe Froeman (abe-froeman) wrote :

After upgrading from 9.10 to 10.04, I now get this behavior when trying to apt-get install.

2.6.32-24-generic-pae i686

ps -ux shows /usr/bin/dpkg with a state of Ds

Is this caused by disk locking?
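The "D" in that state column means uninterruptible sleep (the trailing "s" only marks a session leader), which is why kill -9 has no effect. As a sketch, assuming a Linux /proc filesystem, the same state can be read from /proc/<pid>/stat; `parse_stat` and `uninterruptible_procs` are hypothetical helper names.

```python
import os
from typing import Iterator, Tuple


def parse_stat(stat_line: str) -> Tuple[int, str, str]:
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the first " (" and the *last* ')'.
    """
    pid_part, _, rest = stat_line.partition(" (")
    comm, _, after = rest.rpartition(")")
    state = after.split()[0]
    return int(pid_part), comm, state


def uninterruptible_procs(proc: str = "/proc") -> Iterator[Tuple[int, str]]:
    """Yield (pid, comm) for every process currently in state 'D'."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip /proc/self, /proc/meminfo, ...
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            yield pid, comm
```

Run while dpkg is hung, `list(uninterruptible_procs())` would be expected to include the dpkg PID, often alongside other tasks waiting on the same I/O.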

Abe Froeman (abe-froeman) wrote :

I've tried to get around this first by doing:
rm /var/lib/dpkg/lock
rm /var/cache/apt/archives/lock
dpkg --configure -a
apt-get install -f

But what I installed previously is now held in an inconsistent state: it can't be apt-get purge'd and it can't be properly installed. So I ran:
dpkg --remove --force-remove-reinstreq <packagename>

Then I tried to apt-get install something else instead, but every time it gets to unpacking, it locks up again.

Michael Palmer (mp5-mepalmer) wrote :

Possible hint: I also got this stuck dpkg process when running "apt-get install" inside of "screen". I could not kill it with kill -9. It also got stuck on the unpacking step.

(I don't think it matters but I was installing the package "m4".)

After rebooting, I ran the same install without "screen" and it worked.

Michael Palmer (mp5-mepalmer) wrote :

P.S. I take it back, I don't think screen is involved... just did another install (not inside of screen) and it hangs.

$ sudo apt-get install texinfo
Reading package lists... Done
Building dependency tree
Reading state information... Done
Suggested packages:
  texlive-latex-base texlive-generic-recommended texinfo-doc-nonfree
The following NEW packages will be installed:
  texinfo
0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
Need to get 553kB of archives.
After this operation, 2,666kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu/ lucid/main texinfo 4.13a.dfsg.1-5ubuntu1 [553kB]
Fetched 553kB in 1s (290kB/s)
Selecting previously deselected package texinfo.
(Reading database ... 89238 files and directories currently installed.)
Unpacking texinfo (from .../texinfo_4.13a.dfsg.1-5ubuntu1_amd64.deb) ...

[hangs here]

Michael Palmer (mp5-mepalmer) wrote :

P.P.S. in /var/log/kern.log

Nov 12 16:48:16 evo1 kernel: [ 4316.056166] INFO: task dpkg:11443 blocked for more than 120 seconds.
Nov 12 16:48:16 evo1 kernel: [ 4316.056454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 16:48:16 evo1 kernel: [ 4316.056774] dpkg D 0000000000000000 0 11443 11417 0x00000000
Nov 12 16:48:16 evo1 kernel: [ 4316.056780] ffff880330c1fd38 0000000000000082 0000000000015bc0 0000000000015bc0
Nov 12 16:48:16 evo1 kernel: [ 4316.056785] ffff88032ea51ab0 ffff880330c1ffd8 0000000000015bc0 ffff88032ea516f0
Nov 12 16:48:16 evo1 kernel: [ 4316.056789] 0000000000015bc0 ffff880330c1ffd8 0000000000015bc0 ffff88032ea51ab0
Nov 12 16:48:16 evo1 kernel: [ 4316.056793] Call Trace:
Nov 12 16:48:16 evo1 kernel: [ 4316.056804] [<ffffffff81541b6d>] schedule_timeout+0x22d/0x300
Nov 12 16:48:16 evo1 kernel: [ 4316.056811] [<ffffffff8105da22>] ? enqueue_entity+0x122/0x1a0
Nov 12 16:48:16 evo1 kernel: [ 4316.056815] [<ffffffff81061e14>] ? check_preempt_wakeup+0x1c4/0x3c0
Nov 12 16:48:16 evo1 kernel: [ 4316.056819] [<ffffffff8154178b>] wait_for_common+0xdb/0x180
Nov 12 16:48:16 evo1 kernel: [ 4316.056824] [<ffffffff8105a124>] ? try_to_wake_up+0x284/0x380
Nov 12 16:48:16 evo1 kernel: [ 4316.056828] [<ffffffff8105a220>] ? default_wake_function+0x0/0x20
Nov 12 16:48:16 evo1 kernel: [ 4316.056832] [<ffffffff815418ed>] wait_for_completion+0x1d/0x20
Nov 12 16:48:16 evo1 kernel: [ 4316.056837] [<ffffffff81165e17>] sync_inodes_sb+0x87/0xb0
Nov 12 16:48:16 evo1 kernel: [ 4316.056842] [<ffffffff8116a6a2>] __sync_filesystem+0x82/0x90
Nov 12 16:48:16 evo1 kernel: [ 4316.056845] [<ffffffff8116a789>] sync_filesystems+0xd9/0x130
Nov 12 16:48:16 evo1 kernel: [ 4316.056848] [<ffffffff8116a841>] sys_sync+0x21/0x40
Nov 12 16:48:16 evo1 kernel: [ 4316.056854] [<ffffffff810121b2>] system_call_fastpath+0x16/0x1b
Nov 12 16:50:16 evo1 kernel: [ 4435.862222] INFO: task dpkg:11443 blocked for more than 120 seconds.
Nov 12 16:50:16 evo1 kernel: [ 4435.862468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 12 16:50:16 evo1 kernel: [ 4435.862783] dpkg D 0000000000000000 0 11443 11417 0x00000000
Nov 12 16:50:16 evo1 kernel: [ 4435.862787] ffff880330c1fd38 0000000000000082 0000000000015bc0 0000000000015bc0
Nov 12 16:50:16 evo1 kernel: [ 4435.862790] ffff88032ea51ab0 ffff880330c1ffd8 0000000000015bc0 ffff88032ea516f0
Nov 12 16:50:16 evo1 kernel: [ 4435.862793] 0000000000015bc0 ffff880330c1ffd8 0000000000015bc0 ffff88032ea51ab0
Nov 12 16:50:16 evo1 kernel: [ 4435.862796] Call Trace:
Nov 12 16:50:16 evo1 kernel: [ 4435.862807] [<ffffffff81541b6d>] schedule_timeout+0x22d/0x300
Nov 12 16:50:16 evo1 kernel: [ 4435.862812] [<ffffffff8105da22>] ? enqueue_entity+0x122/0x1a0
Nov 12 16:50:16 evo1 kernel: [ 4435.862815] [<ffffffff81061e14>] ? check_preempt_wakeup+0x1c4/0x3c0
Nov 12 16:50:16 evo1 kernel: [ 4435.862818] [<ffffffff8154178b>] wait_for_common+0xdb/0x180
Nov 12 16:50:16 evo1 kernel: [ 4435.862822] [<ffffffff8105a124>] ? try_to_wake_up+0x284/0x380
Nov 12 16:50:16 evo1 kernel: [ 4435.862825] [<ffffffff8105a220>] ? default_wake_function+0x0/0x20
Nov 12 16:50:16 ...

Brad Figg (brad-figg)
tags: added: acpi-table-checksum
Changed in linux:
status: Unknown → Confirmed
Filiprino (filiprino) wrote :

There are some duplicates of this bug and the problem is in a sync call. Instead of executing an upgrade or installing new packages, try to do a sync.
user@computer:~$ sync;
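Because sync() can itself block for a long time under heavy I/O (and the blocked call cannot be interrupted), one cautious way to follow this suggestion from a script is to start the sync in a background thread and only wait for it up to a deadline. A minimal sketch, assuming Python 3.3+ on Linux (for os.sync()); `sync_with_timeout` is a hypothetical helper, and the sync keeps running in the kernel even after the function stops waiting.

```python
import os
import threading
from typing import Callable


def sync_with_timeout(timeout: float, sync_fn: Callable[[], None] = os.sync) -> bool:
    """Start sync_fn() in a background thread; return True if it finished in time.

    A thread blocked inside sync() cannot be cancelled, so on timeout we only
    stop waiting. The thread is a daemon so that Python's own shutdown does
    not try to join it; the kernel-side sync carries on regardless.
    """
    done = threading.Event()

    def worker() -> None:
        sync_fn()
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout)
```

For example, if `sync_with_timeout(60)` returns False, the system is still flushing dirty data and an upgrade started now would likely hang the same way.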

Filiprino (filiprino) wrote :

For instance, a duplicate bug is https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/606341
And an interesting comment of that bug is https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/606341/comments/10 and subsequent ones.

The bug seems related to heavy disk activity.

Aaron Kulick (aaron-kulick) wrote :

I suspect this issue is related to https://bugzilla.kernel.org/show_bug.cgi?id=15426. Should a patch or updated kernel be released, this issue should resolve itself. However, this will require that the patch be included in a subsequent kernel or backported, and then packaged and deployed. I propose that this issue be escalated and its importance raised, given the potential for failures under high I/O load.

Grant Slater (firefishy) wrote :

Setting the I/O scheduler to cfq seems to have resolved (or at least worked around) the issue for me.

Kernel parameter: elevator=cfq
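elevator=cfq only sets the default scheduler at boot. On kernels of this era, the scheduler in use for each disk is also visible in /sys/block/<dev>/queue/scheduler, which lists every available scheduler and puts the active one in square brackets (for example `noop deadline [cfq]`); as root, writing a name such as cfq to that file switches it at runtime. A small parsing sketch, with `parse_scheduler` and `active_scheduler` as hypothetical helper names and `sda` only an example device:

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple


def parse_scheduler(text: str) -> Tuple[Optional[str], List[str]]:
    """Parse the contents of /sys/block/<dev>/queue/scheduler.

    Example input: "noop deadline [cfq]". Every available scheduler is
    listed; the active one is wrapped in square brackets.
    """
    active = None
    available = []
    for name in text.split():
        m = re.fullmatch(r"\[(.+)\]", name)
        if m:
            name = m.group(1)
            active = name
        available.append(name)
    return active, available


def active_scheduler(dev: str, sysfs: str = "/sys/block") -> Optional[str]:
    """Return the active I/O scheduler for a block device such as "sda"."""
    text = (Path(sysfs) / dev / "queue" / "scheduler").read_text()
    return parse_scheduler(text)[0]
```

On the 2.6.32 kernels discussed below, whose configs set CONFIG_DEFAULT_IOSCHED="deadline", `active_scheduler("sda")` would be expected to report deadline unless elevator= overrides it.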

Filiprino (filiprino) wrote :

I thought that the Completely Fair Queuing (CFQ) scheduler was the default one.

Grant Slater (firefishy) wrote :

@Filiprino

/boot/config-2.6.32-2 : CONFIG_DEFAULT_IOSCHED="deadline"

Grant Slater (firefishy) wrote :

@Filiprino

/boot/config-2.6.32-28-server : CONFIG_DEFAULT_IOSCHED="deadline"

Changed in linux:
importance: Unknown → Medium
Alejandro R. Sedeño (asedeno) wrote :

A friend of mine just ran into this and we spent some time debugging it.

Long story short, I stumbled onto an lkml thread that seemed related [1].

Something tytso said in his first reply [2], and the fact that the root fs was ext4, led me to try:

echo 4 > /sys/fs/ext4/sda1/max_writeback_mb_bump

and suddenly the dpkg that had been stuck for at least 30 minutes started doing things again. I wish we had figured it out before giving up on the dpkg that had been stuck since January and forcing a reboot.

[1] http://thread.gmane.org/gmane.linux.kernel/949268/
[2] "So I added a forced override for ext4, which now writes 128MB at a time --- with a sysfs tuning knob that allow the old behaviour to be restored if users really complained."
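The command above only touches sda1. A sketch of applying the same setting to every ext4 filesystem that exposes the knob, assuming a 2.6.3x kernel where /sys/fs/ext4/<dev>/max_writeback_mb_bump exists (its default is the 128MB mentioned in the quote) and that it runs as root; `set_writeback_bump` is a hypothetical helper, and the base directory is a parameter only so that it can be pointed at a test tree:

```python
import glob
import os
from typing import List


def set_writeback_bump(value: int, sysfs: str = "/sys/fs/ext4") -> List[str]:
    """Write `value` to every ext4 max_writeback_mb_bump knob under `sysfs`.

    Equivalent to `echo <value> > /sys/fs/ext4/<dev>/max_writeback_mb_bump`
    for each ext4 device listed there. Returns the paths that were written.
    """
    written = []
    for path in sorted(glob.glob(os.path.join(sysfs, "*", "max_writeback_mb_bump"))):
        with open(path, "w") as f:
            f.write(f"{value}\n")
        written.append(path)
    return written
```

`set_writeback_bump(4)` reproduces the workaround above for all ext4 devices at once; writing 128 back would restore the default.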

Theodore Ts'o (tytso) wrote :

Does the problem go away in 2.6.37? There's an ext4 commit that may be relevant, which hit mainline in 2.6.37:

5b41d92437 ext4: implement writeback livelock avoidance using page tagging

The fact that changing the IO scheduler from "deadline" to "cfq" helps is also interesting, given that most kernel developers out there use the default IO scheduler, cfq. It's not clear to me why this would make a difference, nor why changing max_writeback_mb_bump would unjam things. But that suggests to me that it may be a case of livelock, which is why the above-mentioned commit in 2.6.37 might make a difference.

For people who want to try cherry-picking this commit back to 2.6.32, you'll also need commit f446daaea by Jan Kara.

astrostl (astrostl) wrote :

I also seem to be experiencing this on high-traffic Ubuntu 10.04.2 LTS / 2.6.35-23-server / x64 / NFS/SAN-connected systems.

No ext4. From one host's mounts: binfmt_misc, debugfs, devpts, devtmpfs, ext3, fusectl, nfs, proc, securityfs, sysfs, tmpfs.

It appears to be strongly correlated with heavy external I/O load; during this time, 'sync' will also hang indefinitely.

E Carter (ecarter-openplans) wrote :

I am seeing this on a busy server as well. I am running Ubuntu 10.04 in an OpenVZ container; ext3 is the underlying file system, and the host server itself is quite busy.

Dpkg hangs for at least a good hour on:

/usr/bin/dpkg --status-fd 14 --unpack --auto-deconfigure /var/cache/apt/archives/ifupdown_0.6.8ubuntu29.2_amd64.deb

pvledoux (pvledoux) wrote :

Same problem here on Ubuntu 10.04 64-bit:
#uname -a
Linux ipf-srv001.ipf.local 2.6.32-28-server #55-Ubuntu SMP Mon Jan 10 23:57:16 UTC 2011 x86_64 GNU/Linux

with 1 external USB disk attached (ext4).

dpkg is stuck on 'unpacking', and the only way to kill it is to rm /var/lib/dpkg/lock.

Any solution (except unmounting the disk)?

Thx

Yang (yaaang) wrote :

I don't think this has to do with heavy disk activity - I had nothing intensive running, and still don't. dpkg itself is certainly generating zero activity.

This has happened to me twice now, on my Ubuntu 10.04 64-bit desktop. Not only can't I sudo pkill -9 dpkg, I can't shut down the machine; I have to power cycle the box. For me, the dpkg command is:

/usr/bin/dpkg --status-fd 46 --unpack --auto-deconfigure /var/cache/apt/archives/google-chrome-unstable_11.0.672.2-r75134_amd64.deb

Things get stuck on:

$ sudo aptitude full-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Reading extended state information
Initializing package states... Done
The following packages will be upgraded:
  google-chrome-unstable
1 packages upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 0B/21.5MB of archives. After unpacking 324kB will be used.
Do you want to continue? [Y/n/?]
Writing extended state information... Done
(Reading database ... 364642 files and directories currently installed.)
Preparing to replace google-chrome-unstable 10.0.648.82-r75062 (using .../google-chrome-unstable_11.0.672.2-r75134_amd64.deb) ...
Unpacking replacement google-chrome-unstable ...

From my dmesg:

[196682.170113] INFO: task dpkg:15510 blocked for more than 120 seconds.
[196682.170119] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[196682.170123] dpkg D 0000000000000000 0 15510 15496 0x00000000
[196682.170132] ffff8800261d7e30 0000000000000082 0000000000015bc0 0000000000015bc0
[196682.170140] ffff880067f483b8 ffff8800261d7fd8 0000000000015bc0 ffff880067f48000
[196682.170147] 0000000000015bc0 ffff8800261d7fd8 0000000000015bc0 ffff880067f483b8
[196682.170154] Call Trace:
[196682.170168] [<ffffffff81544a55>] rwsem_down_failed_common+0x95/0x1f0
[196682.170174] [<ffffffff81544c06>] rwsem_down_read_failed+0x26/0x30
[196682.170182] [<ffffffff812bdc34>] call_rwsem_down_read_failed+0x14/0x30
[196682.170188] [<ffffffff81543ec4>] ? down_read+0x24/0x30
[196682.170194] [<ffffffff8116b117>] sync_filesystems+0xb7/0x130
[196682.170200] [<ffffffff8116b1e7>] sys_sync+0x17/0x40
[196682.170207] [<ffffffff810121b2>] system_call_fastpath+0x16/0x1b
[196802.170737] INFO: task dpkg:15510 blocked for more than 120 seconds.
[196802.170743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[196802.170747] dpkg D 0000000000000000 0 15510 15496 0x00000000
[196802.170755] ffff8800261d7e30 0000000000000082 0000000000015bc0 0000000000015bc0
[196802.170763] ffff880067f483b8 ffff8800261d7fd8 0000000000015bc0 ffff880067f48000
[196802.170770] 0000000000015bc0 ffff8800261d7fd8 0000000000015bc0 ffff880067f483b8
[196802.170777] Call Trace:
[196802.170791] [<ffffffff81544a55>] rwsem_down_failed_common+0x95/0x1f0
[196802.170798] [<ffffffff81544c06>] rwsem_down_read_failed+0x26/0x30
[196802.170806] [<ffffffff812bdc34>] call_rwsem_down_read_failed+0x14/0x30
[196802.170812] [<ffffffff81543ec4>] ? down_read+0x24/0x30
[196802.170819] [<ffffffff8116b117>] sync_filesystems+0xb7/0x130
[196802.170824] [<ffffffff8116b1e7>] sys_sync+0x17/0x40
[196802.170831] [...

nacitar sevaht (nacitar) wrote :

I was using apt-mirror to create an Ubuntu mirror and decided to install "bmon", and this happened to me. Disk activity seems related, but that could be a coincidence.

Rich Wohlstadter (rwohlsta) wrote :

We are also being affected by this bug. We roll packages out to our cluster, which is always under heavy I/O pressure from NFS-mounted volumes, and the dpkg installs hang for long periods of time due to this sync issue. I hope this can get fixed, since we are relying more on dpkg for updates to our cluster.

Robin Duckett (robin-duckett) wrote :

Same problem, Ubuntu 10.10 Maverick, x86.

100% repeatable steps:

Plug in an external drive (in my case, I activated USB storage on my Nexus One, plugged it in, and opened the drive with Nautilus; I suspect there was no file transfer other than however Nautilus queries the drive)
Attempt to install something with apt
Watch as dpkg's process enters state D (uninterruptible sleep)
Reboot in order to use apt again: the process will not quit even if you unmount or unplug the device, because dpkg keeps its locks and apt refuses to run again until you reboot.

This is ridiculous; can this just be fixed?

astrostl (astrostl) wrote :

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584254

Rich W ran into a thread on dpkg sync behavior, which led to that bug. dpkg's fsync() call was changed to a sync() call, in order to improve ext4/btrfs performance. Those of us with busy, completely unrelated filesystems (such as heavy NFS traffic) know how long a 'sync' can take. dpkg version 1.15.8.6 or later has a --force-unsafe-io option to disable the sync() call, and it is alleged to still be safe on ext3 systems per http://lists.debian.org/debian-dpkg/2010/11/msg00034.html .

The 11.04 Natty package version is 1.16.0, and it installs on 10.04 LTS Lucid if the liblzma and xz-utils packages from Natty are also fetched; there is no need to update libc or anything else in the base system. The dpkg --version output of this is 1.15.8.10ubuntu1, not 1.16.0, but it DOES have the unsafe-io option. We will likely roll this out to all of our 10.04 LTS systems, and I will report back.
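The difference between the two calls is the crux here: fsync(fd) waits for one file, while sync() asks the kernel to flush dirty data on every mounted filesystem, so it can end up waiting on an unrelated busy NFS or USB mount. The sketch below illustrates both strategies in the write-to-a-staging-name-then-rename style that package managers use; it is not dpkg's code. `install_files` is a hypothetical helper, and the `.dpkg-new` suffix simply mirrors the staging names dpkg uses.

```python
import os
from typing import Dict


def install_files(files: Dict[str, bytes], use_global_sync: bool = False) -> None:
    """Write files under staging names, make them durable, then rename into place.

    use_global_sync=False: fsync() each file. This waits for that file (and,
        on some filesystems such as ext3, for a journal commit), not for
        unrelated mounts.
    use_global_sync=True: skip per-file fsync() and call sync() once. sync()
        flushes dirty data on *every* mounted filesystem, so it can block on
        an unrelated busy NFS or USB mount.
    """
    staged = []
    for path, data in files.items():
        tmp = path + ".dpkg-new"  # staging name, mirroring dpkg's convention
        with open(tmp, "wb") as f:
            f.write(data)
            if not use_global_sync:
                f.flush()             # move Python's buffer into the kernel
                os.fsync(f.fileno())  # wait for this file to reach the disk
        staged.append((path, tmp))
    if use_global_sync:
        os.sync()  # one system-wide flush instead of many per-file ones
    for path, tmp in staged:
        os.replace(tmp, path)  # atomic rename over any existing file
```

The single sync() is cheaper when there are many small files on an idle system, which is why it helped ext4/btrfs; the per-file fsync() keeps the wait bounded to the files actually being installed.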

astrostl (astrostl) wrote :

Minor update: dpkg-dev and libdpkg-perl from Natty are also needed if you have, say, build-essential installed.

Total package list for an (EXPERIMENTAL) update -

dpkg_1.16.0~ubuntu6_amd64.deb
dpkg-dev_1.16.0~ubuntu6_all.deb
libdpkg-perl_1.16.0~ubuntu6_all.deb
liblzma2_5.0.0-2_amd64.deb
xz-utils_5.0.0-2_amd64.deb

astrostl (astrostl) wrote :

"force-unsafe-io" can be added to /etc/dpkg/dpkg.cfg to make it a default (hopefully for apt invocations too)

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
A S (zephyr707) wrote :

Hello,
I am currently experiencing this issue and reading through the bug comments I am not quite sure how best to resolve it without damaging anything. It seems like this will be fixed in a future release (natty), but is there a resource or instructions available for troubleshooting and/or fixing this in 10.10? I'm not sure what the best course of action is for killing dpkg.
thanks for any help,
a

astrostl (astrostl) wrote :

FWIW, here is the command set for what I'm doing on my Lucid systems:

# make it so that lucid packages have default priority over natty ones
cat << EOF > /etc/apt/preferences.d/prefsources
Package: *
Pin: release a=natty
Pin-Priority: -10

Package: *
Pin: release a=lucid
Pin-Priority: 900
EOF
# create a natty sources list based on the existing lucid list
grep ^deb /etc/apt/sources.list|sed 's/lucid/natty/' > /etc/apt/sources.list.d/natty.list
# double the default apt cache limit to allow for more/larger repositories
echo 'APT::Cache-Limit "50331648";' > /etc/apt/apt.conf.d/98cache-limit
# issue an apt-get update to cache the new repository
apt-get update
aptitude -y install dpkg/natty
echo "force-unsafe-io" > /etc/dpkg/dpkg.cfg.d/force-unsafe-io

As for killing dpkg: it goes into the D state (uninterruptible sleep) and won't respond even to kill -9. To get rid of it, you have a couple of options: wait for all the filesystems to finish syncing (reducing filesystem activity will speed this up), or reboot the system.
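Waiting it out can be scripted. A minimal sketch, assuming Linux /proc: it polls the process state and returns once the task leaves the D state or exits, or gives up after a timeout. `read_state` and `wait_until_not_d` are hypothetical names, and `reader` is a parameter only to make the loop testable.

```python
import time
from typing import Callable, Optional


def read_state(pid: int) -> Optional[str]:
    """Return the one-letter state of `pid` from /proc, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except FileNotFoundError:
        return None
    # The state is the first field after the parenthesised command name.
    return stat.rpartition(")")[2].split()[0]


def wait_until_not_d(pid: int, timeout: float, interval: float = 1.0,
                     reader: Callable[[int], Optional[str]] = read_state) -> bool:
    """Poll until `pid` leaves uninterruptible sleep ('D') or exits.

    A pending SIGKILL is only acted on once the task leaves the uninterruptible
    wait, so this just waits. Returns False if the task is still in 'D' at the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        if reader(pid) != "D":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_until_not_d(9440, timeout=3600)` would wait up to an hour for the stuck dpkg from the original report to come back out of the kernel.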

astrostl (astrostl) wrote :

WARNING: those previous commands are for LUCID, and involve installing packages from NATTY. Buyer beware! Furthermore, if it's run on a non-LUCID system it will result in NATTY packages being installed by DEFAULT unless the Pin-Priority setting is changed from LUCID to whatever the system is actually running.

To downgrade after upgrading: rm /etc/dpkg/dpkg.cfg.d/force-unsafe-io; aptitude -y install dpkg/lucid

To undo repository/apt changes after downgrading: rm /etc/apt/preferences.d/prefsources /etc/apt/sources.list.d/natty.list /etc/apt/apt.conf.d/98cache-limit

A S (zephyr707) wrote :

astrostl,
Thanks for your instructions on dpkg. I'll wait a bit and then, I guess, do a hard reset if nothing happens; hopefully things won't be in a defunct state after the restart.
I'm using Maverick 10.10, not Lucid (actually, I wish I had gone with the LTS Lucid...). I think I understand how to apply your directions to that version, but for now I will just leave things alone and avoid doing anything intensive while running Synaptic. I'm not the most advanced Linux user, so I'd rather keep things as simple as possible so that I don't mess anything up.
thanks for the help!
a

Theodore Ts'o (tytso) wrote :

Note that dpkg 1.15.8.7 and later use sync_file_range() to help address the sync latency problem. So simply upgrading to dpkg 1.16.0 may be sufficient, without enabling the --force-unsafe-io option.
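sync_file_range() lets a program start writeback for a file early without waiting for it, so that a later fsync() has little left to do. The sketch below shows that general two-pass pattern (write everything and start writeback, then fsync and rename each file); it is an illustration under assumptions, not dpkg's implementation. It calls glibc's sync_file_range through ctypes, assumes Linux, uses the SYNC_FILE_RANGE_WRITE value 2 documented in sync_file_range(2), and falls back to plain fsync() where the call is unavailable. `install_files_two_pass` and `start_writeback` are hypothetical names.

```python
import ctypes
import os
from typing import Dict

# From sync_file_range(2): start writeback of dirty pages, do not wait for it.
SYNC_FILE_RANGE_WRITE = 2

try:
    _libc = ctypes.CDLL(None, use_errno=True)
    _sync_file_range = _libc.sync_file_range
    _sync_file_range.argtypes = [ctypes.c_int, ctypes.c_longlong,
                                 ctypes.c_longlong, ctypes.c_uint]
    _sync_file_range.restype = ctypes.c_int
except (OSError, AttributeError):
    _sync_file_range = None  # not Linux/glibc: rely on fsync() alone


def start_writeback(fd: int) -> None:
    """Hint the kernel to start writing fd out (offset 0, nbytes 0 = to end of file)."""
    if _sync_file_range is not None:
        _sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE)  # only a hint; result ignored


def install_files_two_pass(files: Dict[str, bytes]) -> None:
    staged = []
    # Pass 1: write each file under a staging name and start its writeback now.
    for path, data in files.items():
        tmp = path + ".dpkg-new"
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            start_writeback(fd)
        finally:
            os.close(fd)
        staged.append((path, tmp))
    # Pass 2: fsync each file (most data should already be on its way), then rename.
    for path, tmp in staged:
        fd = os.open(tmp, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        os.replace(tmp, path)
```

Unlike a global sync(), neither pass waits on other filesystems, which is why this approach can avoid the hangs without giving up durability.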

Yury V. Zaytsev (zyv) wrote :

Do I understand correctly that the only immediate workaround is to upgrade dpkg? The magic sysfs setting didn't unstick things for me. I didn't try sync, though.

If the only way to fix the bug is to upgrade dpkg, how about rolling out an updated dpkg for maverick?!

Thanks!

Leonor Palmeira (leonor-palmeira) wrote :

My system was busy with a very large 'rsync' backup process on an external drive and there was no possibility to 'aptitude safe-upgrade' anything (hanging at unpacking). I killed the 'rsync' process and the 'aptitude safe-upgrade' ran smoothly.

I tested around a bit and noticed that it is really the 'rsync' process that causes the hanging, and not the fact that the external drive is plugged in. As soon as I kill this process, any 'aptitude install' process will immediately finish smoothly even if it was previously hanging.

My 'dpkg' version is 1.15.5.6ubuntu2 (amd64) on Ubuntu 10.04 LTS.

As suggested by Theodore Ts'o, I upgraded to version 1.15.8.10 and this was enough to get rid of this bug (had to install 'xz-utils' for dependency issues).

I suggest updating 'dpkg' in Ubuntu 10.04 LTS to version 1.15.8.7 or later: this is the "Long Term Support" release, and a lot of people will be using it and expecting it to work 'out of the box'.

Mike Conigliaro (mconigliaro) wrote :

I believe I'm being affected by the same bug, though I have a different stacktrace in my dmesg log:

[3187763.844945] INFO: task dpkg:25373 blocked for more than 120 seconds.
[3187763.844955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[3187763.844960] dpkg D 0000000112ff6f6c 0 25373 25363 0x00000000
[3187763.844963] ffff880049f0de18 0000000000000286 0000000000000001 ffff880049f0dd98
[3187763.844966] ffff880049f0dd98 ffff880049f0dde0 ffff880044da89b0 ffff880049f0dfd8
[3187763.844968] ffff880044da8600 ffff880044da8600 ffff880044da8600 ffff880049f0dfd8
[3187763.844971] Call Trace:
[3187763.844979] [<ffffffff81204bcd>] log_wait_commit+0xcd/0x160
[3187763.844984] [<ffffffff81059fb0>] ? autoremove_wake_function+0x0/0x40
[3187763.844987] [<ffffffff81204c91>] ? log_start_commit+0x31/0x60
[3187763.844990] [<ffffffff811a4c3b>] ext3_sync_file+0xeb/0x100
[3187763.844994] [<ffffffff811157d9>] vfs_fsync_range+0x99/0xd0
[3187763.844996] [<ffffffff81115878>] vfs_fsync+0x18/0x20
[3187763.844998] [<ffffffff811158b9>] do_fsync+0x39/0x60
[3187763.845000] [<ffffffff8111590b>] sys_fsync+0xb/0x10
[3187763.845004] [<ffffffff81009ba8>] system_call_fastpath+0x16/0x1b
[3187763.845006] [<ffffffff81009b40>] ? system_call+0x0/0x52

dpkg doesn't hang forever for me. It usually just hangs for 15-20 minutes, then finishes like nothing ever happened. I'm not sure if it matters, but I'm on EC2.

Syd Seale (sydseale) wrote :

I've had this problem while installing packages to production webservers (no external drives or nfs volumes mounted). I just now noticed that, at least in this case, stopping apache for a moment allowed sync to finish.

tags: removed: regression-potential
astrostl (astrostl) wrote :

Amendment to above instructions (free advice warnings abound :-) )

Using the following:

Package: *
Pin: release n=natty
Pin-Priority: -500

NOTE: n= instead of a=. a= does not capture updates or security, which caused confusion (and a potential problem).

I'm also placing it directly in /etc/apt/preferences rather than in preferences.d, as there is an aptitude bug at https://bugs.launchpad.net/ubuntu/+source/aptitude/+bug/508545 (not applied to Lucid LTS, the SERVER distro...) which causes it to not detect anything in preferences.d.

That should be all that's needed, though - Lucid stuff doesn't need to be bumped, Natty stuff (by name, not archive) just needs to be suppressed. It could technically be any number below 500 (the Lucid/distro default), but making it negative causes it to not update unless explicitly requested.
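The rule being relied on here, as described above and documented in apt_preferences(5) for negative priorities, is that apt does not pick a version whose pin priority is negative unless it is requested explicitly (as in `dpkg/natty`), and otherwise prefers the highest priority, then the newest version. The following is a deliberately simplified model of that choice for reasoning about pins like this one; it ignores installed versions, downgrade rules (priorities of 1000 and above) and Debian version-string comparison, compares versions as integer tuples, and uses hypothetical names (`Candidate`, `pick_candidate`).

```python
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    release: str              # e.g. "lucid" or "natty"
    version: Tuple[int, ...]  # simplification: real Debian versions need dpkg's rules
    priority: int             # the Pin-Priority apt assigned to this version


def pick_candidate(cands: List[Candidate],
                   explicit_release: Optional[str] = None) -> Optional[Candidate]:
    """Simplified model of how apt chooses which version to install.

    - pkg/<release> (explicit_release) restricts the choice to that release.
    - Otherwise negative-priority versions are never chosen.
    - Among the rest, the highest priority wins; ties go to the newer version.
    """
    if explicit_release is not None:
        pool = [c for c in cands if c.release == explicit_release]
    else:
        pool = [c for c in cands if c.priority >= 0]
    if not pool:
        return None
    return max(pool, key=lambda c: (c.priority, c.version))
```

With Lucid at the default 500 and Natty pinned to -500, the Lucid version is chosen unless `dpkg/natty` is requested; with both at 500, the newer Natty version would win, which is the situation the earlier warning describes.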

astrostl (astrostl) wrote :

Should note: we haven't had any issues since upgrading to dpkg/natty and adding unsafe-io.

Arie Skliarouk (skliarie) wrote :

On a busy LXC server (2.6.38-8-server, amd64) with elevator=cfq on the kernel command line and no dpkg running, it still got stuck with the same symptoms (see the attached screenshot).

It is notable that the error happened across all dm- devices.

I ran the following command, let's see whether that would help:
for tt in /sys/fs/ext4/*/max_writeback_mb_bump; do echo 4 > "$tt"; done

Marius B. Kotsbak (mariusko) wrote :

Could this bug be a duplicate of bug #570805?

Changed in dpkg (Debian):
status: Unknown → Fix Released
Samuel Quiring (sbq) wrote :

I just installed 10.04 AMD64 and took all the updates. I am getting the error described here (bug #624229). Software I am running requires 10.04, so upgrading is not an option. I've read all the comments here. I see it is marked as fixed, but I don't see what I should do to get the fix.

astrostl (astrostl) wrote :

As I no longer use Ubuntu, I don't know about applying the official fix. There are WORKAROUNDS in the thread, though.

Allan Beaufour (beaufour) wrote :

These workarounds are all pretty ugly. Is there really no official fix for this? It happens on Amazon EC2 instances with EBS volumes. I'd vote for a backport to 10.04...

Charlie Schluting ☃ (cschluti) wrote :

+1 - this is happening to me as well, on EBS root EC2 instances that do a *lot* of I/O. Please, this needs to be backported.

Alex Cunith Rutatinisibwa (alexrutta) wrote :

Please help. I am upgrading Ubuntu on my Intel learning series.

HorstBort (haukemuentinga) wrote :

I also experience this behaviour on 10.04 with kernel linux-image-generic-lts-backport-natty :

Linux quantus 2.6.38-13-generic #57~lucid1-Ubuntu SMP Tue Mar 6 20:05:46 UTC 2012 x86_64 GNU/Linux

Apparently, VirtualBox is the culprit on my system. I have virtual machines on an ext4 RAID1. The last time dpkg got stuck, I was able to track it down to VirtualBox keeping jbd2 busy. Once I stopped the VM, sync and dpkg finished normally.
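To track down which process is keeping the disk (and jbd2) busy, the per-process I/O counters in /proc/<pid>/io can help; they carry similar information to what iotop shows. A sketch, assuming a Linux kernel built with per-task I/O accounting and root access for reading other users' processes; `parse_proc_io` and `top_writers` are hypothetical helpers.

```python
import os
from typing import Dict, List, Tuple


def parse_proc_io(text: str) -> Dict[str, int]:
    """Parse /proc/<pid>/io, whose lines look like "write_bytes: 12345"."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def top_writers(proc: str = "/proc", n: int = 5) -> List[Tuple[int, int]]:
    """Return up to n (pid, write_bytes) pairs, largest cumulative writers first."""
    totals = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "io")) as f:
                io = parse_proc_io(f.read())
        except OSError:
            continue  # process exited, or we lack permission to read it
        totals.append((int(entry), io.get("write_bytes", 0)))
    totals.sort(key=lambda t: t[1], reverse=True)
    return totals[:n]
```

The counters are cumulative since each process started, so comparing two samples a few seconds apart gives a better picture of which process (a VirtualBox VM, rsync, apache, ...) is writing right now.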

Theodore Ts'o (tytso) wrote :

HorstBort --- could you try using VirtualBox on a more modern kernel (ideally v3.3) and see whether VirtualBox is still doing this? And if so, it would be great to strace VirtualBox so we can see what it is doing.

Thanks,

Andreas Raster (rakete) wrote :

I think I just had this problem while running Precise as a guest in VirtualBox (4.1.18 on a Precise host). Installation of the linux-headers-generic package would always hang on:

Unpacking replacement linux-headers-3.2.0-27 ...

Ctrl-C had no effect, but killall -9 dpkg would kill the process.

astrostl's suggestion to use:

echo "force-unsafe-io" > /etc/dpkg/dpkg.cfg.d/force-unsafe-io

seems to have fixed the issue, at least for the moment; I could install linux-headers-generic without problems afterwards.
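A minimal shell sketch of applying and later reverting astrostl's workaround. The drop-in path and the option name come from the comment above; the helper names `enable_unsafe_io`/`disable_unsafe_io` and the `CONF_DIR` override are hypothetical. Writing to /etc needs root, and note the trade-off: according to the dpkg(1) manual, this option skips the file system syncs dpkg performs before renaming unpacked files, so an abrupt crash during an install can leave zero-length files on some file systems.

```shell
#!/bin/sh
# Hypothetical helpers for toggling dpkg's force-unsafe-io workaround.
# CONF_DIR defaults to dpkg's drop-in directory; point it elsewhere for a dry run.
CONF_DIR="${CONF_DIR:-/etc/dpkg/dpkg.cfg.d}"

enable_unsafe_io() {
    # Lines in dpkg.cfg.d files are long options without the leading "--".
    mkdir -p "$CONF_DIR" &&
        printf '%s\n' "force-unsafe-io" > "$CONF_DIR/force-unsafe-io"
}

disable_unsafe_io() {
    # Removing the drop-in restores dpkg's default (safe) I/O behaviour.
    rm -f "$CONF_DIR/force-unsafe-io"
}
```

One way to use it is to call `enable_unsafe_io` as root just before a large upgrade and `disable_unsafe_io` once it has finished, so the unsafe mode is not left on permanently.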

Revision history for this message
Michael Jeanson (mjeanson) wrote :

This patch to Lucid's current dpkg package disables the use of sync() and backports the performance fix for fsync() on ext4 that was introduced in 1.15.8.7.

This should resolve the problem for people getting "Task blocked" errors without impacting the performance on ext4, it requires more testing, interested parties can get the package from my ppa :

https://launchpad.net/~mjeanson/+archive/ppa/+sourcepub/2598206/+listing-archive-extra

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "dpkg_1.15.5.6ubuntu4.5.debdiff" of this bug report has been identified as being a patch in the form of a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. In the event that this is in fact not a patch you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are member of the ubuntu-sponsors team please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch
Changed in linux:
status: Confirmed → Expired
no longer affects: linux (Ubuntu Lucid)
Changed in dpkg (Ubuntu):
status: New → Fix Released
Changed in dpkg (Ubuntu Lucid):
status: New → Triaged
Bryce Harrington (bryce)
description: updated
Revision history for this message
Alex Bennée (ajbennee) wrote :

I'm confused as to when this fix was released. I can't see any reference to this bug in the changelogs of either dpkg or the linux-image. I'm seeing it on customer 10.04 LTS machines running the latest packages, which makes me think the bug is still there.

Revision history for this message
Scott Moser (smoser) wrote :

Hi Michael,
  Thank you for your patch attached in comment 50 above. A bot identified this as a fix for lucid, and subscribed ubuntu-sponsors which put this bug on the sponsorship queue at [1].

  You stated:
| This should resolve the problem for people getting "Task blocked" errors
| without impacting the performance on ext4, it requires more testing,
| interested parties can get the package from my ppa :

   Because of "it requires more testing", I've removed ubuntu-sponsors, which means it will be dropped from that queue. At some point in the future, if you believe it has sufficient testing and would like to get this SRU'd, please show evidence of that and re-subscribe ubuntu-sponsors.

   Thanks again for your work.

[1] http://reports.qa.ubuntu.com/reports/sponsoring/

Revision history for this message
Stéphane Graber (stgraber) wrote :

Hey Scott, I've been keeping an eye on that one and am waiting for someone knowledgeable about dpkg to review the actual fix before I sponsor this fix.

Revision history for this message
Alex Bennée (ajbennee) wrote :

Surely the fix required is for the kernel? While dpkg might be good at triggering the bug, it may not be the only thing that can hang the kernel and require a hard reset.

Revision history for this message
astrostl (astrostl) wrote :

My understanding of the bug, backed by the Debian lists, is that the kernel isn't hanging or otherwise behaving improperly: a system-wide sync call is being issued, and the system is waiting for all of the filesystems to sync (as directed).

Revision history for this message
Alex Bennée (ajbennee) wrote :

The behaviour we are seeing is that dpkg hangs and never returns. As dpkg is left in the "D" (uninterruptible sleep) state and cannot be killed, the only recourse is to hard-reset (power-cycle) the box. However, it might be that, due to file-system load, it can never complete the file-system sync. I'm currently trying to come up with a reliable reproduction that doesn't involve hosing our customer boxes....
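A hedged sketch for spotting this state: list every process in uninterruptible sleep ("D" at the start of the STAT column) together with its kernel wait channel. The function name is hypothetical, and it assumes the procps `ps` found on Ubuntu; the wait-channel names shown vary between kernel versions. The kernel's "task ... blocked for more than 120 seconds" messages in dmesg report the same condition from the kernel side.

```shell
#!/bin/sh
# Hypothetical helper: print PID, state, wait channel and command for every
# process whose STAT begins with "D" (uninterruptible sleep).
list_uninterruptible() {
    # Empty headers ("pid=") suppress the header line; awk keeps only D-state rows.
    ps -e -o pid= -o stat= -o wchan= -o comm= | awk '$2 ~ /^D/'
}
```

Running it a few times, some seconds apart, shows whether the same dpkg PID stays stuck in D state or whether it is slowly making progress between samples.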

Revision history for this message
Theodore Ts'o (tytso) wrote :

There are two issues here that interact, and so they are confusing people. The first is that the kernel has a potential livelock problem in the writeback code: if new pages that require writeback are constantly being dirtied, the sync(2) system call will never return (at least not until all of the pages are clean, and on a busy system with lots of processes writing to the disk that may never happen). It doesn't happen every time sync(2) is called, but since dpkg was calling sync(2) all the time, it tended to happen there. Still, this problem can happen without dpkg being involved at all, and on many different file systems, since it's a problem with the generic writeback code.

Trying to backport this fix to the ancient kernel which is in 10.04 is going to be _hard_. There are people at Red Hat who are paid the big bucks to do this kind of painful backporting (which in this case is multiple patches spread across multiple kernel releases before it was finally fixed, and with all sorts of dependencies). Good luck finding a volunteer willing to figure this out. I wouldn't; I would much rather run a 3.x kernel. And if I had a business that needed to use a stable enterprise kernel, I'd pay the darned Red Hat or SLES support fees, and get a professionally managed enterprise kernel.

Unfortunately, in my experience Canonical doesn't have paid kernel engineers who have either the skill or the bandwidth (not sure which) to do this kind of very tricky backporting to ancient LTS kernels, as compared to what Red Hat has done. I've seen this with ext4 bug fixes which don't get made to 10.04, but which Red Hat has been willing to do for their RHEL6 kernel.
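To observe the condition described above (pages being dirtied about as fast as writeback cleans them), one option is to sample the Dirty and Writeback counters in /proc/meminfo while a sync appears stuck. This is a hypothetical monitoring helper, not something from the thread: Dirty staying high or climbing while sync(2) never returns is consistent with the livelock, but it is not proof of it.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel's Dirty and Writeback totals (in kB).
# Dirty = modified pages not yet written; Writeback = pages being written right now.
writeback_snapshot() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }' /proc/meminfo
}

# Sample every few seconds, e.g. in a second terminal while dpkg seems stuck:
#   while :; do date +%T; writeback_snapshot; sleep 5; done
```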

Note that this problem is much less likely to hit on desktop/laptop systems, where there generally aren't servers continuously writing to the file system. So for most Ubuntu systems, which tend not to be production servers running highly stressful workloads, this won't be an issue. The people who are complaining on this Launchpad bug are probably outliers, which probably explains the low priority paid Canonical engineers have given to this kind of backporting.

The second problem/bugfix is the fix to dpkg, which significantly improves both its performance, and the impact on the system as a whole, by using sync_file_range() instead of sync(). Fixing this also tends to remove one of the more common ways of tickling the bug above, but that's not the only reason why backporting this dpkg package would also be a good idea, since it speeds up and decreases the overall system impact of doing package installs.
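The difference in scope between the old dpkg behaviour and the fix can be approximated from the shell. This is only an illustration under stated assumptions: dpkg's fix calls sync_file_range(2) from C, which a shell script cannot do directly, so the sketch uses coreutils `sync FILE...` (coreutils 8.24 or later, which calls fsync(2) on each named file) as a stand-in for a per-file flush. The function names are hypothetical.

```shell
#!/bin/sh
# Old behaviour (approximation): one global flush after unpacking.
# sync(2) asks the kernel to write back dirty data on every mounted file system,
# so the caller also waits on unrelated, busy file systems.
flush_everything() {
    sync
}

# Narrower alternative (approximation): flush only the files just written.
# With file arguments, coreutils sync (>= 8.24) calls fsync(2) on each one.
flush_files() {
    [ "$#" -gt 0 ] || return 0   # nothing to flush
    sync -- "$@"
}
```

Even `flush_files` blocks until the named files reach the disk. As documented in the sync_file_range(2) man page, calling it with only SYNC_FILE_RANGE_WRITE merely starts writeback of the given range and returns without waiting, which is what makes the asynchronous approach in the changelog below cheaper still.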

Or, people could just upgrade their system to Ubuntu LTS 12.04.....

Revision history for this message
Adam Conrad (adconrad) wrote : Please test proposed package

Hello Thomas, or anyone else affected,

Accepted dpkg into lucid-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/dpkg/1.15.5.6ubuntu4.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
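For testers, a rough sketch of what enabling -proposed for a single package can look like. The mirror URL, the component list and the helper name are assumptions; the EnableProposed wiki page linked above remains the authoritative procedure.

```shell
#!/bin/sh
# Hypothetical helper: print the APT source line for a release's -proposed pocket.
# MIRROR and the component list are assumptions; check the EnableProposed page.
MIRROR="${MIRROR:-http://archive.ubuntu.com/ubuntu}"

proposed_source_line() {
    release="$1"
    printf 'deb %s %s-proposed main restricted universe multiverse\n' \
        "$MIRROR" "$release"
}

# As root, roughly:
#   proposed_source_line lucid > /etc/apt/sources.list.d/lucid-proposed.list
#   apt-get update
#   apt-get install dpkg/lucid-proposed   # take only dpkg from -proposed
```

Installing with the `package/release` form keeps the rest of the system on the normal pockets while only dpkg comes from -proposed.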

Changed in dpkg (Ubuntu Lucid):
status: Triaged → Fix Committed
tags: added: verification-needed
Revision history for this message
Alex Bennée (ajbennee) wrote :

Thanks to Ted for clearly elucidating the two competing issues. For the time being we can't move from 10.04 (whole OS upgrades tend to be unpopular with customers for point releases). However, as the kernel lock is always triggered by dpkg, I'm hoping that just fixing dpkg will be enough. I assume that if you make the system quiescent enough by shutting down all I/O, the kernel will eventually un-wedge itself and allow a sync()-based dpkg to complete?

Anyway, I've tested Michael Jeanson's patch and I can confirm that even on a heavily I/O-loaded system dpkg now works its way (slowly) through a bunch of packages and eventually completes. I'm fairly happy to ship it in our product repositories, although of course I look forward to upstream finally adopting the fix.

Are there any regression tests I can run against dpkg to make sure nothing else is broken?

Revision history for this message
Alex Bennée (ajbennee) wrote :

So I have verified that the proposed dpkg no longer hangs. I used the following script to generate heavy load on the system:

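#!/bin/bash
#
# Generate IO load: several dd writers running in parallel for about 20 minutes.
# Reading /dev/sda1 needs root; the output files land in the current directory.
echo "starting load"
pids=()
#dd if=/dev/zero of=zero &
sleep 1s
dd if=/dev/urandom of=urandom &
pids+=($!)
sleep 1s
dd if=/dev/sda1 of=sda1_backup1 &
pids+=($!)
sleep 1s
dd if=/dev/sda1 of=sda1_backup2 &
pids+=($!)
sleep 1s
dd if=/dev/sda1 of=sda1_backup3 &
pids+=($!)
sleep 1s
dd if=/dev/sda1 of=sda1_backup4 &
pids+=($!)
echo "All load running"
sleep 10m
echo "10 minutes passed"
sleep 10m
echo "Finishing"
# Stop only the dd processes started above (killall dd would hit any dd on the box).
kill "${pids[@]}"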

I then ran a couple of apt-get dist-upgrades with some large packages. Although dpkg ran very slowly (due to the load), I was unable to trigger the kernel D-state hang.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package dpkg - 1.15.5.6ubuntu4.6

---------------
dpkg (1.15.5.6ubuntu4.6) lucid-proposed; urgency=low

  * Cherry-pick fixes for sync() behaviour in dpkg (LP: #624877):
    - Disable by default usage of synchronous sync(2), as it causes undesired
      I/O on unrelated file systems. Closes: #588339, #595927, #600075
    - On Linux use sync_file_range() to initiate asynchronous writeback
      of just unpacked files. Suggested by Ted Ts'o <email address hidden>.
      Thanks to Jonathan Nieder <email address hidden>. Closes: #605009
 -- Michael Jeanson <email address hidden> Fri, 14 Sep 2012 09:43:09 -0400

Changed in dpkg (Ubuntu Lucid):
status: Fix Committed → Fix Released
Revision history for this message
Colin Watson (cjwatson) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
dino99 (9d9) wrote :

This version is now outdated and no longer supported.

Changed in linux (Ubuntu):
status: Confirmed → Invalid