kvm guests become unstable after a while
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Fix Released | Undecided | GROS-PRUGNY | |
Natty | Invalid | Undecided | Unassigned | |
Oneiric | Fix Released | Undecided | Unassigned | |
qemu-kvm (Ubuntu) | Invalid | Medium | Unassigned | |
Natty | Invalid | Undecided | Unassigned | |
Oneiric | Invalid | Medium | Unassigned | |
Bug Description
After upgrading to natty's kernel I noticed that my VMs would sometimes become highly unstable, with random guest applications segfaulting and crashing in weird ways. This seems to be more pronounced when running more than one VM at a time. This does not seem to be a hardware issue: the host is a 6-month-old laptop, and I ran memtest86 for 12 hours with 18 successful completions and no errors. I saw no host instability and no messages in dmesg that would indicate a host problem. Downgrading to the maverick kernel fixes this problem. I have a script that will launch 10 VMs and run some commands:
#!/bin/sh
count=0
while /bin/true ; do
    count=$(( $count + 1 ))
    echo "RUN $count"
    vm-stop -f -p sec
    sleep 3
    vm-start -s -v -p sec
    sleep 15
    vm-cmd -c -r -p sec apt-get update
    vm-cmd -c -r -p sec apt-get -y --force-yes dist-upgrade
    vm-cmd -c -r -p sec apt-get -y --force-yes install chromium-browser
    vm-cmd -c -r -p sec apt-get -y --force-yes remove --purge chromium-browser*
    vm-cmd -c -r -p sec apt-get -y --force-yes install chromium-browser
    vm-cmd -c -r -p sec apt-get -y --force-yes remove --purge chromium-browser*
    vm-cmd -c -r -p sec apt-get -y --force-yes install chromium-browser
    vm-cmd -c -r -p sec apt-get -y --force-yes remove --purge chromium-browser*
    vm-cmd -c -r -p sec apt-get -y --force-yes install chromium-browser
    vm-cmd -c -r -p sec apt-get -y --force-yes remove --purge chromium-browser*
    vm-cmd -c -r -p sec apt-get -y --force-yes install chromium-browser
    vm-cmd -c -r -p sec apt-get -y --force-yes remove --purge chromium-browser*
done
'vm-start' starts 10 VMs via libvirt with snapshotted qcow2 disks, and 'vm-stop' kills them off, discarding the snapshot. 'vm-cmd' will ssh into each machine and run the command for each machine in sequence. The VMs themselves are all pristine and are resnapshotted on each loop iteration. The point of this explanation is to illustrate that while the VMs all start in the same state, they fail differently or sometimes not at all. I am able to reproduce guest instability within 4-5 iterations of this script on a natty kernel. With the maverick kernel it ran 18 iterations (around 8 hours) with no errors.
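The 'vm-cmd' helper itself is not included in this report. As a rough sketch only, a helper that runs one command on each guest in sequence over ssh might look like the following. The function name run_on_each, the SSH_CMD variable, and the guest names are all hypothetical and are not taken from the real tool.

```sh
#!/bin/sh
# Hypothetical stand-in for a 'vm-cmd'-style helper (NOT the reporter's tool):
# run the same command on each guest, one after another, and count failures.
# SSH_CMD is an assumed override hook so the loop can be tried without ssh.
SSH_CMD="${SSH_CMD:-ssh}"

run_on_each() {
    # $1: whitespace-separated list of guest hostnames
    # remaining arguments: the command to run on every guest
    guests="$1"
    shift
    failed=0
    for guest in $guests; do
        echo "== $guest =="
        if ! $SSH_CMD "$guest" "$@"; then
            echo "FAILED on $guest" >&2
            failed=$(( failed + 1 ))
        fi
    done
    # Succeed only if the command succeeded on every guest.
    [ "$failed" -eq 0 ]
}
```

With made-up guest names, a call would look like: run_on_each "guest-a guest-b" apt-get update. Running the guests in sequence rather than in parallel makes it easier to see which guest a given failure came from.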
For example, during one run of the reproducer script, a maverick/i386 guest failed with:
dpkg: parse error, in file '/var/lib/
'Depends' field, reference to 'libglib2.0-0': error in version: version string is empty
Another time the maverick/i386 failed with:
Processing triggers for man-db ...
dpkg: error processing man-db (--unpack):
subprocess installed post-installation script killed by signal (Segmentation fault)
Errors were encountered while processing:
man-db
A lucid/i386 guest failed another time with:
Processing triggers for python-gmenu ...
Rebuilding /usr/share/
Segmentation fault
dpkg: error processing python-gmenu (--purge):
subprocess installed post-installation script returned error exit status 139
Processing triggers for man-db ...
Errors were encountered while processing:
python-gmenu
There are many other failures....
On my laptop I have an i7 with two cores and 4 hyperthreads per core (this is the default configuration for this machine from the factory and the configuration used to report this bug). I am able to 'disable' hyperthreads in the BIOS, and if I do, I end up with 2 cores and 2 threads per core. In this configuration, I noticed that I don't have to run as many VMs to see the problem. I've seen it with as few as 2 VMs at a time. I mention this because the issue seems to be exacerbated when the ratio of VMs to CPUs is 1:1 or higher.
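To make the 1:1 observation easy to check on a given host, a small helper along these lines could compare the number of running guests with the number of logical CPUs. This is only an illustrative sketch: the function name vm_cpu_status is made up, and the commands in the trailing comment are an assumed way of collecting the two numbers on a current libvirt host, not something used for this report.

```sh
#!/bin/sh
# Illustrative helper (not part of the original report): say whether the
# number of running guests is at or above the number of logical host CPUs.
vm_cpu_status() {
    # $1: number of running VMs, $2: number of logical CPUs
    if [ "$1" -ge "$2" ]; then
        echo "$1 VMs on $2 CPUs: at or above 1:1"
    else
        echo "$1 VMs on $2 CPUs: below 1:1"
    fi
}

# Assumed way to gather the inputs on a libvirt host:
#   vm_cpu_status "$(virsh list --name | grep -c .)" "$(nproc)"
```

For example, vm_cpu_status 10 8 reports "at or above 1:1", while vm_cpu_status 2 4 reports "below 1:1"; the numbers here are arbitrary inputs, not measurements from the affected machine.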
I can say for certain that the rc6 and rc7 kernels in natty exhibit the problem, and maverick's does not. I can also say that the natty kernel runs considerably hotter than the maverick kernel, with average temperatures being 10-15C higher under load according to /proc/acpi/
ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: linux-image-
Regression: Yes
Reproducible: Yes
ProcVersionSign
Uname: Linux 2.6.37-11-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
Architecture: amd64
ArecordDevices:
**** List of CAPTURE Hardware Devices ****
card 0: Intel [HDA Intel], device 0: CONEXANT Analog [CONEXANT Analog]
Subdevices: 1/1
Subdevice #0: subdevice #0
AudioDevicesInUse:
USER PID ACCESS COMMAND
/dev/snd/
/dev/snd/pcmC0D0p: jamie 2360 F...m pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
Card hw:0 'Intel'/'HDA Intel at 0xf2520000 irq 43'
Mixer name : 'Intel IbexPeak HDMI'
Components : 'HDA:14f15069,
Controls : 16
Simple ctrls : 7
Card29.Amixer.info:
Card hw:29 'ThinkPadEC'
Mixer name : 'ThinkPad EC 6QHT28WW-1.09'
Components : ''
Controls : 1
Simple ctrls : 1
Card29.
Simple mixer control 'Console',0
Capabilities: pswitch pswitch-joined penum
Playback channels: Mono
Mono: Playback [off]
Date: Thu Dec 23 23:22:11 2010
EcryptfsInUse: Yes
HibernationDevice: RESUME=
InstallationMedia: Ubuntu 10.04 LTS "Lucid Lynx" - Release amd64 (20100427.1)
MachineType: LENOVO 5129CTO
ProcEnviron:
LANGUAGE=en_US:en
PATH=(custom, user)
LANG=en_US.UTF-8
LC_MESSAGES=
SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=
RelatedPackageV
SourcePackage: linux
dmi.bios.date: 04/20/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 6QET44WW (1.14 )
dmi.board.name: 5129CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.
dmi.modalias: dmi:bvnLENOVO:
dmi.product.name: 5129CTO
dmi.product.
dmi.sys.vendor: LENOVO
tags: added: regression-release; removed: regression-update
Changed in qemu-kvm (Ubuntu): status: Confirmed → Invalid
Changed in linux (Ubuntu): assignee: nobody → Serge Hallyn (serge-hallyn)
Changed in linux (Ubuntu): assignee: Serge Hallyn (serge-hallyn) → nobody
Changed in linux (Ubuntu Natty): status: Confirmed → Invalid
Added qemu-kvm task as this is only with kvm guests.