Comment 75 for bug 245779

JvA (jvanacht) wrote:

I just started seeing this bug on an NFS server (nfs-kernel-server 1:1.1.2-2ubuntu2.2) on 8.04 LTS (2.6.24-27-server #1 SMP Fri Mar 12 01:23:09 UTC 2010 x86_64 GNU/Linux / Ubuntu 2.6.24-27.68-server) with four NFS exports from two iSCSI-attached volumes (open-iscsi 2.0.865-1ubuntu3.3). The NFS server is a virtual machine (VMware ESXi 3.5.0 build 169697) that was set up in October 2009; it strictly serves the interim need of offloading data from an old (OpenSuSE 10.0) Samba server's overflowing hard drives.

Up until last week the machine ran without any trouble.

Last week we added the second of the two iSCSI volumes and created an NFS export for the space on that volume. (All volumes, local disk and iSCSI, are ext3.) We mounted the new NFS volume from the Samba machine and moved about 100GB of data off the old Samba server's local drives via rsync. No problem doing that. We then deleted the data from the Samba server and created symlinks in place of each moved folder, pointing to the respective folder on the new NFS volume.
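
For the record, the move went roughly like this (paths are illustrative, not our actual share names):

  # copy one backup folder onto the new NFS mount, preserving ownership/permissions
  rsync -a /data/backups/ /mnt/nfs2/backups/
  # delete the local copy and leave a symlink behind so the Samba share still resolves
  rm -rf /data/backups
  ln -s /mnt/nfs2/backups /data/backups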

The 100GB of data we just moved is backup data that the Windows users write with robocopy.exe. That system had worked just fine for years. But now, nearly every time a robocopy job runs, the NFS server's kernel hangs with the "soft lockup ... stuck for 11s" error being discussed in this thread. When this happens, the virtual machine is totally unresponsive and we have to do a hard reset. The other virtual machines on the VMware host do not seem to be impacted in any way.

When we manually drag and drop 4GB of data (a typical amount for the users' robocopy jobs), we do not see the problem. This is the first of our NFS folders that has had to handle data copied (through the Samba server, remember) by robocopy.

I'm no Linux kernel developer, but my two cents: under a heavy write load, the iSCSI initiator stops responding for a few seconds, and the kernel flags that stall as a soft lockup. After doing some research, we are going to try increasing /proc/sys/kernel/softlockup_thresh from 10 to 60 seconds (the maximum allowed value short of turning the threshold check off) and see if that changes anything. If my hypothesis is correct, it should.
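
For anyone who wants to try the same tweak, it is just a sysctl (path and default value as on our 2.6.24 kernel):

  # check the current threshold (default 10 seconds)
  cat /proc/sys/kernel/softlockup_thresh
  # raise it to 60 seconds on the running kernel (as root)
  echo 60 > /proc/sys/kernel/softlockup_thresh
  # to persist across reboots, add this line to /etc/sysctl.conf:
  #   kernel.softlockup_thresh = 60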

Perhaps these observations will be of some value to the community and developers in piecing this puzzle together...