problem about diskless booting

Bug #181258 reported by linsson
This bug affects 5 people
Affects                   Status        Importance  Assigned to  Milestone
initramfs-tools (Ubuntu)  Fix Released  Medium      Unassigned

Bug Description

Hi,
I am trying to install Kubuntu 7.10 (not updated, just the distro as it comes from the installation CD) using diskless booting from a remote system that stores the kernel and the filesystem used by the clients. I followed this howto:
https://help.ubuntu.com/community/DisklessUbuntuHowto#head-320eeefb87afe42faa400af457bd455bec59d7ef
but when I tried to boot the client, the process stopped with a kernel panic:

ipconfig: eth0: SIOCGIFINDEX: No such device
ipconfig: no devices to configure
/init: .: 1: Can't open /tmp/net-eth0.conf
[ 25.445685] Kernel panic - not syncing: Attempted to kill init!

Googling around, I found similar problems reported by other *ubuntu users, and it seemed to be related to NetworkManager.
I applied the workaround described in the link above:

"
      Note: For Ubuntu 7.04 (Feisty Fawn) it seems the /etc/network/interfaces needs a little tweak, in order *not* to have the NetworkManager fiddle with the interface since it's already configured (see also bug #111227 : "NFS-root support indirectly broken in Feisty")

so, here's what your interfaces file should look like:

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface, commented out for NFS root
#auto eth0
#iface eth0 inet dhcp
iface eth0 inet manual
"

But it did not work. Then I tried another trick described in http://wiki.ubuntu.org.cn/UbuntuHelp:Installation/OnNFSDriveWithLocalBoot

"
There is currently a bug in NetworkManager that means even though the machine gets an IP just fine from the kernel booting the network IF goes down and up again. This is bad, we don't want the network to go down while trying to read files over the network! A quick hack is...

vim etc/network/interfaces
Comment out the lines for eth0 (Or whatever your ethernet card is called)

We also need to do this to stop NetworkManager messing with the network card. Bit of a hack...

vim etc/default/NetworkManager
vim etc/default/NetworkManagerDispatcher
Add a line with "exit" to both files

"

But nothing changed, and I got the same error again. I also looked at bug #111227 ("NFS-root support indirectly broken in Feisty") and bug #92338 in network-manager (Ubuntu).
I am not an expert; I understood that the problem could be caused by the avahi, dhcdbd, NetworkManager and NetworkManagerDispatcher daemons.
I didn't know how to stop the clients from starting avahi and dhcdbd, so I disabled them on the server with sysvconfig and then copied everything to /nfsroot (where the distributed filesystem lives); my hope was that the files changed by sysvconfig would carry over into the client configuration.
Then I put an "exit" line in the newly created files
/nfsroot/etc/default/NetworkManager
/nfsroot/etc/default/NetworkManagerDispatcher
and added an "exit 0" at the beginning of
/nfsroot/etc/dbus-1/event.d/25NetworkManager
/nfsroot/etc/dbus-1/event.d/26NetworkManagerDispatcher

Again, nothing changed.

Could you please give me some help?
Thanks in advance

Giuseppe

Steven McCoy (dsbunny) wrote :

The kernel panic indicates it cannot find an ethernet adapter. I'd try to fix that symptom first.

Can you boot off a live CD and check that networking works there? How are you building the initramfs? Try adding the module name for your NIC to /etc/initramfs-tools/modules, rebuilding the initramfs (update-initramfs -c -k all), and see if that helps.

Do you have an eth0, or are you booting off an eth1 or similar?
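
The suggestion to list the NIC module in /etc/initramfs-tools/modules and rebuild can be sketched roughly as follows. The helper name add_initramfs_module is hypothetical, and e1000e is only an assumed example (it is the driver mentioned later in this thread); substitute whatever driver your card actually uses.

```shell
#!/bin/sh
# Sketch only: add a NIC driver to the initramfs module list.
# add_initramfs_module is a hypothetical helper, not part of initramfs-tools.
add_initramfs_module() {
    module="$1"
    # Default to the list read by initramfs-tools; a second argument overrides it.
    list="${2:-/etc/initramfs-tools/modules}"
    # Append the name only if no identical line is already present.
    if ! grep -qxF "$module" "$list" 2>/dev/null; then
        echo "$module" >> "$list"
    fi
}

# Typical use (as root), followed by a rebuild:
#   add_initramfs_module e1000e     # e1000e is an assumed example driver
#   update-initramfs -u -k all      # -u updates existing images; the comment above uses -c
```

On a machine where the card already works (for example from a live CD), `basename "$(readlink /sys/class/net/eth0/device/driver/module)"` usually prints the module name to use.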

linsson (aprea-giuseppe) wrote :

Thanks for answering.

I did not build an initramfs because this step was not in the howto.
The client is exactly the same machine as the server; I am only running a test. I know both network cards work fine (at least when used with their local root filesystem).

As for your last question, I am not sure I understand you. These machines have only one network card, and the device is called eth0 everywhere. I never used eth1, eth2, ... in any step of the diskless procedure. The client does a PXE LAN boot, enabled in its BIOS.

Anyway, I am going to try your suggestion and include the NIC module in the initramfs.
I guess I have to do that before I copy the whole filesystem to /nfsroot, don't I?
Could you also please explain why the files now on the server in the NFS directory (which I call /nfsroot) work fine for a local boot on the client, but need this addition to work over NFS from the server? Once the client receives its IP address and loads the vmlinuz and initrd files, I thought it would also load the root filesystem, the same one that worked fine as a local filesystem!

Thomas Engelhard (th-engelhard) wrote :

Hi,

I have the same problem.
Rebuilding the initramfs with the NIC module did not solve the problem :-(

cu
Thomas

jgcb (jens-g) wrote :

Hello,

I can report the same problem for a diskless client with two NICs.
When the diskless client boots, sometimes NIC1 becomes eth0 and sometimes NIC2 becomes eth0.
As a result, booting works in the first case, and in the second case the system hangs because it does not get an IP address.
This error happens before the root filesystem is attached via NFS, so it is either a problem with PXELINUX or a buggy initramfs built by initramfs-tools. I checked Ubuntu 8.04 Desktop amd64 and the new 8.10 Server amd64. The problem happens on both versions.
Does anybody have experience building an initrd for diskless clients with multiple NICs, and can you report which Linux distribution or initramfs-tools version it works with?

Benjamin Lessani (d-contact-sonassi-com) wrote :

For those looking for a resolution to the issue jgcb highlighted, see here:

http://lists.debian.org/debian-live/2009/02/msg00088.html

Mirror: http://pastebin.com/f4b7d67b

xteejx (xteejx-deactivatedaccount) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. You reported this bug a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue for you. Can you try with the latest Ubuntu release? Thanks in advance.

Changed in initramfs-tools (Ubuntu):
status: New → Incomplete
Anna Jonna Armannsdottir (annaj) wrote :

I can report the same problem. A number of diskless clients go flawlessly through a PXE boot, then the net interface is brought down when the Linux kernel boots and suddenly, dhcp does not work anymore.

Background research on the network: the clients are connected to Cisco switches. The configuration of the switches makes them run spanning-tree calculations. This usually takes a few seconds, but on large networks it can take up to 30 to 40 seconds.
The network port on the switch is turned on only once this calculation has finished. This is primarily a security measure.
See also: http://networkers-online.com/blog/2008/08/what-is-bpdu-filter/
If this security measure is turned off, the client boots without problems. :)

To debug the problem, the following boot line was used:

  kernel amd64/vmlinuz
  append ro initrd=amd64/initrd.img nbdport=2000 debug=100 break=1 ip=dhcp

In my case, this causes a timeout, as can be seen in the attached screenshots. The kernel loads the NIC module e1000e and reports the type of NIC, etc. The kernel time reported is 5.927.

The debug shows the following:

configure_networking
[ -n eth0 ]
[ -e /tmp/net-eth0.conf ]
ipconfig -t 60 eth0

At about 6.4 in kernel time, the NIC module reports further about its IRQ configuration.
About two seconds later, at about 8.56, the NIC module reports that the link is up at 100 Mbps full duplex.
Some milliseconds later it signals ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready.

At this point, ipconfig is still waiting for an answer to a DHCP discover that was in fact dropped by the switch.
This is an obvious flaw in the script. It would be much more reasonable to retry repeatedly with increasing timeouts.

May I suggest the following timeouts (in seconds):
4
8
16
30
60

The sum of these timeouts is 118 seconds, or about two minutes.
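
The retry idea suggested above can be sketched as a small shell function. This is a minimal illustration, not the patch attached to this bug: the names configure_with_backoff and NET_CONF_DIR are hypothetical, and NET_CONF_DIR exists only so the function can be exercised outside an initramfs. It assumes, as the debug trace suggests, that ipconfig leaves /tmp/net-<device>.conf behind once it has obtained a configuration.

```shell
#!/bin/sh
# Sketch only: retry DHCP with growing timeouts instead of one 60-second wait.
# configure_with_backoff and NET_CONF_DIR are hypothetical names for illustration.
configure_with_backoff() {
    device="$1"
    conf="${NET_CONF_DIR:-/tmp}/net-${device}.conf"
    for timeout in 4 8 16 30 60; do
        # Each attempt sends a fresh DHCP request; ignore the exit status here.
        ipconfig -t "$timeout" "$device" || true
        # Success is judged by the result file, not by what ipconfig reports.
        if [ -e "$conf" ]; then
            return 0
        fi
    done
    # All five attempts (118 seconds in total) passed without a configuration.
    return 1
}
```

Inside the initramfs this would stand in for the single `ipconfig -t 60 eth0` call, for example as `configure_with_backoff eth0`. Short early attempts keep the boot fast on networks without a spanning-tree delay, while the later ones still cover a port that takes 30 to 40 seconds to start forwarding.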

Anna Jonna Armannsdottir (annaj) wrote :

And also this screenshot

Anna Jonna Armannsdottir (annaj) wrote :

I made an untested patch.

tags: added: patch
Anna Jonna Armannsdottir (annaj) wrote :

I tested the patch today on a number of computers running in the computer lab at the University of Iceland. The patch really solves the problem.
The patch also reveals another minor problem:
the ipconfig program reports that it is giving up at each timeout, when it has actually done the right job.
It is probably a bug in ipconfig.

Another thing about this patch: the booting process is much faster. :)

xteejx (xteejx-deactivatedaccount) wrote :

Upgrading status to Triaged. Thank you for the patch, I will subscribe Ubuntu Sponsors to see if this can be sponsored and included into Ubuntu as a fix. Thank you.

Changed in initramfs-tools (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Triaged
xteejx (xteejx-deactivatedaccount) wrote :

Sponsors Team subscribed. Patch above.

Philip Muškovac (yofel) wrote :

I tried to apply the patch in Lucid and it failed, because the function has changed since Jaunty. While at it, I checked the upstream code and found that this seems to have been fixed in Debian a while ago. So it would be nice if someone could test the upstream code too and see whether it resolves the issue. I copied it to a pastebin:

http://paste.ubuntu.com/427631/

tags: added: patch-needswork
Philip Muškovac (yofel)
tags: removed: patch
Sebastien Bacher (seb128) wrote :

setting to incomplete and unsubscribing the sponsors for now, it needs testing on lucid and the change needs to be updated if it's still required

Changed in initramfs-tools (Ubuntu):
status: Triaged → Incomplete
Dean Montgomery (dmonty) wrote :

I was able to fix the bug on Lucid by running ipconfig inside a for loop, following Anna Jonna Armannsdottir's suggestion above.

I've attached an updated file that goes in /usr/share/initramfs-tools/scripts/functions
You may want to merge it with vimdiff or create a proper diff patch.

After adding the file you must rebuild your ramdisk:
dpkg-reconfigure linux-image-2.6.32-23-generic
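
The install step described above can be sketched as follows, keeping a copy of the packaged file so the change can be reverted. The helper name install_patched_functions and the file name ./functions.patched are hypothetical; the target path and kernel version are the ones given in this comment.

```shell
#!/bin/sh
# Sketch only: install a patched scripts/functions file, keeping a backup.
# install_patched_functions is a hypothetical helper name.
install_patched_functions() {
    patched="$1"
    target="${2:-/usr/share/initramfs-tools/scripts/functions}"
    # Save the packaged version once; never overwrite an existing backup.
    if [ -e "$target" ] && [ ! -e "$target.orig" ]; then
        cp -p "$target" "$target.orig"
    fi
    cp "$patched" "$target"
}

# Typical use (as root); ./functions.patched is an assumed local file name:
#   install_patched_functions ./functions.patched
#   update-initramfs -u -k 2.6.32-23-generic
```

Note that a later upgrade of the initramfs-tools package would replace the patched file, so the change may need to be reapplied.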

Details about how/why the error occurs:
* All our schools run diskless clients.
* Some of our schools have HP switches with spanning tree turned on in order to stop network loops.
* Spanning tree adds additional delay whenever the network link is re-initialized.
* When the kernel loads the network driver it resets the NIC, causing the spanning-tree delay.
* The initramfs' ipconfig fails because the network link light is off at the time of execution.
* Putting the initramfs' ipconfig inside a for loop resolves the issue.

Dean Montgomery (dmonty) wrote :

For those that want to confirm the bug and don't have spanning tree on their network switches...

* Set up a workstation for diskless boot.
* Turn on the diskless client:
  - Watch the link light as soon as the kernel starts to load.
  - You will see the link light turn off as the kernel loads the network driver.
  - At the very moment the link light goes off, unplug your network cable for a few seconds.
  - This simulates the kind of delay that spanning tree causes on some switches.

melter (termant) wrote :

I am trying to get my 10.04.1 system booting. I added the module to /etc/initramfs-tools/modules, which didn't help.

If I modify my /usr/share/initramfs-tools/scripts/functions, is the command

"mkinitramfs -o initrd.img"

the same as the command

"dpkg-reconfigure initrd.img"

as Dean Montgomery suggested?

Thomas Gebhardt (th-geb) wrote :

The /scripts/functions patch from Dean Montgomery (#15) saved my day. Thanks!

That bug affected me when I deployed an Ubuntu Lucid netboot environment in a multimedia room for a workshop. The e1000e driver module did not complete its DHCP setup (link becomes ready ... giving up), and after 60 s I got a kernel panic. I replaced /scripts/functions in the initrd image and it worked like a charm.

The various reports in this thread may be caused by different flaws. The /scripts/functions patch presented here makes the boot procedure more robust either way. I'd therefore suggest including that patch irrespective of this bug report.

maximilian attems (maks-debian) wrote :

This is fixed in initramfs-tools from Maverick on. The patch landed in Debian 0.94 and thus got included in later Ubuntu sync.
It is indeed unfixed in Lucid.

Changed in initramfs-tools (Ubuntu):
status: Incomplete → Fix Released