MAAS PXE Boot stalls with grub 2.02

Bug #1900668 reported by Michał Ajduk
This bug affects 5 people
Affects          Status         Importance   Assigned to   Milestone
maas-images      Fix Released   Undecided    Unassigned
grub2 (Ubuntu)   Fix Released   Undecided    Unassigned

Bug Description

# ENVIRONMENT
MAAS version (SNAP):
  maas 2.8.2-8577-g.a3e674063 8980 2.8/stable canonical✓ -

  MAAS was cleanly installed. KVM POD setup works.

  MAAS status:
  bind9 RUNNING pid 9258, uptime 15:13:02
  dhcpd RUNNING pid 26173, uptime 15:09:30
  dhcpd6 STOPPED Not started
  http RUNNING pid 19526, uptime 15:10:49
  ntp RUNNING pid 27147, uptime 14:02:18
  proxy RUNNING pid 25909, uptime 15:09:33
  rackd RUNNING pid 7219, uptime 15:13:20
  regiond RUNNING pid 7221, uptime 15:13:20
  syslog RUNNING pid 19634, uptime 15:10:48

Servers:
HPE DL380 Gen10 configured to UEFI boot via PXE (PXE legacy mode), Secure boot disabled. All servers (18) experience the described problem.

UEFI Boot menu contains 2 entries allowing one to select the PXE mode:
- HPE Ethernet 1Gb 4-port 366FLR Adapter - NIC (HTTP(S) IPv4)
- HPE Ethernet 1Gb 4-port 366FLR Adapter - NIC (PXE IPv4)

# PROBLEM DESCRIPTION
Similar to https://bugs.launchpad.net/maas/+bug/1899840

PXE boot stalls after downloading grubx64.efi but before downloading grub.cfg:
2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.216.240.69
2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.216.240.69
2020-10-20 07:18:21 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.216.240.69
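
To spot affected nodes from the rack controller log, one could look for clients whose most recent grubx64.efi request was never followed by a grub.cfg request. The following is a minimal sketch in POSIX shell, not part of MAAS. The line layout is taken from the excerpt above; the function name `stalled_clients` and the assumption that the config request is logged with a filename containing `grub.cfg` are illustrative and unverified.

```shell
# stalled_clients: read rackd TFTP log lines on stdin and print each client IP
# whose latest grubx64.efi request was not followed by a grub.cfg request.
# Assumption: the config fetch is logged with a filename containing "grub.cfg".
stalled_clients() {
    awk '
        # Expected layout: DATE TIME LOGGER [info] FILE requested by IP
        $6 == "requested" && $7 == "by" {
            file = $5
            ip = $8
            if (file ~ /grubx64\.efi$/) pending[ip] = 1
            else if (file ~ /grub\.cfg/) delete pending[ip]
        }
        END {
            for (ip in pending) print ip
        }
    ' | sort
}
```

It would be fed the rackd log on stdin, for example `stalled_clients < rackd.log`, where the log path is a placeholder. A node that is still between the two requests when the log is read would also be listed, so the output is only a hint.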

Grub drops to the grub prompt.
Within the grub prompt:
- net_ls_addr shows correct IP address
- net_ls_routes shows correct routing
- net_bootp (which should initiate a DHCP request from grub) fails with the message: failed to send packet

We've also noticed that in a working scenario, just after starting up but before downloading grub.cfg, grub sends an ARP request for the MAAS IP:
13517 2020-10-19 13:53:38.864937 HewlettP_02:3d:e8 Broadcast ARP 60 Who has 10.216.240.1? Tell 10.216.240.51
and MAAS replies.

When the boot stalls, one of the symptoms is that grub does not send the ARP request for MAAS IP. It also does not reply to MAAS ARP requests. It looks as if the EFI_NET stack was failing.

# WORKAROUNDS
1) During the PXE boot, send ARP requests from MAAS to query the node IP. This seems to prevent the node from losing connectivity.

Tested 4 times on independent nodes.
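
Workaround 1 could be scripted roughly as follows. The report does not say which tool was used to send the ARP requests; `arping` is an assumption, and the flags shown (`-c` for count, `-I` for interface) follow the iputils variant, so other arping implementations may differ. The function name `arp_poke`, the round count and the interval are hypothetical placeholders.

```shell
# arp_poke NODE_IP IFACE [ROUNDS] [INTERVAL_SECONDS]
# Sends one ARP request per round from the MAAS host to the booting node.
# ARPING may be overridden, e.g. ARPING="echo arping" for a dry run.
# Assumption: iputils-style arping; it usually needs root privileges.
arp_poke() {
    node_ip=$1
    iface=$2
    rounds=${3:-60}
    interval=${4:-1}
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # A node that is not up yet simply does not answer; keep going.
        ${ARPING:-arping} -c 1 -I "$iface" "$node_ip" || true
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For instance, `arp_poke 10.216.240.69 eno1 120 1` would query the node from the log excerpt once per second for two minutes; the interface name is only an example.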

2) Custom built grub:
grub-mkimage -c grub.conf -o grubx64.efi -O x86_64-efi -p /grub normal configfile tftp memdisk boot diskfilter efifwsetup efi_gop efinet ls net normal part_gpt tar ext2 linuxefi http echo chain search search_fs_uuid search_label search_fs_file test tr true minicmd

Grub version: 2.02-2ubuntu8.18

The grub PXE image built in the way described above works on all nodes (18) all the time (4 times tested).

When I included the grub module linux.mod, I managed to reproduce the described problem.

It seems that the issue can be related to https://savannah.gnu.org/bugs/?func=detailitem&item_id=50715

description: updated
affects: grub (Ubuntu) → grub2 (Ubuntu)
David Britton (dpb) wrote :

As this workaround requires doctoring ubuntu images and turning off ubuntu image updates, setting to critical.

David Britton (dpb) wrote :

Note also, I expect this does not affect MAAS, so removing that line item, feel free to correct if someone disagrees.

Changed in maas:
status: New → Invalid
tags: added: rls-hh-incoming
Dimitri John Ledkov (xnox) wrote :
Julian Andres Klode (juliank) wrote :

Does it work if you include any other modules than linux.mod which come to the same overall file size?

Stéphane Graber (stgraber) wrote :

I've seen something similar affecting my MAAS setup at home.

In my case, when I hit the grub shell, entering "normal" is enough to get grub to go back to normal and boot. I don't know if it's the same issue but if it is, I have no idea why grub wouldn't have done the right thing in the first place...

If you have access to an interactive console, it'd be interesting for you to try the "normal" trick and see if that gets you back to a booting system too, if not, then I have a different issue :)

Stéphane Graber (stgraber) wrote :

Just had another one get stuck, took a few screenshots of the prompt, the output of the commands in the original bug report and the output I'm getting after running the "normal" command.

Stéphane Graber (stgraber) wrote :
Julian Andres Klode (juliank) wrote :

Marking this as incomplete until we have access to reproducers or someone committed to answering questions.

- Does it fail with other images that contain linux.mod but are smaller?
- Does it fail with other, larger images that don't contain linux.mod?
- Does it fail with 2.04 in focal as well?

@stgraber: Does net_bootp (or net_dhcp) produce the same issue for you? You're on 2.04 and your issue may very well be different.

Changed in grub2 (Ubuntu):
status: New → Incomplete
Julian Andres Klode (juliank) wrote :

Oh yeah, please set debug=all and retry if you can't give access for checking.

Julian Andres Klode (juliank) wrote :

And try with the current SRU with the TFTP fix.

Julian Andres Klode (juliank) wrote :

Also try on a different server; it might very well be a bug in the server's firmware.

Stéphane Graber (stgraber) wrote :

I believe between my setup and the other reported in here, it's been seen on at least 3 different platforms (some HP servers, Gigabyte servers and OVMF/EDK2 VMs).

Unfortunately, the image used doesn't matter in this case. This is network booting and MAAS only has a single grub2 binary used for everything that's booting from it.

Michał Ajduk (majduk) wrote :

I have now tried MAAS 2.9/candidate without the workaround described in the original bug report, and I have reached a state where every deployment is successful (that includes the fixes incorporated in 2.9 for the boot order issues).

I haven't analyzed the difference in the images (if any) between the failing one and the one that I use now, though (I would have to check the grub image size at least).

Julian Andres Klode (juliank) wrote :

Stéphane:

You've never run net_bootp (in the attached screenshots at least) - the command that's failing here, so it's not certain if you have the same problem.

If you look at the end of the bug report, you'll see the image matters - the custom image works, but if linux.mod is included, it fails. Given that linux.mod seems entirely unrelated to net_bootp failing, the question was if that failure is a function of size and not of the specific module.
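
To separate "size" from "specific module", one could pick a stand-in module whose file size is close to that of linux.mod and build a test image with it instead. The following is a rough sketch under assumptions: the function name `similar_sized_modules` is hypothetical, the module directory (on Ubuntu typically something like /usr/lib/grub/x86_64-efi) must be supplied by the caller, and a module's file size is only a proxy for its effect on the final image size.

```shell
# similar_sized_modules DIR REFERENCE TOLERANCE_BYTES
# Lists "<module> <size>" for every *.mod file in DIR whose size is within
# TOLERANCE_BYTES of the REFERENCE module (for example linux.mod).
similar_sized_modules() {
    dir=$1
    ref=$2
    tol=$3
    [ -f "$dir/$ref" ] || return 1
    # Arithmetic expansion strips any padding that wc may print.
    ref_size=$(($(wc -c < "$dir/$ref")))
    for f in "$dir"/*.mod; do
        [ -f "$f" ] || continue
        name=${f##*/}
        [ "$name" = "$ref" ] && continue
        size=$(($(wc -c < "$f")))
        diff=$((size - ref_size))
        [ "$diff" -lt 0 ] && diff=$((-diff))
        if [ "$diff" -le "$tol" ]; then
            echo "$name $size"
        fi
    done
}
```

A candidate from the output could then replace linux.mod in the grub-mkimage module list from the bug description, and the resulting image sizes compared before testing.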

Julian Andres Klode (juliank) wrote :

Michał that sounds promising, if a bit unexpected, thanks for letting us know.

tags: added: fr-943
tags: removed: rls-hh-incoming
Jeff Lane  (bladernr)
tags: added: hwcert-server
Junien F (axino) wrote :

The test environment for the Ubuntu Foundations team is now ready.

Changed in grub2 (Ubuntu):
status: Incomplete → New
Michał Ajduk (majduk) wrote :

Regarding https://bugs.launchpad.net/maas/+bug/1900668/comments/15

I've now also had this issue with 2.9, so it's not MAAS-related (so I believe we are back to the common understanding that it is a grub issue).

Dimitri John Ledkov (xnox) wrote :

Michał Ajduk based on the above comments #13 and #15, can you please confirm if there is anything outstanding or not working w.r.t. this issue?

Changed in grub2 (Ubuntu):
status: New → Incomplete
Changed in maas-images:
status: New → Incomplete
Nobuto Murata (nobuto) wrote :

> Michał Ajduk based on the above comments #13 and #15, can you please confirm if there is anything outstanding or not working w.r.t. this issue?

Michał's comment #17 retracted comment #13, so there is indeed still an outstanding issue here.

Changed in maas-images:
status: Incomplete → New
Changed in grub2 (Ubuntu):
status: Incomplete → New
Changed in grub2 (Ubuntu):
status: New → Confirmed
Joelene M. Wheat (akthejo5725) wrote : Re: [Bug 1900668] Re: MAAS PXE Boot stalls with grub 2.02

So I understand most of this; I just have 2 questions!
Is it fixed?
Also, what does this mean?

  UEFI Boot menu contains 2 entries allowing one to select the PXE mode:
  - HPE Ethernet 1Gb 4-port 366FLR Adapter - NIC (HTTP(S) IPv4)
  - HPE Ethernet 1Gb 4-port 366FLR Adapter - NIC (PXE IPv4)

thanks
joelene wheat

no longer affects: maas
Nobuto Murata (nobuto)
tags: added: ps5
Dimitri John Ledkov (xnox) wrote :

Hirsute has an updated grub2 which improves networking performance during provisioning.

See https://launchpad.net/ubuntu/+source/grub2/2.04-1ubuntu37

The file that maas streams use from https://images.maas.io/ephemeral-v3/stable/bootloaders/uefi/amd64/20201123.0/grub2-signed.tar.xz is this one http://archive.ubuntu.com/ubuntu/dists/hirsute/main/uefi/grub2-amd64/2.04-1ubuntu37/grubnetx64.efi.signed

I would be interested to know if above improves provisioning reliability, and/or speed.

This should improve TCP deployments.

But there are more things we can do. Specifically, there are patches available, but not yet integrated, that should allow HTTP and HTTPS provisioning to use firmware-accelerated code, i.e. reusing leases from the UEFI firmware and using UEFI firmware paths for HTTP/HTTPS networking, without using the grub network stack per se.

Junien F (axino) wrote :

@xnox as far as I know, the Foundations team still has access to the test hardware, if you want to test this new grub.

Dimitri John Ledkov (xnox) wrote :

@axino

Indeed we do. I have deployed maas there, and sideloaded grub from hirsute for deployments. However, irrespective of which grub I use, I cannot reproduce deployment issues.

Both nodes always deploy fine. =/ That is not great, as I was hoping to reproduce the failure to deploy.

Does it matter where the deployment happens from? I.e. could it be that the NUC which is my rack+region controller is not under any network load, and hence keeps up with deploying everything?

Junien F (axino) wrote :

@xnox I was able to hit this bug while deploying a single server, so it's definitely not a case of the rackd not keeping up.

Which grub versions did you try?
How are you updating the GRUB that MAAS serves?

Dimitri John Ledkov (xnox) wrote :

@axino

I tried the default grub, which appears to be the one synced from the maas.io stream, which is in fact the bionic-updates grub 2.02 signed net x64 EFI app.

I have downloaded the hirsute 2.04 signed net x64 efi app and replaced it at

/var/snap/maas/common/maas/boot-resources/snapshot-20210112-205807/bootloader/uefi/amd64/grubx64.efi

(there are many other files called that, which ultimately after all symlinks point there)

Then I also ran unsquashfs on the maas snap and changed the templates to echo the full version number of grub on boot.

Specifically tweaked

./lib/python3.8/site-packages/provisioningserver/templates/uefi/config.local.amd64.template

to have:

echo "XNOX... ${package_version}"

Then deployed machines whilst monitoring the console over BMC to confirm that deployments happen with XNOX... 2.04-1ubuntu37
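
The claim that every grubx64.efi under the boot-resources tree ends up at the replaced file could be checked with something like the sketch below. It is an illustration only: the function name `list_grub_targets` is hypothetical, and it relies on `readlink -f` as provided by GNU coreutils.

```shell
# list_grub_targets ROOT
# Prints the distinct real files that every grubx64.efi under ROOT resolves to
# once all symlinks are followed. One line of output means every copy points
# at the same binary.
list_grub_targets() {
    find "$1" -name grubx64.efi | while IFS= read -r path; do
        readlink -f "$path"
    done | sort -u
}
```

Running `list_grub_targets /var/snap/maas/common/maas/boot-resources` should then print only the snapshot path mentioned above; more than one line would mean some copy is not a symlink to the replaced binary.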

I wonder if I'm using some other NIC? I see on my ps5 nodes that there are 5 NICs available, and PXE is configured for me over the eno1 1 Gbps link.

Dimitri John Ledkov (xnox) wrote :

One node is already tweaked a bit, as I fixed doing SecureBoot deployments on it.
And the other one has CentOS deployed to debug another bug.
But I can revert back to "stock grub" non-secureboot and see if I can reproduce things.

Maybe we need to compare fw / bios versions too?

Or for example any other networking details / firewall / routers that might be different on my lucky two nodes.

Changed in grub2 (Ubuntu):
status: Confirmed → Incomplete
Junien F (axino) wrote :

You're using the right interface. FW and BIOS versions are the same on all nodes. Leaving this bug in Incomplete state since I'm not directly impacted by this bug anymore. I guess we'll see what happens when the cloud gets redeployed from scratch (presumably by Field).

Dimitri John Ledkov (xnox) wrote :

Nonetheless, I am doing an SRU to Focal of the changes I made in Hirsute, which should improve the networking stack of grub.

And will continue to try to use those nodes to enable firmware accelerated https network boot too.

And I have submitted a proposal to switch maas-images from using bionic's grub to using focal's grub.

This way, if this issue appears again, we will hopefully be on a better footing.

Changed in maas-images:
status: New → Incomplete
Dimitri John Ledkov (xnox) wrote :

Focal grub2 sru is in updates.

MAAS is in the process of switching to focal bootloaders.

Frode Nordahl (fnordahl) wrote :

I'm also seeing this issue, and in my case typing in `normal` does nothing, it just gives me a new prompt.

As I see this as part of a MAAS deployment, typing in `reboot` allows the deployment process to progress from "Powering On" to "Deploying".

Dimitri John Ledkov (xnox) wrote :

@frode do you have a reproducer for this?

Which streams are you using? We have just upgraded bootloader stream to have focal's 2.04 with improved networking performance, it should work better.

Changed in grub2 (Ubuntu):
status: Incomplete → Fix Released
Changed in maas-images:
status: Incomplete → Fix Released
Dimitri John Ledkov (xnox) wrote :

Not sure if it makes sense to carry on with this bug report, given that maas has been upgraded to use focal's grub 2.04, and 2.02 from bionic is no longer used for provisioning.

Mattias Andersson (jerrymattias) wrote :

Not sure if this is in any way related, but the symptoms are very close to what you saw with MAAS and grub 2.02.
The issue I have seen occurs on 2.06 and 2.16 as well. My environment is not MAAS, but I am using grub from Ubuntu, and since the symptoms are so similar I thought I would share my bug report here in case it helps. I have a method to reproduce the issue that is as efficient as it gets. I only see the issue on certain hardware, and I suspect it could be a problem with the device firmware, but I do not know how to prove that theory.

http://savannah.gnu.org/bugs/?func=detailitem&item_id=63245
