Comment 45 for bug 245779

Warren V (verbanista) wrote : Re: [Bug 245779] Re: Server 8.04 LTS: soft lockup - CPU#1 stuck for 11s! [bond1:3795] - bond - bond0

Hi-

Actually, I think I can do one better. The latest Red Hat kernel patch
release for 2.6.18-128 seems to have fixed the issue (two weeks now with no
reboot or lockup), even though no "official" fix is listed. It looks like
they made some changes to the bonding code to fix bogus MAC address
tracking, and that change may be what is preventing the larger issue.

The patch discussion is at:
https://rhn.redhat.com/errata/RHSA-2009-0225.html
I downloaded the patch from:
http://people.redhat.com/dzickus/el5/128.el5/i686/

For those of us running CentOS, this is a straight rpm -ivh install. I
thought about rolling my own 2.6.24 kernel, but it was just too much of a
jump ahead in kernel versions for me to be comfortable.
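
If you want to confirm the "no lockup" result on your own box rather than
waiting for a hang, here is a rough sketch that scans kernel log lines for
the soft lockup message from the bug description. The function name and the
regular expression are my own assumptions, based only on the one sample line
in this bug, so treat it as a starting point and not a tested tool.

```python
import re

# Pattern modelled on the single sample in this bug report:
#   "BUG: soft lockup - CPU#1 stuck for 11s! [bond1:3795]"
# Other kernel versions may word the message differently (assumption).
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return a (cpu, seconds, comm, pid) tuple for each soft lockup line."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append(
                (
                    int(match.group("cpu")),
                    int(match.group("secs")),
                    match.group("comm"),
                    int(match.group("pid")),
                )
            )
    return hits
```

Pass it any iterable of log lines, for example an open file handle on
whatever file your syslog writes kernel messages to (the location varies by
distribution). An empty list means no lines matched this particular pattern.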

Thanks for the message!

-Warren V

On Mon, Feb 2, 2009 at 9:30 AM, Ryan Sitzman <email address hidden> wrote:

> This isn't a solution to the bug, but you may find that using the
> backports repository to install xen 3.3.0 and the 2.6.24-23 kernel
> yields some positive results. On one of my boxes, I could consistently
> trigger the 'CPU#1 stuck' problem, and after upgrading it hasn't locked
> up once. Of course, on a different box with slightly different hardware,
> it locks up just as frequently as before... so ymmv.
>
> --
> Server 8.04 LTS: soft lockup - CPU#1 stuck for 11s! [bond1:3795] - bond -
> bond0
> https://bugs.launchpad.net/bugs/245779
>
> Status in "linux" source package in Ubuntu: Confirmed
> Status in "linux" source package in Debian: Fix Released
>
> Bug description:
> Hi!
> Ubuntu Server 8.04 LTS with all patches and the latest kernel
> Hardware: HP DL360 G4 Xeon
> Bonding with :
> - bond0 2x1Gb Intel (802.3ad / 4)
> - bond1 8x1Gb Intel (802.3ad / 4)
> Nagios (only nrpe and plugins)
> Heartbeat2 (without CRM)
> VLAN
>
> Today it crashed (after two weeks of uptime since the kernel upgrade) with this output:
>
> 6640927 firewall 11:46:54 kernel: [431168.944816] BUG: soft lockup - CPU#1
> stuck for 11s! [bond1:3795]
> 6640928 firewall 11:46:54 kernel: [431168.944849]
> 6640929 firewall 11:46:54 kernel: [431168.944853] Pid: 3795, comm: bond1
> Not tainted (2.6.24-19-server #1)
> 6640930 firewall 11:46:54 kernel: [431168.944856] EIP:
> 0060:[ipv6:_spin_lock+0xa/0x10] EFLAGS: 00000286 CPU: 1
> 6640931 firewall 11:46:54 kernel: [431168.944865] EIP is at
> _spin_lock+0xa/0x10
> 6640932 firewall 11:46:54 kernel: [431168.944867] EAX: f749f334 EBX:
> f749f25c ECX: 00000001 EDX: f749f25c
> 6640933 firewall 11:46:54 kernel: [431168.944870] ESI: 00000000 EDI:
> f7ca1000 EBP: f6c35c80 ESP: f6835cc0
> 6640934 firewall 11:46:54 kernel: [431168.944872] DS: 007b ES: 007b FS:
> 00d8 GS: 0000 SS: 0068
> 6640935 firewall 11:46:54 kernel: [431168.944875] CR0: 8005003b CR2:
> b7bfd0a0 CR3: 35908000 CR4: 000006b0
> 6640936 firewall 11:46:54 kernel: [431168.944878] DR0: 00000000 DR1:
> 00000000 DR2: 00000000 DR3: 00000000
> 6640937 firewall 11:46:54 kernel: [431168.944880] DR6: ffff0ff0 DR7:
> 00000400
> 6640938 firewall 11:46:54 kernel: [431168.944887] [<f8b67606>]
> ad_rx_machine+0x26/0x690 [bonding]
> 6640939 firewall 11:46:54 kernel: [431168.944899]
> [nf_nat:_read_lock_bh+0x8/0x50] _read_lock_bh+0x8/0x20
> 6640940 firewall 11:46:54 kernel: [431168.944920] [arp_process+0x8b/0x5f0]
> arp_process+0x8b/0x5f0
> 6640941 firewall 11:46:54 kernel: [431168.944930] [<f8b67e6a>]
> bond_3ad_lacpdu_recv+0x1fa/0x240 [bonding]
> 6640942 firewall 11:46:54 kernel: [431168.944946]
> [ip_local_deliver_finish+0xf9/0x210] ip_local_deliver_finish+0xf9/0x210
> 6640943 firewall 11:46:54 kernel: [431168.944955]
> [ip_rcv_finish+0xff/0x370] ip_rcv_finish+0xff/0x370
> 6640944 firewall 11:46:54 kernel: [431168.944960]
> [sock_def_write_space+0x12/0xa0] sock_def_write_space+0x12/0xa0
> 6640945 firewall 11:46:54 kernel: [431168.944968] [<f8967a4b>]
> e1000_alloc_rx_buffers+0xab/0x3a0 [e1000]
> 6640946 firewall 11:46:54 kernel: [431168.944982] [arp_rcv+0x0/0x140]
> arp_rcv+0x0/0x140
> 6640947 firewall 11:46:54 kernel: [431168.944994]
> [e1000:__netdev_alloc_skb+0x22/0x2a80] __netdev_alloc_skb+0x22/0x50
> 6640948 firewall 11:46:54 kernel: [431168.945000] [<f8b67c70>]
> bond_3ad_lacpdu_recv+0x0/0x240 [bonding]
> 6640949 firewall 11:46:54 kernel: [431168.945011]
> [tg3:netif_receive_skb+0x379/0x720] netif_receive_skb+0x379/0x440
> 6640950 firewall 11:46:54 kernel: [431168.945024] [<f8968474>]
> e1000_clean_rx_irq+0x174/0x500 [e1000]
> 6640951 firewall 11:46:54 kernel: [431168.945037] [<f8968378>]
> e1000_clean_rx_irq+0x78/0x500 [e1000]
> 6640952 firewall 11:46:54 kernel: [431168.945059] [<f8968300>]
> e1000_clean_rx_irq+0x0/0x500 [e1000]
> 6640953 firewall 11:46:54 kernel: [431168.945071] [<f896569e>]
> e1000_clean+0x5e/0x250 [e1000]
> 6640954 firewall 11:46:54 kernel: [431168.945085]
> [net_rx_action+0x12d/0x210] net_rx_action+0x12d/0x210
> 6640955 firewall 11:46:54 kernel: [431168.945099] [__do_softirq+0x82/0x110]
> __do_softirq+0x82/0x110
> 6640956 firewall 11:46:54 kernel: [431168.945109] [do_softirq+0x55/0x60]
> do_softirq+0x55/0x60
> 6640957 firewall 11:46:54 kernel: [431168.945113] [irq_exit+0x6d/0x80]
> irq_exit+0x6d/0x80
> 6640958 firewall 11:46:54 kernel: [431168.945117] [do_IRQ+0x40/0x70]
> do_IRQ+0x40/0x70
> 6640959 firewall 11:46:54 kernel: [431168.945121]
> [find_busiest_group+0x1bd/0x760] find_busiest_group+0x1bd/0x760
> 6640960 firewall 11:46:54 kernel: [431168.945130]
> [common_interrupt+0x23/0x28] common_interrupt+0x23/0x28
> 6640961 firewall 11:46:54 kernel: [431168.945142] [<f897007b>]
> e1000_init_hw+0x34b/0xb50 [e1000]
> 6640962 firewall 11:46:54 kernel: [431168.945156]
> [ipv6:_spin_lock+0x3/0x10] _spin_lock+0x3/0x10
> 6640963 firewall 11:46:54 kernel: [431168.945163] [<f8b67606>]
> ad_rx_machine+0x26/0x690 [bonding]
> 6640964 firewall 11:46:54 kernel: [431168.945179]
> [lock_timer_base+0x27/0x60] lock_timer_base+0x27/0x60
> 6640965 firewall 11:46:54 kernel: [431168.945183]
> [delayed_work_timer_fn+0x0/0x20] delayed_work_timer_fn+0x0/0x20
> 6640966 firewall 11:46:54 kernel: [431168.945194] [<f8b68290>]
> bond_3ad_state_machine_handler+0xf0/0x9b0 [bonding]
> 6640967 firewall 11:46:54 kernel: [431168.945206]
> [queue_delayed_work_on+0x7c/0xb0] queue_delayed_work_on+0x7c/0xb0
> 6640968 firewall 11:46:54 kernel: [431168.945214]
> [usbcore:queue_delayed_work+0x51/0x70] queue_delayed_work+0x51/0x70
> 6640969 firewall 11:46:54 kernel: [431168.945221] [<f8b681a0>]
> bond_3ad_state_machine_handler+0x0/0x9b0 [bonding]
> 6640970 firewall 11:46:54 kernel: [431168.945229]
> [run_workqueue+0xbf/0x160] run_workqueue+0xbf/0x160
> 6640971 firewall 11:46:54 kernel: [431168.945240] [worker_thread+0x0/0xe0]
> worker_thread+0x0/0xe0
> 6640972 firewall 11:46:54 kernel: [431168.945245] [worker_thread+0x84/0xe0]
> worker_thread+0x84/0xe0
> 6640973 firewall 11:46:54 kernel: [431168.945251] [<c0145fc0>]
> autoremove_wake_function+0x0/0x40
> 6640974 firewall 11:46:54 kernel: [431168.945260] [worker_thread+0x0/0xe0]
> worker_thread+0x0/0xe0
> 6640975 firewall 11:46:54 kernel: [431168.945265] [kthread+0x42/0x70]
> kthread+0x42/0x70
> 6640976 firewall 11:46:54 kernel: [431168.945269] [kthread+0x0/0x70]
> kthread+0x0/0x70
> 6640977 firewall 11:46:54 kernel: [431168.945274]
> [kernel_thread_helper+0x7/0x10] kernel_thread_helper+0x7/0x10
> 6640978 firewall 11:46:54 kernel: [431168.945284] =======================
>
> Can you help me?
>
> Many thanks
>
> ---
> Sim
>