Wireless connection does not re-connect

Bug #1181964 reported by Bernd Edlinger
This bug affects 1 person
Affects: network-manager (Ubuntu)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

If the access point is rebooted, Linux won't re-connect.
This behaviour is reproducible with the -43 kernel version.
The only way to re-connect is to disable and re-enable the wireless connection, or to reboot.

Previous versions, -41 and -39, re-connected automatically after a few seconds.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-43-generic-pae 3.2.0-43.68
ProcVersionSignature: Ubuntu 3.2.0-43.68-generic-pae 3.2.42
Uname: Linux 3.2.0-43-generic-pae i686
NonfreeKernelModules: fcclassic
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
AplayDevices: aplay: device_list:252: no soundcards found...
ApportVersion: 2.0.1-0ubuntu17.2
Architecture: i386
ArecordDevices: arecord: device_list:252: no soundcards found...
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/pcmC0D1p', '/dev/snd/midiC0D0', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Mon May 20 09:53:54 2013
HibernationDevice: RESUME=UUID=20273ec5-368b-49e4-b7b4-b9983dc66732
InstallationMedia: Ubuntu 12.04.1 LTS "Precise Pangolin" - Release i386 (20120817.1)
Lsusb: Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: Giga-Byte Technology CO., LTD i440BX-W977
MarkForUpload: True
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=de_DE.UTF-8
 SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-43-generic-pae root=UUID=cb9535b0-5aba-42d3-9186-2a5255f2d167 ro nomodeset quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-43-generic-pae N/A
 linux-backports-modules-3.2.0-43-generic-pae N/A
 linux-firmware 1.79.4
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/18/00
dmi.bios.vendor: Award Software International, Inc.
dmi.bios.version: 4.51 PG
dmi.board.name: i440BX-W977
dmi.board.vendor: Corporation Name
dmi.board.version: 1.0
dmi.chassis.type: 2
dmi.chassis.vendor: Corporation Name
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnAwardSoftwareInternational,Inc.:bvr4.51PG:bd12/18/00:svnGiga-ByteTechnologyCO.,LTD:pni440BX-W977:pvr1.0:rvnCorporationName:rni440BX-W977:rvr1.0:cvnCorporationName:ct2:cvr1.0:
dmi.product.name: i440BX-W977
dmi.product.version: 1.0
dmi.sys.vendor: Giga-Byte Technology CO., LTD

Revision history for this message
Bernd Edlinger (bernd-edlinger) wrote :
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream stable kernel? Please test the latest v3.2 stable kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2.45-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
tags: added: regression-update
Bernd Edlinger (bernd-edlinger) wrote :

Ok, I installed the following kernel: linux-image-3.2.45-030245-generic-pae_3.2.45-030245.201305140735_i386.deb

This bug is definitely fixed with this image:
A short disconnect (reboot AP) => network re-connects immediately.
A long disconnect (1 minute power-off AP) => network re-connects after 5 minutes.

If you like I can upload syslog from upstream kernel for analysis.
Thanks!

tags: added: kernel-fixed-upstream
Bernd Edlinger (bernd-edlinger) wrote :

Oops, sorry...

A few hours later the bug showed up again in the upstream kernel, without me trying to reproduce it.

The kernel messages leading up to time index 43224.360095 look very similar to those from the -43 kernel:
[43223.763471] wlan0: authenticate with 00:13:49:e3:9d:8e (try 1)
[43223.960105] wlan0: authenticate with 00:13:49:e3:9d:8e (try 2)
[43224.160122] wlan0: authenticate with 00:13:49:e3:9d:8e (try 3)
[43224.360095] wlan0: authentication with 00:13:49:e3:9d:8e timed out

Therefore: linux 3.2.45 does not fix this issue completely!

Attached are the kernel messages from the complete session:
first two successful re-connects, and later the failed re-connect.

tags: added: kernel-bug-exists-upstream
removed: kernel-fixed-upstream
Bernd Edlinger (bernd-edlinger) wrote :

Hello,

Unfortunately, I must admit that I am no longer able to reproduce this bug here.

What I did was compile and install a locally built test kernel over 3.2.0-43, and later
re-install the 3.2.0-43 kernel from the .deb file in /var/cache/apt/archives.

And guess what: now the network connects nicely again.

Therefore I no longer think that the change between 3.2.0-41 and 3.2.0-43 is connected to
the connectivity problem, because it is just one line of code, in this function, which is apparently
never executed on my machine (I added a printk there, but it never fired):

--- linux-3.2.0/kernel/events/core.c
+++ linux-3.2.0/kernel/events/core.c
@@ -5164,7 +5164,7 @@

 static int perf_swevent_init(struct perf_event *event)
 {
- int event_id = event->attr.config;
+ u64 event_id = event->attr.config;

        if (event->attr.type != PERF_TYPE_SOFTWARE)
                return -ENOENT;

Therefore the reason must be somewhere else. In case the
network problem comes back, I added the following lines to
/etc/NetworkManager/NetworkManager.conf:

[logging]
level=DEBUG
domains=HW,RFKILL,ETHER,WIFI,BT,MB,DHCP4,DHCP6,PPP,WIFI_SCAN,IP4,IP6,AUTOIP4,DNS,VPN,SHARING,SUPPLICANT,AGENTS,SETTINGS,SUSPEND,CORE,DEVICE,OLPC,WIMAX

Maybe those traces will give us some better insight.

Thanks, and Good Bye.

Bernd Edlinger (bernd-edlinger) wrote :

Now it did happen again. The NetworkManager traces show that the access point had a short interruption
and was removed from the network list. Only one other AP remains in the list, but the data is wrong:
ap_list_dump() does not show the same data as "iwlist wlan0 scanning".

=> Therefore it seems to be a NetworkManager bug, but one that is much harder to reproduce than I initially assumed.

Bernd Edlinger (bernd-edlinger) wrote :

OK, looking at nm-device-wifi.c, I think I see what is wrong:

When the AP sends its beacons almost all the time but reboots quickly at some point, it can be
deleted by cull_scan_list() if that function happens to run at exactly that moment. This requires that the AP is
not the active AP, that the list entry was not changed for 3*SCAN_INTERVAL_MAX, and that the property
WPAS_REMOVED_TAG is set. The underlying supplicant process only sends a bss_removed or new_bss message
when the live list changes from its perspective.

However, when the AP is missing from the live list in a single scan, that does not necessarily mean the connection
really breaks. But the flag WPAS_REMOVED_TAG is now set, and it is not reset when new_bss is received later and
the AP data is updated, although it should be cleared at that point. So once that flag is set, cull_scan_list()
can remove the AP at any time, even if the connection would be restored very quickly.
This should be fixed by adding the following line to merge_scanned_ap() where the AP data is updated:

g_object_set_data (G_OBJECT (ap), WPAS_REMOVED_TAG, NULL);

The function supplicant_iface_bss_removed_cb() should probably also set the last-seen property to the current time.
Adding the following statements there prevents the AP from being removed immediately
if the signal strength did not change recently:

g_get_current_time (&now);
nm_ap_set_last_seen (ap, now.tv_sec);
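
The reasoning above can be illustrated with a small standalone model. This is only a sketch and not NetworkManager code: struct ap_model and the functions should_cull(), on_bss_removed() and on_bss_seen_again() are hypothetical stand-ins for the entry handling in cull_scan_list(), supplicant_iface_bss_removed_cb() and merge_scanned_ap(). The value of SCAN_INTERVAL_MAX is assumed, and the with_fix parameter merely switches the two proposed changes on or off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define SCAN_INTERVAL_MAX 120 /* seconds; assumed value, for illustration only */

/* Hypothetical stand-in for an AP list entry; not the real NMAccessPoint. */
struct ap_model {
    time_t last_seen;   /* time the entry was last refreshed */
    bool   removed_tag; /* models the WPAS_REMOVED_TAG object data */
    bool   is_current;  /* true if this is the active AP */
};

/* Cull decision as described in the comment above: drop the entry only if
 * it is not the active AP, is tagged as removed, and has not been refreshed
 * for more than 3*SCAN_INTERVAL_MAX seconds. */
static bool should_cull(const struct ap_model *ap, time_t now)
{
    assert(ap != NULL);
    if (ap->is_current || !ap->removed_tag)
        return false;
    return ap->last_seen + 3 * SCAN_INTERVAL_MAX < now;
}

/* Models the bss_removed handler. With the proposed change, last_seen is
 * set to the current time so the entry is not culled immediately. */
static void on_bss_removed(struct ap_model *ap, time_t now, bool with_fix)
{
    ap->removed_tag = true;
    if (with_fix)
        ap->last_seen = now;
}

/* Models the update of an existing entry when new_bss arrives. Refreshing
 * last_seen here is an assumption of this model. With the proposed change,
 * the removed tag is cleared again. */
static void on_bss_seen_again(struct ap_model *ap, time_t now, bool with_fix)
{
    ap->last_seen = now;
    if (with_fix)
        ap->removed_tag = false;
}
```

In this model, with with_fix set to false, an entry that is tagged at t=1000 and seen again at t=1002 stays eligible for culling as soon as it is older than 3*SCAN_INTERVAL_MAX, because the tag is never cleared. With with_fix set to true, the tag is cleared when the AP is seen again and the entry stays in the list.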

Bernd Edlinger (bernd-edlinger) wrote :

I am pretty sure that this is a NetworkManager bug, and I know how to solve it.

affects: linux (Ubuntu) → network-manager (Ubuntu)
Bernd Edlinger (bernd-edlinger) wrote :

I compiled the network-manager component as follows:

cd network-manager-0.9.4.0

./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --enable-more-warnings=no

make

Installed (as root): src/.libs/NetworkManager to /usr/sbin/NetworkManager (I renamed the original NetworkManager for a possible undo).

The patch has been running on my machine for 24 hours now, without any permanent connectivity loss so far.

Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "this is a proposed fix for this bug." seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Bernd Edlinger (bernd-edlinger) wrote :

Yes, it is a patch.

Steps to reproduce:

- configure a permanent connection to a WEP-encrypted AP.
- the AP has very good signal quality, and is 99.9% of the time available.
- only sporadic short 1-2 seconds interruptions of the beacon.
- one or two other APs have very poor signal quality, and enter/leave the live list often.

After some days/weeks/months the WiFi connection breaks, and NetworkManager
no longer sees the AP, although "iwlist wlan0 scanning" sees it 99.9% of the time.

That state lasts until either NetworkManager is restarted or the AP is shut down for 10 minutes
and then started again; at that point the AP is visible again.

I am 100% sure that this has an impact on other users too.

Changed in network-manager (Ubuntu):
status: Incomplete → Confirmed
Bernd Edlinger (bernd-edlinger) wrote :

PING...
This patch has now been running non-stop for 6 weeks...

Bernd Edlinger (bernd-edlinger) wrote :

Status update

I was finally able to fix this issue upstream, see: https://bugzilla.gnome.org/show_bug.cgi?id=733105

But it is only completely fixed in network-manager 1.0.

If you want to fix something for Ubuntu 12.04 or Ubuntu 14.04,
you can use my latest local patches.

Note that in Ubuntu 14.04 there is also a bug in wpa_supplicant
which can lock up the radio work queue, which makes any further
WiFi connections impossible.

Fortunately that must have been fixed in the meantime, because
I found a fix in the upstream wpa_supplicant repository.

Bernd Edlinger (bernd-edlinger) wrote :

Here is the latest network-manager patch for 14.04.

Bernd Edlinger (bernd-edlinger) wrote :

This is a wpa_supplicant fix that was found upstream.
It is only necessary for Ubuntu 14.04.

The wpa_supplicant from Ubuntu 12.04 does not try to
do an internal scan for the AP, and does not need any fix.
