Bug #625239 “X starts on wrong tty because gdm starts before nvi...” : Bugs : nvidia-graphics-drivers package : Ubuntu

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-08-27:

#1

BootDmesg.txt Edit (60.4 KiB, text/plain; charset="utf-8")
CurrentDmesg.txt Edit (5.0 KiB, text/plain; charset="utf-8")
DRM.card0.DVI.I.1.txt Edit (107 bytes, text/plain; charset="utf-8")
DRM.card0.SVIDEO.1.txt Edit (107 bytes, text/plain; charset="utf-8")
Dependencies.txt Edit (5.1 KiB, text/plain; charset="utf-8")
GdmLog.txt Edit (25.7 KiB, text/plain; charset="utf-8")
GdmLog1.txt Edit (25.2 KiB, text/plain; charset="utf-8")
GdmLog2.txt Edit (37.6 KiB, text/plain; charset="utf-8")
Lspci.txt Edit (14.4 KiB, text/plain; charset="utf-8")
Lsusb.txt Edit (386 bytes, text/plain; charset="utf-8")
PciDisplay.txt Edit (790 bytes, text/plain; charset="utf-8")
ProcCpuinfo.txt Edit (1.5 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt Edit (1.6 KiB, text/plain; charset="utf-8")
ProcModules.txt Edit (3.7 KiB, text/plain; charset="utf-8")
RelatedPackageVersions.txt Edit (232 bytes, text/plain; charset="utf-8")
UdevDb.txt Edit (105.8 KiB, text/plain; charset="utf-8")
UdevLog.txt Edit (253.6 KiB, text/plain; charset="utf-8")
XorgLog.txt Edit (33.9 KiB, text/plain; charset="utf-8")
XorgLogOld.txt Edit (33.9 KiB, text/plain; charset="utf-8")
Xrandr.txt Edit (3.6 KiB, text/plain; charset="utf-8")
monitors.xml.txt Edit (1.6 KiB, text/plain; charset="utf-8")
peripherals.txt Edit (1.5 KiB, text/plain; charset="utf-8")
setxkbmap.txt Edit (296 bytes, text/plain; charset="utf-8")
xdpyinfo.txt Edit (17.3 KiB, text/plain; charset="utf-8")
xkbcomp.txt Edit (57.0 KiB, text/plain; charset="utf-8")

Bryce Harrington (bryce) on 2010-08-27

affects:

xorg (Ubuntu) → xserver-xorg-video-ati (Ubuntu)

Bryce Harrington (bryce) on 2010-08-27

Changed in xserver-xorg-video-ati (Ubuntu):
status:	New → Confirmed

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-08-27:

#2

This part of the Gdm 1 log is suspicious:

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x3b) [0x80e83bb]
1: /usr/bin/X (0x8048000+0x5da8d) [0x80a5a8d]
2: (vdso) (__kernel_rt_sigreturn+0x0) [0xc0840c]
3: /usr/bin/X (0x8048000+0x25d3e) [0x806dd3e]
4: /usr/bin/X (0x8048000+0x1a5ba) [0x80625ba]
5: /lib/libc.so.6 (__libc_start_main+0xe7) [0x3c8ce7]
6: /usr/bin/X (0x8048000+0x1a191) [0x8062191]

Caught signal 3 (Quit). Server aborting

Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
Please also check the log file at "/var/log/Xorg.0.log" for additional information.

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-08-27: Re: Pressing enter key kills X

#3

I think this one should be triaged. Should I mark it so? If it has been confirmed and it has enough details to get fixed, and if it has no dups then it should be triaged.

summary:	- Pressing enter kills X + Pressing enter key kills X
Changed in xserver-xorg-video-ati (Ubuntu):
importance:	Undecided → Medium

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-08-27:

#4

It appears bug #532047 is the correct one and this is a dup.

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-08-28:

#5

I did an experiment. I renamed the plymouth-splash.conf file in the /etc/init folder to *.disabled and then re-started without plymouth. It worked!

This problem is affecting plymouth and not X.

summary:	- Pressing enter key kills X + Pressing Enter crashes both X and plymouth in graphical mode
description:	updated

Bryce Harrington (bryce) on 2010-08-29

tags:

added: crash

Revision history for this message

Andres (andres-erbsen) wrote on 2010-08-29: Re: Pressing Enter crashes both X and plymouth in graphical mode

#6

This bug is still here from me in Maverick.
plymouth 0.8.2-2ubuntu3.
The crashed session is on VT7, new one on VT8.
Crashes on ENTER. Not tried "2".
Intel 4500 graphics. This is unrelated to ati.
Encrypted home, if it matters.

Revision history for this message

Andres (andres-erbsen) wrote on 2010-08-29:

#7

I have intel graphics card and still have this bug, so it's unrelated to ati driver.

Changed in xserver-xorg-video-ati (Ubuntu):
status:	Confirmed → Invalid

Robert Hall (reh07) on 2010-08-29

Changed in plymouth (Ubuntu):
status:	New → Confirmed

Bilal Akhtar (bilalakhtar) on 2010-08-29

Changed in plymouth (Ubuntu):
importance:	Undecided → Medium

Revision history for this message

Andres (andres-erbsen) wrote on 2010-08-30:

#8

plymouth-debug.log Edit (21.6 KiB, text/plain)

Daniel Hahler (blueyed) on 2010-09-01

Changed in plymouth (Ubuntu):
status:	Confirmed → Triaged

Steve Langasek (vorlon) on 2010-09-01

Changed in plymouth (Ubuntu):
status:	Triaged → Confirmed

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-01:

#9

I think its good to triage this one, I don't know why steve didn't. Should I?

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-01:

#10

"triaged" means "contains all the information necessary for a developer to start working on it". I don't know this to be the case - when I look at this bug, *I* don't see what appears to be the information needed to reproduce and debug this; if this were a widespread problem there would be many more open bug reports about it, so I think this is specific to your environment, in a way that will be difficult to pinpoint. So this bug is confirmed, but not triaged.

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-02:

#11

I got the reason why this is happening. When X starts up, it sends a deactivate request to plymouth and not a quit request. So, plymouth keeps running, and it interprets the keypresses of enter and 2 to be of a quit request.

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-02:

#12

Hence, I have made a patch on my sys (testing right now) which is more of a 'hack' rather than a fix, since it instructs plymouth to quit on deactivate requests as well. Should work well, since during the boot phase, its the only time it gets a deactivate request.

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-02:

#13

Didn't work, give up

Revision history for this message

Robert Hall (reh07) wrote on 2010-09-02:

#14

@Steve Langasek:

What log/trace are we missing?

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-03:

#15

Download full text (3.4 KiB)

After spending hours pouring over Google results and bug reports like bug #529230 and bug #532047, I believe this is the bug I'm seeing on my Lucid (Ubuntu 10.04 64-bit on a new System76 Serval Pro--serp6) laptop. I have a decent amount of information to share, which corroborates this bug, and I've tried every workaround possible with no luck.

Honestly, bug #529230 appears a better fit, but it's listed as a duplicate of bug #532047, which supposedly released a fix. It was suggested that I comment on this bug. The slight difference between my experience and this bug's description is that I don't see a crash as soon as I press Enter, but rather later in my session it will crash when hitting Enter (after having pressed Enter many of times already).

Here is what I can confirm:

1. Upon first boot, I login to GNOME, and after normal use anywhere between 5 minutes and an hour, it will crash (always as I press Enter, to the best of my knowledge). I see a blank screen, and then the login screen again.

2. If I login again (after a crash, or if I just logout right away and login again), I never see another X reset until after a reboot.

3. After an initial boot up and first login, I'm always on tty1 or tty2, and I'll always get the X reset eventually (I'm certain I've witnessed this on both VT1 and VT2).

4. After logging in a second time (or if I just logout and login again after a boot), I'm always on VT8 and I never have an X reset.

5. On VT7 (Ctrl-Alt-F7), for both first and second logins, I always see an error like this:
(process:516) GLib-WARNING **: getpwuid_r(): failed due to unknown user id (0)

6. I see the following in auth.log, just before a crash (cat /var/log/auth.log | grep FAILED):
Sep 1 21:38:16 jkrug-serval login[6215]: FAILED LOGIN (1) on '/dev/tty1' FOR 'UNKNOWN', Authentication failure
Sep 1 21:38:19 jkrug-serval login[6215]: FAILED LOGIN (2) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:22 jkrug-serval login[6215]: FAILED LOGIN (3) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:25 jkrug-serval login[6215]: FAILED LOGIN (4) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:27 jkrug-serval login[6215]: FAILED LOGIN (5) on '/dev/tty1' FOR 'UNKNOWN', Error in service module

Here are the suggested workarounds that do NOT work for me:

1. I've tried adding APT source ppa:ubuntu-x-swat/x-updates and reinstalling my nVidia (current) driver, as suggested in this thread on the issue:
http://ubuntuforums.org/showthread.php?t=1545846

2. I tried changing /etc/default/grub back to standard distro settings instead of System76-shipped version (also suggested in above mentioned thread).

3. I also tried the "temporary workaround" in bug #532047 -- sudo mv /etc/init/plymouth-splash.conf /etc/init/plymouth-splash.conf.disabled -- no change. *Actually, I later noticed that on the day I tried this, I definitely had an X reset/crash, but noticed that there was no "UNKNOWN" grep from /var/log/auth.log on that date. So, slight change, or just an anomaly, but either way still a horribly annoying bug.

I've spent many hours on this and would love to see it go away. BTW, I have nVidia current driv...

After spending hours pouring over Google results and bug reports like bug #529230 and bug #532047, I believe this is the bug I'm seeing on my Lucid (Ubuntu 10.04 64-bit on a new System76 Serval Pro--serp6) laptop. I have a decent amount of information to share, which corroborates this bug, and I've tried every workaround possible with no luck.

Honestly, bug #529230 appears a better fit, but it's listed as a duplicate of bug #532047, which supposedly released a fix. It was suggested that I comment on this bug. The slight difference between my experience and this bug's description is that I don't see a crash as soon as I press Enter, but rather later in my session it will crash when hitting Enter (after having pressed Enter many of times already).

Here is what I can confirm:

1. Upon first boot, I login to GNOME, and after normal use anywhere between 5 minutes and an hour, it will crash (always as I press Enter, to the best of my knowledge). I see a blank screen, and then the login screen again.

2. If I login again (after a crash, or if I just logout right away and login again), I never see another X reset until after a reboot.

3. After an initial boot up and first login, I'm always on tty1 or tty2, and I'll always get the X reset eventually (I'm certain I've witnessed this on both VT1 and VT2).

4. After logging in a second time (or if I just logout and login again after a boot), I'm always on VT8 and I never have an X reset.

5. On VT7 (Ctrl-Alt-F7), for both first and second logins, I always see an error like this:
(process:516) GLib-WARNING **: getpwuid_r(): failed due to unknown user id (0)

6. I see the following in auth.log, just before a crash (cat /var/log/auth.log | grep FAILED):
Sep 1 21:38:16 jkrug-serval login[6215]: FAILED LOGIN (1) on '/dev/tty1' FOR 'UNKNOWN', Authentication failure
Sep 1 21:38:19 jkrug-serval login[6215]: FAILED LOGIN (2) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:22 jkrug-serval login[6215]: FAILED LOGIN (3) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:25 jkrug-serval login[6215]: FAILED LOGIN (4) on '/dev/tty1' FOR 'UNKNOWN', Error in service module
Sep 1 21:38:27 jkrug-serval login[6215]: FAILED LOGIN (5) on '/dev/tty1' FOR 'UNKNOWN', Error in service module

Here are the suggested workarounds that do NOT work for me:

1. I've tried adding APT source ppa:ubuntu-x-swat/x-updates and reinstalling my nVidia (current) driver, as suggested in this thread on the issue:
http://ubuntuforums.org/showthread.php?t=1545846

2. I tried changing /etc/default/grub back to standard distro settings instead of System76-shipped version (also suggested in above mentioned thread).

3. I also tried the "temporary workaround" in bug #532047 -- sudo mv /etc/init/plymouth-splash.conf /etc/init/plymouth-splash.conf.disabled -- no change. *Actually, I later noticed that on the day I tried this, I definitely had an X reset/crash, but noticed that there was no "UNKNOWN" grep from /var/log/auth.log on that date. So, slight change, or just an anomaly, but either way still a horribly annoying bug.

I've spent many hours on this and would love to see it go away. BTW, I have nVidia current drivers with Ubuntu 10.04 64bit on my Dell Vostro 1510, which never has this issue--always login to VT7. I would be super appreciative if anyone has any ideas or if I can provide any further information. Thanks in advance for any help!

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-03:

#16

@Jamie:
If you don't mind my asking, what version of Plymouth are you running?

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-03:

#17

@John:
Synaptic shows plymouth 0.8.2-2ubuntu2.

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-03:

#18

Bilal,

X sends no requests at all to plymouth. The display manager (gdm, kdm, etc) does this; and each of these DMs are patched in Ubuntu to send plymouth a quit request at the appropriate time. If you are not using the Ubuntu 10.04 packages for the display manager, it's possible this is the cause of the problem you're seeing.

plymouth is NOT supposed to quit on 'deactivate' requests, and any patch to make it do this would be rejected.

Anyway, if pressing enter causes X to exit, the problem is most specifically *not* a result of plymouth capturing and responding to these keypresses; it's almost certainly a result of X being started on the wrong VT, one which already has a getty running on it. What VT is X listed as running on in a process list when this happens? (You'll want to switch to another VT and log in on the console to check this.)

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-03:

#19

Steve,

I have X running on VT1 or VT2 after every reboot, and I always eventually have X crash on me when pressing Enter (but never the first pressing of Enter).

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-04: Re: [Bug 625239] Re: Pressing Enter crashes both X and plymouth in graphical mode

#20

On Fri, 2010-09-03 at 20:24 +0000, Steve Langasek wrote:
> Bilal,
>
> X sends no requests at all to plymouth. The display manager (gdm, kdm,
> etc) does this; and each of these DMs are patched in Ubuntu to send
> plymouth a quit request at the appropriate time. If you are not using
> the Ubuntu 10.04 packages for the display manager, it's possible this is
> the cause of the problem you're seeing.

You're right. This was the first time I tried to play with plymouth, and
I messed up stuff. Thankfully, I was able to revert back to the Ubuntu
packages without any further problem and now I am back to ground zero.

>
> plymouth is NOT supposed to quit on 'deactivate' requests, and any patch
> to make it do this would be rejected.
>
> Anyway, if pressing enter causes X to exit, the problem is most
> specifically *not* a result of plymouth capturing and responding to
> these keypresses; it's almost certainly a result of X being started on
> the wrong VT, one which already has a getty running on it. What VT is X
> listed as running on in a process list when this happens? (You'll want
> to switch to another VT and log in on the console to check this.)

I will check this soon.
>

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-04:

#21

On Fri, 2010-09-03 at 20:24 +0000, Steve Langasek wrote:
> Bilal,
>
> X sends no requests at all to plymouth. The display manager (gdm, kdm,
> etc) does this; and each of these DMs are patched in Ubuntu to send
> plymouth a quit request at the appropriate time. If you are not using
> the Ubuntu 10.04 packages for the display manager

FYI I am using Maverick with all updates installed.

>
> plymouth is NOT supposed to quit on 'deactivate' requests, and any patch
> to make it do this would be rejected.
>
> Anyway, if pressing enter causes X to exit, the problem is most
> specifically *not* a result of plymouth capturing and responding to
> these keypresses; it's almost certainly a result of X being started on
> the wrong VT, one which already has a getty running on it. What VT is X
> listed as running on in a process list when this happens? (You'll want
> to switch to another VT and log in on the console to check this.)
>

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-04: Re: Pressing Enter crashes both X and plymouth in graphical mode

#22

Probably Jamie is having a different problem, since the workaround on bug #532047 worked very well for me.

Revision history for this message

dino99 (9d9) wrote on 2010-09-05:

#23

what i've experienced:

- on lucid i386 + x-swat, with no encryption: no issue at all about this problem
- on maverick updated (i386 + x-swat + no encryption):

system boot fine then login seem ok ( no crash when validating password) but later when i need to enter password again ( for example when asked by synaptic or any other app/service), at that time, validating this password (by pressing "enter") drop me back to gdm login. Reentering password and validating it again work fine this time and dont crash till next reboot.

On my end, this issue is only with maverick and appear 2 weeks ago ( my installation is with gnome+compiz, will try without compiz and report back if any change)

Revision history for this message

dino99 (9d9) wrote on 2010-09-05:

#24

auth.log Edit (3.3 KiB, text/plain)

what is strange: when i watch errors logged into auth.log, my password is declared each time as "incorrect" (im sure its good) after the first login

Revision history for this message

dino99 (9d9) wrote on 2010-09-05:

#25

what happen after few tweaks:

- compiz package installed, system/prefs/appearance/visual effects: no visual effects
- system/prefs/compizconfig: disabled all "effects" & "extras" settings & session manager (into utils)

on next reboot: first login always fail, second is fine. Got these errors logged:

polkitd(authority=local): Unregistered Authentication Agent for session /org/freedesktop/ConsoleKit/Session3 (system bus name :1.37, object path /org/gnome/PolicyKit1/AuthenticationAgent, locale fr_FR.utf8)
gdm-session-worker[1562]: pam_succeed_if(gdm:auth): requirement "user ingroup nopasswdlogin" not met by user "oem"
gdm-session-worker[1562]: pam_unix(gdm:auth): conversation failed
gdm-session-worker[1562]: pam_unix(gdm:auth): auth could not identify password for [oem]
gdm-session-worker[1562]: pam_winbind(gdm:auth): getting password (0x00000388)
gdm-session-worker[1562]: pam_winbind(gdm:auth): Could not retrieve user's password
gdm-session-worker[2716]: pam_succeed_if(gdm:auth): requirement "user ingroup nopasswdlogin" not met by user "oem"
gdm-session-worker[2716]: pam_unix(gdm:session): session opened for user oem by (uid=0)
gdm-session-worker[2716]: pam_ck_connector(gdm:session): nox11 mode, ignoring PAM_TTY :0

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-05:

#26

Sep 5 10:56:23 bilal-laptop gdm-session-worker[2039]: pam_succeed_if(gdm:auth): requirement "user ingroup nopasswdlogin" not met by user "bilal"
Sep 5 10:56:28 bilal-laptop gdm-session-worker[2039]: pam_unix(gdm:session): session opened for user bilal by (uid=0)

It appears that GDM is putting up a false requirement for login , which is failing when we press enter.

Revision history for this message

dino99 (9d9) wrote on 2010-09-05:

#27

new tweak: removing/purging everything about compiz, then rebooting

this time this issue is gone: login work as it might, reinstalling compiz make this issue back again: so as this happen on maverick and not with lucid (without compiz), it seems to me that something into pam/gdm/compiz settings process is conflicting with password built with letters+num

my gdm login is roc53sa: so into gdm login, i first enable numlock then validate my password, and i've often seen numlockx been disabled when re-entering password for app or service. The problem exist with/without compiz and cant find what is disabling numlock: i only see it disabled when pressing "enter" to validate a password, so is it a keyboard setting issue or a utf8-FR or numlockx itself or gdm ?

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-05:

#28

I'm an Ubuntu amateur, so believe what Steve says over what I say. He's no amateur.

I don't think the message "requirement "user ingroup nopasswdlogin" not met by user "bilal" " is a problem. I see from the forum post http://ubuntuforums.org/showthread.php?t=819198 that this is a requirement that a userid like "guest" might meet so that no password is required for that particular userid. Userid's requiring passwords should not be in this group. If I'm right, and if the userid bilal and oem require passwords, that particular message in the log would be expected.

The reason I suggested in now closed duplicate bug #529230 that someone look in the auth file was to look for evidence that the getty program , responsible for getting people logged in on virtual consoles 1-6, was active seeing keystrokes coming from the xsession and getting confused. This only happens if x is started on tty1-tty6, as is happening to Jamie. See Steve's comment 18 above.

This probably isn't happening to Bilal. His xorg.0.log.old has the X crash and that log says it started on vt7. Andres says his crashing X is starting on vt7.

I haven't gotten my head around dino99's report, but I don't think most people who have had this problem have had a problem with authentication, only with interference with x, either from plymouth (which should have quit just before X started, if I'm correct), or from getty. In the case of getty, I think the interference only crashes X after getty thinks, mistakenly, that it has seen several login attempts and restarts. It messes with the virtual terminal when it does that and that does in X when you next press the 2 or enter key. That might be the reason for Jamie's delay before the crash.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-05:

#29

@Jamie:
What are you seeing right before gdm brings up the login screen? In particular, are you watching the splash screen?

From the command "sudo ls -l /var/run/gdm" when was the firstserver.stamp file created?

Revision history for this message

Chris Halse Rogers (raof) wrote on 2010-09-06: Re: X starts on wrong tty: pressing enter after 5 minutes crashes X

#30

So, I think there may be two separate issues here:
1) X started on the wrong tty, resulting in getty eventually timing out and resetting the tty input, manifests as “X crashes when pressing enter after 5 minutes or so”.
2) X started on the correct tty, but with incorrect input settings, manifests as “X crashes as soon as I press enter or 2 in gdm”. In this case, tty7 has some plymouth garbage text on it, including some “\▉”.

dino99 & Jamie Krug seem to have bug (1). Bilal & I have bug (2).

I think this bug should continue to be a discussion of bug (1), where X starts on the wrong tty and then gets killed after a timeout. Further work on bug (2) can continue on bug #626723.

summary:

- Pressing Enter crashes both X and plymouth in graphical mode
+ X starts on wrong tty: pressing enter after 5 minutes crashes X

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-06:

#31

Okay, I agree with what Chris is saying, it is correct that whenever I switched back to tty7 after the crash, I saw ^C or \▉ over there.

Bilal Akhtar (bilalakhtar) on 2010-09-06

Changed in plymouth (Ubuntu):
importance:	Medium → High

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-06:

#32

@John:
I do see a splash screen. Everything is relatively quick, as I have an SSD in my new System76 laptop :D I see a blank screen with a blinking cursor, followed by a quick flicker to the splash screen, followed by another blank-screen/flicker, and then the login screen. I think there's a very quick flash of text either before or after the Ubuntu splash screen, which seems it might be the message I always see on tty7: "GLib-WARNING **: getpwuid_r(): failed due to unknown user id (0)".

As for the timestamp of the file /var/run/gdm/firstserver.stamp, it seems to be the timestamp of the most recent clean boot.

Thanks to all for the help. It sounds like I'm commenting on the right bug, which is exciting in and of itself. I have a friend who purchased a nearly identical System76 Serval laptop around the same time, and he's seeing exactly what I'm reporting as well. If there's any other logs or tests I can share to provide feedback, I'm eager to help out!

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-06:

#33

Download full text (3.1 KiB)

@Jamie:

Since I'm only another member of the community it's completely possible that I won't be able to see this through to a fix. A glance from a plymouth or gdm developer may mean more progress than I could make in days. If it's ok with you, though, maybe we can move this towards a successful triage state on our own so it will be easy for a developer to find the problem. I appreciate your willingness to try things. I may not recognize when we have enough info.

I think your reply was very interesting.

1. You have a system that flies. This bug may be a problem caused by a race condition. A faster system may have different results than a slower one. How many processors or cores does your system have? What speed? I ask this to look for correlations between your report and those by others.

2. You are configured to use splash and it is working.

3. The timestamp on your file is correct. Because you were occasionally starting on tty2, I had myself half convinced that it would have an old date and that gdm wasn't even trying to start on the same virtual terminal as plymouth. It may be that gdm thinks plymouth's active terminal is tty1 (or maybe even tty2).

4. Between the time you see the splash and you see the login you see a blink. There isn't supposed to be a blink. That may be where your virtual console is switching from tty7 to tty1 and X is sealing its fate by starting there (or may not).

Some basics first:

You said you are running plymouth version plymouth 0.8.2-2ubuntu2. What version of gdm and upstart are you running?

Could you try adding the boot option plymouth:debug in grub and on that boot recreate the problem? Could you then attach the plymouth debugging log, gdm log, and the log from the failing X. It will probably be in Xorg.0.log.old after you restart. The failing X log will contain a traceback and say something about a quit signal or another signal depending upon the key you pressed when it crashed.

@Dino:
If you are still with us. Would you mind terribly telling us, when it fails for you, what you see? What versions of plymouth, gdm, and upstart are you running? What is your computer like? Like Jamie, does the creation time of /var/run/gdm/firstserver.stamp match the login time (it should). I'm just trying to confirm there aren't two separate underlying reasons for starting X on tty1 or tty2.

Warning: If this is a race condition just adding plymouth:debug might change the timing and "fix" the problem by slowing plymouth down. The logs may still be helpful. I'm thinking that it's a race condition and that Dino's compiz experiment changed the timing. I certainly may be wrong.

The purpose of the logs is to make sure plymouth is running on tty7 as expected, to see if gdm is asking if plymouth is active on plymouth's virtual terminal, and to watch where X starts. The idea is to get an idea where things get directed to tty1. We won't need the auth logs. They simply helped establish the interference from the getty process. We already know if you are starting X on tty1 that we have this problem.

Whether you each have the time and inclination to continue on with me or not, thanks to you both for posting your experience...

@Jamie:

Since I'm only another member of the community it's completely possible that I won't be able to see this through to a fix. A glance from a plymouth or gdm developer may mean more progress than I could make in days. If it's ok with you, though, maybe we can move this towards a successful triage state on our own so it will be easy for a developer to find the problem. I appreciate your willingness to try things. I may not recognize when we have enough info.

I think your reply was very interesting.

1. You have a system that flies. This bug may be a problem caused by a race condition. A faster system may have different results than a slower one. How many processors or cores does your system have? What speed? I ask this to look for correlations between your report and those by others.

2. You are configured to use splash and it is working.

3. The timestamp on your file is correct. Because you were occasionally starting on tty2, I had myself half convinced that it would have an old date and that gdm wasn't even trying to start on the same virtual terminal as plymouth. It may be that gdm thinks plymouth's active terminal is tty1 (or maybe even tty2).

4. Between the time you see the splash and you see the login you see a blink. There isn't supposed to be a blink. That may be where your virtual console is switching from tty7 to tty1 and X is sealing its fate by starting there (or may not).

Some basics first:

You said you are running plymouth version plymouth 0.8.2-2ubuntu2. What version of gdm and upstart are you running?

Could you try adding the boot option plymouth:debug in grub and on that boot recreate the problem? Could you then attach the  plymouth debugging log, gdm log, and the log from the failing X. It will probably be in Xorg.0.log.old after you restart. The failing X log will contain a traceback and say something about  a quit signal or another signal depending upon the key you pressed when it crashed.

@Dino:
If you are still with us. Would you mind terribly telling us, when it fails for you, what you see? What versions of plymouth, gdm, and upstart are you running? What is your computer like? Like Jamie, does the creation time of /var/run/gdm/firstserver.stamp match the login time (it should). I'm just trying to confirm there aren't two separate underlying reasons for starting X on tty1 or tty2.

Warning: If this is a race condition just adding plymouth:debug might change the timing and "fix" the problem by slowing plymouth down. The logs may still be helpful. I'm thinking that it's a race condition and that Dino's compiz experiment changed the timing. I certainly may be wrong.

The purpose of the logs is to make sure plymouth is running on tty7 as expected, to see if gdm is asking if plymouth is active on plymouth's virtual terminal, and to watch where X starts. The idea is to get an idea where things get directed to tty1. We won't need the auth logs. They simply helped establish the interference from the getty process. We already know if you are starting X on tty1 that we have this problem.

Whether you each have the time and inclination to continue on with me or not, thanks to you both for posting your experiences.

Revision history for this message

dino99 (9d9) wrote on 2010-09-07:

#34

:0-slave.log Edit (566 bytes, text/plain)

hi John,

here are your #33 requests:

my system:
home made desktop: p5wdh+q9550+8500gt+4go ddr2+2pata+1sata
os: maverick i386 with latest main server updates (35-20-generic-pae)
packages: plymouth 0.2.8.2-2ubuntu3, gdm 2.30.5-0ubuntu2, upstart 0.6.6-3, nvidia-current 256.53-0ubuntu2

note: compiz completly removed/purged

how it boot:

start as it might, gdm ask for password, i enable numlock as my password is built with num+letters, write the password and validate (enter): at that time gdm fail and it ask again for password, but numlock is again disabled (strange), i enabled numlock again and validate my password, now its working.

what is logged:

i've booted with plymouth:debug on the boot line.
/var/run/gdm/ : cant read it, the icon folder is greyed with an x, own by root:gdm (i've my user added to gdm group)

gdm-session-worker[1201]: GLib-GObject-CRITICAL: g_value_get_boolean: assertion `G_VALUE_HOLDS_BOOLEAN (value)' failed

Revision history for this message

dino99 (9d9) wrote on 2010-09-07:

#35

:0-greeter.log Edit (1.8 KiB, text/plain)

Revision history for this message

dino99 (9d9) wrote on 2010-09-07:

#36

plymouth-debug2.log Edit (21.0 KiB, text/plain)

some errors seen into debug.log

Revision history for this message

dino99 (9d9) wrote on 2010-09-07:

#37

root@dub:/var/run/gdm# ls
auth-for-gdm-tIW1mo auth-for-oem-Z9n9W2 firstserver.stamp

firstserver.stamp is an empty file

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-07:

#38

merci, Dino

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-07:

#39

@John: Thanks very much for helping this along.

I watched very closely when I booted this morning, and I'm certain that I'm seeing a flash of a text screen just after the Ubuntu splash, but before the GDM login screen, and I'm certain that it is showing this message near the top:
"GLib-WARNING **: getpwuid_r(): failed due to unknown user id (0)"

I can also confirm that this is the message, as always, at the top of the tty7 screen (I can press Ctrl-Alt-F7 before even logging in to confirm this). As always, I'm usually on tty2 (or tty1) after a boot, and tty8 after logging out and logging in again.

Here are my hardware specs:
System76 Serval Pro (SER-P6) laptop
Ubuntu 10.04 (Lucid Lynx) 64 Bit Linux
Core i7-840QM Processor ( 45nm, 8MB L3 Cache, 1.86GHz )
8 GB - DDR3 1333 MHz
Nvidia GeForce GTX 285M Graphics with 1GB GDDR3 Video Memory
160 GB Intel X25-M Solid State Drive

I can provide the logs you requested soon, but have a couple questions. To add the boot option plymouth:debug in grub, would I just append to the GRUB_CMDLINE_LINUX_DEFAULT parameter in my /etc/grub/default file and run update-grub? So, it would look like this:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash plymouth:debug"

I'm also uncertain as to where the plymouth debugging log will be located.

As for a gdm log, I see a number of different logs under /var/log/gdm/ -- which one should I provide?

Revision history for this message

dino99 (9d9) wrote on 2010-09-07:

#40

hi Jamie,
have done it by editing the boot line (E) into grub menu ( added at end of quiet splash ) then boot (ctrl+x)

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-07:

#41

On Tue, Sep 7, 2010 at 10:53 AM, dino99 <email address hidden> wrote:
> hi Jamie,
> have done it by editing the boot line (E) into grub menu ( added at end of quiet splash ) then boot (ctrl+x)

Yep. That's what I do, too. As soon as the grub menu appears I move the arrow to the item I want to boot and press e for edit. Then I add (or subtract) what I want and press ctr+x. I think your method would work too, though. The method Jamie and I use is just for one boot and the one you suggest would continue to set the option every time you boot until you change it back.

With regard to which /var/log/gdm files, I would open a terminal and do a "sudo ls -ltr /var/log/gdm" and look at the bottom of the output for the files that were created since the boot. The you can copy with "sudo cp /var/log/gdm/FILE ~/Desktop/" or some such.

Jamie, please remember to post your upstart and gdm versions so your all your logs can be matched up with the software version you are running. Please forgive me if you included it above and I missed it. I see that you are running the same version of plymouth I am. (Dino's running maverick and has a newer version).

You are both very patient.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-07:

#42

John, here's the version info you requested:
gdm 2.30.2.is.2.30.0-0ubuntu3
upstart 0.6.5-7
plymouth 0.8.2-2ubuntu2
nvidia-current 256.52-0ubuntu0sarvatt3~lucid

Next I'll reboot with plymouth debugging and post some logs...

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-07:

#43

Okay, this is really strange--I suddenly have no /etc/grub/ directory. I've edited /etc/grub/default before, and I'm certain I had a couple backup files in there too. Any thoughts???

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-07:

#44

Breaking news! John, I think you're definitely on to something with your idea of this being a race condition. When I just rebooted, the periodic disk check ran (at the splash screen, essentially), so I paid close attention, remembering the race condition hunch. I noted that there was no flicker before transitioning to the login screen, and low and behold: I was on VT7 after boot for the first time ever, to the best of my knowledge! Okay, I'll get back to trying to get some plymouth debugging and logs for you, but wanted to report this.

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-07: Re: [Bug 625239] Re: X starts on wrong tty: pressing enter after 5 minutes crashes X

#45

On Tue, Sep 07, 2010 at 05:59:47PM -0000, Jamie Krug wrote:
> Okay, this is really strange--I suddenly have no /etc/grub/ directory.
> I've edited /etc/grub/default before, and I'm certain I had a couple
> backup files in there too. Any thoughts???

This is /boot/grub, not /etc/grub.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-07: Re: X starts on wrong tty: pressing enter after 5 minutes crashes X

#46

Steve: I've always modified /etc/grub/default in the past, as suggested here:
https://wiki.ubuntu.com/Grub2#grub (/etc/default/grub)

...and in `cat /etc/grub.d/README`:
All executable files in this directory are processed in shell expansion order.
  00_*: Reserved for 00_header.
  10_*: Native boot entries.
  20_*: Third party apps (e.g. memtest86+).
The number namespace in-between is configurable by system installer and/or
administrator. For example, you can add an entry to boot another OS as
01_otheros, 11_otheros, etc, depending on the position you want it to occupy in
the menu; and then adjust the default setting via /etc/default/grub.

I know I've modified /etc/grub/default and run update-grub a few times on this particular laptop as well--just not sure what happened or what I did to make it disappear :P

I'm curious, is it safe to simply reinstall plymouth, upstart and grub-pc from Synaptic, just to get me back to the defaults? I would assume so, but those are critical packages that I'm a bit nervous about ;-)

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-07:

#47

FYI: In an attempt to keep my environment as clean as possible, I've removed the ppa:ubuntu-x-swat/x-updates repo, which I'd added earlier in an attempt to fix this issue (http://ubuntuforums.org/showthread.php?t=1545846). Since this did not work, I've removed it and reinstalled my nvidia-current driver. So I now have:
nvidia-current 195.36.24-0ubuntu1~10.04

That said, keep in mind that I've noted identical behavior with this bug with both nvidia-current 195.36.24-0ubuntu1~10.04 and 256.52-0ubuntu0sarvatt3~lucid.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-07:

#48

Doh! Sorry for the confusion, I was just looking for /etc/grub/default instead of /etc/default/grub. I'll have another cup of coffee and try this again ;-)

On another note, in an effort to get that file back (the one that I'd a mistaken path for, which was there all along), I installed startupmanager, which is just a GUI editor for GRUB2. I didn't set anything fancy, but after doing so, I rebooted, and I see an uglier splash screen, however, I've booted to tty7 twice in a row! I wish I knew which change caused this, but I have a feeling that something minor in the grub2 config has made this issue disappear for the moment.

I've compared a generic /etc/default/grub to the "good" one that was just generated by startupmanager, and there was only one very tiny difference...
BAD:
GRUB_CMDLINE_LINUX=""
GOOD:
GRUB_CMDLINE_LINUX=" splash vga=799"

The latter booted me to tty7 twice in a row! How on earth?? To be sure it wasn't a fluke or some other change, I uninstalled startupmanager, copied the old/original /etc/default/grub over the generated one, ran update-grub and rebooted. Sure enough, I was on tty2 and saw the same old error message on tty7; and also logged out and logged in to find myself on tty8 again.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-07:

#49

I don't have /etc/grub/default either. I do have /etc/default/grub. The file names are confusing.

I see no reference to the former in update-grub or grub-mkconfig, but grub-mkconfig includes the contents of /etc/default/grub. Are you missing /etc/default/grub?

I sometimes get nervous about grub, too. I do keep my boot cd's handy just in case.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-07:

#50

Our comments crossed. Sorry.

Please take your time and only do things you are ok with.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-07:

#51

Okay, I've done some experimenting with my linux command options. If I have vga=799 on there, I boot to tty7 every time. With or without the "quite" and/or "splash" options, I see the same results. However, when I tried vga=771, I end up on tty1 or tty2 again.

So, for whatever reason, this line in my /etc/default/grub file has made this bug disappear for me:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash vga=799"

This makes the splash screen show a more text-like "Ubuntu" instead of the image logo I saw before, but I could care less what boot looks like at this point, if there are no more GDM crashes!

Ironically, this seems to suggest that the "vga" option used (again, originally generated when I tried startupmanager) is deprecated:
https://bugs.launchpad.net/startup-manager/+bug/483262

My guess is that this is just a lucky workaround. It seems the most likely guess (from me, random community member:) is that there's still a race condition, but this option is somehow disguising it?

I'll try to find some time later to post some logs still, with plymouth debugging on and such. But I have a tentative workaround, on my laptop, I think?!?

Revision history for this message

dino99 (9d9) wrote on 2010-09-07:

#52

hi John,

have you seen errors logged into # 36 ? seem interesting to digg around the libs issues

to Jamie: please stay tuned to this report subject only, launchpad isn't a forum or chat.

Revision history for this message

Carl Richell (carlrichell) wrote on 2010-09-07:

#53

Hi Jamie,

Great find! With vga=799 my Serval boots on vt7 every time. As you noted vga=799 is depreciated and replaced by gfxpayload. You can have the correct boot image and start on VT7 by adding the following lines to /etc/default/grub and then "sudo update-grub".

---- I have tested this on the 1600x900 Serval display. Please let me know the results on your 1920 x 1080 LCD.

add GRUB_GFXMODE and GRUB_GFXPAYLOAD_LINUX as below.

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480
GRUB_GFXMODE=1920x1080
GRUB_GFXPAYLOAD_LINUX=1920x1080

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-08:

#54

Cool, Jamie. I'm glad you found a work-around, and I appreciate your kind offer to post the logs later. I'd like to see how they match up with dino's, both with regard to the messages dino saw in his plymouth log post, and with regard to the little dance gdm and plymouth do to keep on the same VT7 (and not flicker). For now I'm thinking about how that dance could go wrong.

It may be that changing to the text splash is changing a race condition, but it may also be that the code in plymouth for the graphics splash screen has a bug that the text screen software doesn't have. I'm not betting one way or the other.

@Dino:
I think the messages are ok, but I certainly may be mistaken. In the first case, while x11.so isn't being used, the program then tries drm.so, which seems to be working since there is not an error message. I would guess that Jamie's non-working configuration may be using drm.so also, and that his text mode splash may be using frame-buffer.so or vga16fb.so, but we will have to wait to see.

In the second case, plymouth is trying several ways to get configuration information. It finally uses /lib/plymouth/themes/default.plymouth which tells plymouth to use the ubuntu splash screen. I'm predicting Jamie and all other Ubuntu users will see the same. It simply knows how it should be configured by what file it can read and from what files it cannot read.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-08:

#55

@Carl: I'm sorry to say that your suggested changes to /etc/default/grub did not work for me. I noticed that 1920x1080 was not listed when I tried the vbeinfo command from a GRUB command prompt, as suggested in GRUB2 docs/help. I also tried a resolution that *is* listed as supported for me, 1280x800. This one made the splash look pretty, but I still had an error on VT7 and ended up on VT2. I logged out and back in to get on VT8 again, and quickly had another reset occur when simply clicking a button in Chrome. Shortly after this, I tried combining the GRUB_GFXMODE and GRUB_GFXPAYLOAD_LINUX settings with the vga=799 on the linux line, but I then received very erratic behavior. I think I was booted to VT8, but had a GDM crash then bump me to VT9, and I also ended up getting bounced back to VT8 from VT9 with another GDM crash at some point. So, as it remains, it's still just adding vga=799 to GRUB_CMDLINE_LINUX_DEFAULT (or GRUB_CMDLINE_LINUX) that seems to be the one mysterious workaround for me.

@dino99: I presume you're referring to my comments #43 and #46? I do apologize, as they got off topic, but I was convinced that a file disappeared due to testing workarounds for this bug. Unfortunately, I'd jumped to conclusions and it was merely a silly oversight on my part. Back to the bug ;-)

Revision history for this message

dino99 (9d9) wrote on 2010-09-08:

#56

new test: booting without "vga=xxx quiet splash"

it give the same issues:
- first login fail and reset numlock
- second login usually works, except this time (few minutes ago) it crashed (fall back to gdm) while writing a previous comment, but without resetting numlock

so, what i think is:
- as this last crash have happened about 10 minutes after the real login (second try), its hard to blame plymouth on that case, might think about gdm / X instead (cant see where we could find a race)

- while booting without quiet/vga/splash, i've not seen any complaint, so i'm only thinking to blame gdm or one of its dependencies (maybe some wrong rights)

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-08:

#57

Jamie and Dino:

If I understand, you each reported a problem with an X crash when X had already crashed. As Dino pointed out, that's very different. Would you please post the failing Xorg log? You can probably figure out which one by the time on the file.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-08:

#58

plymouth-debug.log Edit (18.0 KiB, text/plain)

I tried editing the boot line from the grub menu and added plymouth:debug to the end, but then the boot just froze at the splash screen. I was unable to switch to another VT or find any way to get rid of the splash screen, so had to do a hard power off. It did, however, produce a debug log file, which I'm attaching.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-08:

#59

Xorg.2.log.old Edit (18.7 KiB, application/x-trash)

I reproduced the bug and will now attach these three files, which all had the time stamp of the crash/reset to GDM login screen (and as usual, was on VT1 after boot, then VT8 after crash and second login):
/var/log/Xorg.2.log.old
/var/log/gdm/:2.log.1
/var/log/gdm/:2-slave.log.1

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-08:

#60

:2.log.1 Edit (15.2 KiB, text/plain)

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-08:

#61

:2-slave.log.1 Edit (664 bytes, text/plain)

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-08:

#62

gdm.tar.gz Edit (8.4 KiB, application/x-tar)

John:
Unfortunately, I don't seem to have an Xorg log for yesterday's crashes. I'll try to be more diligent about grabbing those, but I had some quick unexpected X crashes when experimenting with grub settings yesterday. I do have a bunch of logs dated for yesterday under /var/log/gdm/, but I don't know what I'm looking for. I'm attaching a .tar.gz of everything under /var/log/gdm/ from yesterday.

Revision history for this message

dino99 (9d9) wrote on 2010-09-08:

#63

Xorg.0.log.old Edit (15.0 KiB, text/plain)

have a backtrace on gdm fallback

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-08:

#64

Thanks for sticking with this even though you guys have found circumventions. If someone can fix this without adding boot parameters, etc. -- if it just works-- it will be better for many Ubuntu users.

@Jamie:

It looks like :2.log.1 is what I would expect for the bug we have been hunting. It was started quiet splash and crashed on VT2 after hitting Enter key rather than hitting the 2 key.

:0.log.2 started on VT7 and X couldn't get started. It suggests looking at the corresponding Xorg.0.log file and some other files. It says no screens are available. It may mean what you were trying in booting it up wasn't giving you a good graphic environment to use, but I don't know that. The complaint it has is:
(EE) Sep 07 22:16:47 NVIDIA(0): Failed to initialize the NVIDIA kernel module. Please see the
(EE) Sep 07 22:16:47 NVIDIA(0): system's kernel log for additional error messages and
(EE) Sep 07 22:16:47 NVIDIA(0): consult the NVIDIA README for details.
(EE) NVIDIA(0): *** Aborting ***
(EE) Screen(s) found, but none have a usable configuration.
Perhaps it is a non-drm nvidia boot where the X driver and kernel driver don't match? I really don't know, but it isn't what we have been looking at and doesn't appear to have anything to do with plymouth or splash screens.
:0.log.3 looks like a good splash start on VT8
:0.log.4 looks like an X crash during one of your VGA= experiments. I can't tell what the crash was and don't know if it is the same as what we have been looking at. It doesn't match :0.log.2, though.

:1.log.2 looks like a good session on VT8.
:1.log.3 looks like a session on VT2 that ended before it crashed
The rest are unremarkable as far as I can tell right now.

Here's the thing. The file I was most interested in seeing was your plymouth-debug.log. The one you posted shows the same error message Dino and I have (I'm not crashing). It shows the same variation in error messages we expected (you are using the fourth renderer). It shows gdm knows it is running and asks it to get ready to go away. But then it seems to end rather abruptly, unlike Dino's.

If you copies and pasted it could you have missed some of it?

Could you put your system back to the original state except for asking for plymouth:debug and attaching another so we can confirm it stops there.

If that's all there is then plymouth crashed or was killed by another process before gdm started X, and that differs from Dino's case where it's the same to the point where gdm asks plymouth to get ready to go away (deactivate). THEN Dino's (and mine) ask it if it is controlling the VT you are looking at, and then starts X and later asks plymouth to quit.

@Dino:
Xorg.0.log looks to me like a Enter key crash on VT1, but when you booted without the splash or quiet parameters. I assume you didn't see a splash screen that time (plymouth would still run, I believe). Good confirming log.

Thanks for sticking with this even though you guys have found circumventions. If someone can fix this without adding boot parameters, etc. -- if it just works-- it will be better for many Ubuntu users.

@Jamie:

It looks like :2.log.1 is what I would expect for the bug we have been hunting. It was started quiet splash and crashed on VT2 after hitting Enter key rather than hitting the 2 key.

:0.log.2 started on VT7 and X couldn't get started. It suggests looking at the corresponding Xorg.0.log file and some other files. It says no screens are available. It may mean what you were trying in booting it up wasn't giving you a good graphic environment to use, but I don't know that. The complaint it has is: 
(EE) Sep 07 22:16:47 NVIDIA(0): Failed to initialize the NVIDIA kernel module. Please see the
(EE) Sep 07 22:16:47 NVIDIA(0):     system's kernel log for additional error messages and
(EE) Sep 07 22:16:47 NVIDIA(0):     consult the NVIDIA README for details.
(EE) NVIDIA(0):  *** Aborting ***
(EE) Screen(s) found, but none have a usable configuration.
Perhaps it is a non-drm nvidia boot where the X driver and kernel driver don't match?  I really don't know, but it isn't what we have been looking at and doesn't appear to have anything to do with plymouth or splash screens.
:0.log.3 looks like a good splash start on VT8
:0.log.4 looks like an X crash during one of your VGA= experiments. I can't tell what the crash was and don't know if it is the same as what we have been looking at. It doesn't match :0.log.2, though.

:1.log.2 looks like a good session on VT8.
:1.log.3 looks like a session on VT2 that ended before it crashed
The rest are unremarkable as far as I can tell right now.

Here's the thing. The file I was most interested in seeing was your plymouth-debug.log. The one you posted shows the same  error message Dino and I have (I'm not crashing). It shows the same variation in error messages we expected (you are using the fourth renderer). It shows gdm knows it is running and asks it to get ready to go away. But then it seems to end rather abruptly, unlike Dino's.

If you copies and pasted it could you have missed some of it?

Could you put your system back to the original state except for asking for plymouth:debug and attaching another so we can confirm it stops there.

If that's all there is then plymouth crashed or was killed by another process before gdm started X, and that differs from Dino's case where it's the same to the point where gdm asks plymouth to get ready to go away (deactivate). THEN Dino's (and mine) ask it if it is controlling the VT you are looking at, and then starts X and later asks plymouth to quit.

@Dino:
Xorg.0.log looks to me like a Enter key crash on VT1, but when you booted without the splash or quiet parameters. I assume you didn't see a splash screen that time (plymouth would still run, I believe). Good confirming log.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-08:

#65

plymouth-debug_20100908-1347.log Edit (22.4 KiB, text/plain)

More test results, log, and another workaround:
- booting with "plymouth:debug" added to boot line, but "quiet splash" removed, results in booting to VT7 every time, and no crashes (sample plymouth debug log attached).
- booting with "quiet plymouth:debug" (but no "splash"), same results--boots to VT7 every time, and no crashes.
- booting with "quiet splash plymouth:debug" results in a frozen splash screen--splash logo, 5 dots turn red, then nothing at all.
- simply removing the "quiet splash" from boot line (and no "plymouth:debug" either) results in bug behavior--boots to VT2 and eventually crashes on pressing enter (again, rarely the first time though, maybe once).

@John: Unfortunately, this means I really can't provide the plymouth-debug.log you're looking for. The only condition that allows me to boot with plymouth debugging seems to be another mysterious workaround to this issue. Again, is it possible that simply adding the plymouth:debug option may be just enough to delay whatever race condition might be causing this bug? I'm clueless to why I'm getting a splash freeze with plymouth:debug.

Revision history for this message

Carl Richell (carlrichell) wrote on 2010-09-08:

#66

@Jamie,

I tested a Serval with a full 1080p HD LCD (1920x1080). Interestingly, 1600x900x32 provides a clean splash screen and starts on VT7.

GRUB_GFXMODE=1600x900x32
GRUB_GFXPAYLOAD_LINUX=1600x900x32

Also, it appears this bug is fixed in Maverick. However, the splash screen when using nVidia's proprietary driver is text.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-08:

#67

@Carl,
Dino99 has the problem with Maverick.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-09:

#68

@Jamie,
Do you remember how you got the first log?

Could the log you posted be from some time it hung? If it's complete it does look like plymouth crashed. For the crash situation does anything get recorded in the plymouth-debugging log?

Maybe we should look in your /var/log/syslog from the time of a crash to look for any messages related to what is happening.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-09:

#69

log-crash-from-vt8-after-2nd-login.tar.gz Edit (7.7 KiB, application/x-tar)

@Carl,
Those grub resolution settings wreak havoc on me. I booted to vt2, so logged out, logged in, on vt8. Then, for the first time ever, I had the X reset after a logout/login cycle (some logs attached for this). Shortly thereafter, I had yet another crash, and these did not seem to be related to pressing Enter this time. Clearly, messing with GRUB_GFXMODE/GRUB_GFXPAYLOAD_LINUX creates a very volatile environment for me. I also tried keeping those settings, but using one of my previously successful workarounds (plymouth:debug on boot line, no "quiet splash"), and I did in fact boot to vt7, but very quickly had a couple more crashes.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-09:

#70

syslog-20100908-1000.txt Edit (133.1 KiB, text/plain)

@John,
Yes, the first plymouth-debug.log was generated the first time I added "plymouth:debug" to the boot line, but I had a freeze on the splash screen and powered off. When I booted again without plymouth:debug I found that log file there, so it obviously generated that prior to the freeze and me powering off. To your second question, again, I'm unable to boot with plymouth:debug and recreate the crash--it's another workaround, which is unfortunate for gathering diagnostics in this situation ;-)

I'm attaching a section of my syslog file. The Xorg and GDM logs I posted in comments #59 through #62 were related to a crash that occurred around Sep 8 10:00 (10:00:17, to be exact, it seems). So the syslog file lines I'm attaching should cover that entire boot-to-crash cycle.

Thanks again for the continued assistance on this!

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-09:

#71

It's a shame about the crash getting in our way, but oh well. I don't see anything in the log either, sad to say. the acpid messages at 10:00:17 are from an unrelated daemon that tells interested processes on your computer about closing the lid, etc., if I'm not mistaken.

Could you recreate the situation where X is on tty1 and, before X has a chance to quit, switch to a different tty, login, and save the output of the ps aux command? e.g. ps aux >~/ps.out and then attach ps.out?

In particular the X command in the process list should tell us how gdm is starting X when there is going to be an X crash on Enter. Steve suggested this command in another bug report for someone having a similar problem.

I'm still reading source code.

Revision history for this message

dino99 (9d9) wrote on 2010-09-09:

#72

maybe some good news:

http://www.ubuntuupdates.org/packages/show/216647

in the past few hours we have had some fixing about apt too, so there is some hope about better conf.

i've made some more tweaks: purging/reinstalling gdm plymouth jockey apt-xapian and updating packages ( my issues are only with maverick) . The mountall package seem to be related with our problem (see the comments in the above link).

On my end, the situation seem to be better since the latest apt & mountall upgrades: after this install i've rebooted, logged only 1 time as it might, no crash, so i'm waiting to see what happen now before commenting again.

to John, from # 64:
*** @Dino:
Xorg.0.log looks to me like a Enter key crash on VT1, but when you booted without the splash or quiet parameters. I assume you didn't see a splash screen that time (plymouth would still run, I believe). Good confirming log. ****

you are right, it was only a gdm fallback asking to log again, no more.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-09:

#73

ps.out Edit (30.0 KiB, text/plain)

John, I've attached ps aux output while X is on tty2 (this is usually what happens, but occasionally tty1--eventual crash in both cases). Thanks.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-09:

#74

@dino99: I installed updates this morning, which included the mountall update for Lucid. I rebooted, but still ended up with X on tty2, and crashed to the login screen after about 30 minutes of bouncing around in Chrome browser tabs.

Revision history for this message

dino99 (9d9) wrote on 2010-09-09:

#75

hi Jamie,

good news on my end: since #72 and its tweaks, my issues are gone, now gdm logging is ok, tested several reboots along this day and no problem.

So i might conclude that old (udev) conf was breaking this gdm/X stuff, dont know exactly where to find the good explanations as i've purged/reinstalled several packages at once (have removed lot of python files too) and finished by reinstalling ubuntu-desktop.

Its not the first time that ive fixed issues by purging/reinstalling as its the easiest way to fix lot of troubles, sadly "dpkg configure" and/or "dpkg-reconfigure" is not clever enough to rebuild good conf, and other tools like gtkorphan / gconf-cleaner / bleachbit, that i often used, dont know how to remove these so many old /wrong settings left behind. Ubuntu / Linux might work on these tools instead of spending time fixing such garbage.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-09:

#76

@Dino
Could you attach the output from ps as Jamie did? If it starts happening again perhaps you can do the same again during a session you expect to crash?

@Jamie and Dino
Could you look again at that file creation time?

sudo ls -l /var/run/gdm/firstserver.stamp

It should exist and be empty, it's its creation/modification time that is important. It should be created the first time gdm starts at each boot, which is what Jamie reported above. It must not exist before gdm starts. If it is older than the last boot there may be a problem with the /var/run file system.

@Jamie
That ps.out file floored me. Unless I've misread the code, it looks to me like you and Dino have two different underlying causes for this bug.

By the way, I assume anybody who has this bug can log out immediately after first logging in and won't have a subsequent crash.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-09:

#77

It would be great if anyone else having the problem could post the output from the ps command on their system while they are running X on tty1 through tty6.

@Dino:
That's great news!

Revision history for this message

dino99 (9d9) wrote on 2010-09-09:

#78

ps.out Edit (20.2 KiB, text/plain)

hello John,

some readings for you !!

oem@dub:~$ sudo ls -l /var/run/gdm/firstserver.stamp
[sudo] password for oem:
-rw-r--r-- 1 root root 0 2010-09-09 16:03 /var/run/gdm/firstserver.stamp

--> time is ok with my last logging

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-09:

#79

Hi Dino
Thanks for the information. That's good about firstserver.stamp.

From the ps.out I assume that the bug is still gone on your system since X started on tty7. If it ever fails again (starting on tty1-tty6) that time stamp, the output of ps before X crashes, and, if possible, the output in /var/log/plymouth-debug.log would be good information to gather to restart an investigation (from *my* perception).

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-09:

#80

Ok, joining here by Gruber's request from #532047. I have temporarily removed my workaround from my comment #91 there, so I'm testing against a 'standard' lucid install on a very fast booting system with i386 generic-pae, 8Gb RAM, Q9550 and nvidia-current.

(a) adding plymouth:debug affects timing enough to stop the problem from occurring (all my X sessions in 4 reboots ended up on vt7).
(b) removed the plymouth:debug again
(c) next reboot, my disks got fsck-ed - no problem (timing issue, once again)
(d) next reboot my X session ends up on tty3. I can see the -isig flag with 'stty -F /dev/tty3'. I can safely hit enter in the greeter. I did not login.
(e) Instead, I help fate a little by doing 'stty -F /dev/tty3 isig' from the vt1, then switch to vt3 and press the '2' key. Lo and behold: immediate X crash (this is the first time I witnessed a crash from another trigger key than enter, I think). After the crash, X is relaunched on tty8

(f) another reboot, X session ends up on tty3 again and I start entering this comment. Upon entering the thrid digit of the bug number reference (a '2') the X session vanishes. Much cursing ensues :) and I relogin to start over. Note that I did not notice the scrolllock light as a premonition this time (it might not be deterministic).

So no real news, other than a heap of confirmations for you:

(A) vt allocation race is a root cause
(B) enter as well as '2' key can crash the session if vt is 'armed' (+isig)
(C) scroll lock premonition is not deterministic
(D) vt allocation is not deterministic (varies from vt2, vt3 mostly to vt7 on 'slow boot' with any delaying option or event)

Last but not least, I noticed (again) that on changing to vt1 for the analysis steps, some of the time my primary display's output would be garbled. Changing back and forth a few times would show several different distortions, until finally the output becomes ok across both displays. I'm not sure whether the 'garbling' occurs if X was on vt7 from the start. Also, I don't know, if this could be related and if it was related, whether the nvidia driver might be _causing_ some of the effects here, or whether the vt allocation issues are _causing_ the nvidia driver to have trouble when switching vts.
I just thought I'd mention it because (for the first time) I thought it might actually be related. I have noticed occasional text console garbling (large block characters, random pixel noise in bright colours and other issues). Perhaps someone recognizes that behaviour?

So I'm sorry I cannot post any useful plymouth-debug.log. I'm not sure whether you're interested in an actual ps listing, or whether you were simply after the above tty numbers? (I use 'ps -ef | grep X' to see them).

PS. Just thought of attaching the Xorg.0.log from when the X session was crashed with the '2' key. I couldn't find the actual log (the log file naming convention seems magic to me), so I'll make a new one and post it later

Ok, joining here by Gruber's request from #532047. I have temporarily removed my workaround from my comment #91 there, so I'm testing against a 'standard' lucid install on a very fast booting system with i386 generic-pae, 8Gb RAM, Q9550 and nvidia-current.

(a) adding plymouth:debug affects timing enough to stop the problem from occurring (all my X sessions in 4 reboots ended up on vt7).
(b) removed the plymouth:debug again
(c) next reboot, my disks got fsck-ed - no problem (timing issue, once again)
(d) next reboot my X session ends up on tty3. I can see the -isig flag with 'stty -F /dev/tty3'. I can safely hit enter in the greeter. I did not login.
(e) Instead, I help fate a little by doing 'stty -F /dev/tty3 isig' from the vt1, then switch to vt3 and press the '2' key. Lo and behold: immediate X crash (this is the first time I witnessed a crash from another trigger key than enter, I think). After the crash, X is relaunched on tty8

(f) another reboot, X session ends up on tty3 again and I start entering this comment. Upon entering the thrid digit of the bug number reference (a '2') the X session vanishes. Much cursing ensues :) and I relogin to start over. Note that I did not notice the scrolllock light as a premonition this time (it might not be deterministic).

So no real news, other than a heap of confirmations for you:

(A) vt allocation race is a root cause
(B) enter as well as '2' key can crash the session if vt is 'armed' (+isig)
(C) scroll lock premonition is not deterministic
(D) vt allocation is not deterministic (varies from vt2, vt3 mostly to vt7 on 'slow boot' with any delaying option or event)

Last but not least, I noticed (again) that on changing to vt1 for the analysis steps, some of the time my primary display's output would be garbled. Changing back and forth a few times would show several different distortions, until finally the output becomes ok across both displays. I'm not sure whether the 'garbling' occurs if X was on vt7 from the start. Also, I don't know, if this could be related and if it was related, whether the nvidia driver might be _causing_ some of the effects here, or whether the vt allocation issues are _causing_ the nvidia driver to have trouble when switching vts. 
I just thought I'd mention it because (for the first time) I thought it might actually be related. I have noticed occasional text console garbling (large block characters, random pixel noise in bright colours and other issues). Perhaps someone recognizes that behaviour?

So I'm sorry I cannot post any useful plymouth-debug.log. I'm not sure whether you're interested in an actual ps listing, or whether you were simply after the above tty numbers? (I use 'ps -ef | grep X' to see them).

PS. Just thought of attaching the Xorg.0.log from when the X session was crashed with the '2' key. I couldn't find the actual log (the log file naming convention seems magic to me), so I'll make a new one and post it later

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-09:

#81

Download full text (3.4 KiB)

Heya, that turned quite some interesting information.
I think (and hope) you may find these Xorg logs very interesting, as they certainly surprised me, which is why I attach all the four described log files. (Rest assured, 3 are tiny).

Here is what happened:

first of I had a number of reboots spontaneously landing my X desktop on vt7... bummer :) I did however confirm this way, that indeed switching to vt1 will _not_ garble my primary display output on my nvidia card when X is on vt7 proper.

On the next reboot, I had a garbled text console on vt1, so I knew before looking that the X session was probably not on vt7 that time: indeed it was on tty2! "-isig" was listed, so it wasn't armed and dangerous. However, looking at the X logs was a big surprise:

/var/log$ ls -ltr --full-time /var/log/Xorg.*
-rw-r--r-- 1 root root 6423 2010-09-10 00:40:01.096801267 +0200 /var/log/Xorg.0.log
-rw-r--r-- 1 root root 6423 2010-09-10 00:40:01.272800811 +0200 /var/log/Xorg.1.log
-rw-r--r-- 1 root root 15698 2010-09-10 00:45:17.064801287 +0200 /var/log/Xorg.2.log.old
-rw-r--r-- 1 root root 13521 2010-09-10 00:47:06.201536090 +0200 /var/log/Xorg.2.log

Those .0, .1 and .2 (old) logs are in surprisingly rapid succesion, not? Indeed the first two contain:

(EE) Sep 10 00:40:01 NVIDIA(0): Failed to initialize the NVIDIA kernel module. Please see the
(EE) Sep 10 00:40:01 NVIDIA(0): system's kernel log for additional error messages and
(EE) Sep 10 00:40:01 NVIDIA(0): consult the NVIDIA README for details.

just before their respective shutdowns. Only the third attempt (Xorg.2.log.old) worked. Now, this was the one that I actually logged onto, while it was running "-isig" on vt2. I played around a while waiting for a crash opportunity. Unfortunately that took a while. Therefore I switched to vt1, submitted 'stty -F /dev/tty2 isig', chvt 2 and was able to immediately cause X to crash by pressing '2' (sorry for being impatient). As you will be able to see, Xorg.2.log.old then gets terminated with the following sequence:

(II) Power Button: Close
(II) UnloadModule: "evdev"
(II) Power Button: Close
(II) UnloadModule: "evdev"
(II) Logitech Optical USB Mouse: Close
(II) UnloadModule: "evdev"
(II) AT Translated Set 2 keyboard: Close
(II) UnloadModule: "evdev"
(II) Macintosh mouse button emulation: Close
(II) UnloadModule: "evdev"

I suppose the many repeats of 'NVIDIA(0): Setting mode "CRT:1280x1024_75+0+0,DFP:2048x1152+1280+0"' before that are because of my repeated switching of vt1 to vt2 and back before the X session crashed. Also note that I recall seeing a different end sequence (involving the SIGINT keyword) a fair while back. This may be due to several reasons: (a) it might have been different before the allegedly released fix (something has been changed by ubuntu devs a while back) (b) it might have been the trigger key ('2' in this case); I used to only recall crashing with the Enter key, which might result in a different X termination mode :)

One more interesting observation: in none of my X sessions (be it on vt2 or vt8) have I been able to make normal use of the scroll lock light (pressing scroll lock does _not_ ...

Heya, that turned quite some interesting information. 
I think (and hope) you may find these Xorg logs very interesting, as they certainly surprised me, which is why I attach all the four described log files. (Rest assured, 3 are tiny).

Here is what happened:

first of I had a number of reboots spontaneously landing my X desktop on vt7... bummer :) I did however confirm this way, that indeed switching to vt1 will _not_ garble my primary display output on my nvidia card when X is on vt7 proper.

On the next reboot, I had a garbled text console on vt1, so I knew before looking that the X session was probably not on vt7 that time: indeed it was on tty2! "-isig" was listed, so it wasn't armed and dangerous. However, looking at the X logs was a big surprise:

/var/log$ ls -ltr --full-time /var/log/Xorg.*
-rw-r--r-- 1 root root  6423 2010-09-10 00:40:01.096801267 +0200 /var/log/Xorg.0.log
-rw-r--r-- 1 root root  6423 2010-09-10 00:40:01.272800811 +0200 /var/log/Xorg.1.log
-rw-r--r-- 1 root root 15698 2010-09-10 00:45:17.064801287 +0200 /var/log/Xorg.2.log.old
-rw-r--r-- 1 root root 13521 2010-09-10 00:47:06.201536090 +0200 /var/log/Xorg.2.log

Those .0, .1 and .2 (old) logs are in surprisingly rapid succesion, not? Indeed the first two contain:

(EE) Sep 10 00:40:01 NVIDIA(0): Failed to initialize the NVIDIA kernel module. Please see the
(EE) Sep 10 00:40:01 NVIDIA(0):     system's kernel log for additional error messages and
(EE) Sep 10 00:40:01 NVIDIA(0):     consult the NVIDIA README for details.

just before their respective shutdowns. Only the third attempt (Xorg.2.log.old) worked. Now, this was the one that I actually logged onto, while it was running "-isig" on vt2. I played around a while waiting for a crash opportunity. Unfortunately that took a while. Therefore I switched to vt1, submitted 'stty -F /dev/tty2 isig', chvt 2 and was able to immediately cause X to crash by pressing '2' (sorry for being impatient). As you will be able to see, Xorg.2.log.old then gets terminated with the following sequence:

(II) Power Button: Close
(II) UnloadModule: "evdev"
(II) Power Button: Close
(II) UnloadModule: "evdev"
(II) Logitech Optical USB Mouse: Close
(II) UnloadModule: "evdev"
(II) AT Translated Set 2 keyboard: Close
(II) UnloadModule: "evdev"
(II) Macintosh mouse button emulation: Close
(II) UnloadModule: "evdev"

I suppose the many repeats of 'NVIDIA(0): Setting mode "CRT:1280x1024_75+0+0,DFP:2048x1152+1280+0"' before that are because of my repeated switching of vt1 to vt2 and back before the X session crashed. Also note that I recall seeing a different end sequence (involving the SIGINT keyword) a fair while back. This may be due to several reasons: (a) it might have been different before the allegedly released fix (something has been changed by ubuntu devs a while back) (b) it might have been the trigger key ('2' in this case); I used to only recall crashing with the Enter key, which might result in a different X termination mode :)

One more interesting observation: in none of my X sessions (be it on vt2 or vt8) have I been able to make normal use of the scroll lock light (pressing scroll lock does _not_ cause it to toggle), making the fact that it can sometimes spontaneously flash directly before crashing X a bit more interesting.

Finally, you should know that the last logfile is from the relaunched X session, which is happily running on vt8 right now where I type this report.

Cheers

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-09:

#82

Xorg logs from the described session Edit (4.4 KiB, application/x-tar)

Must. Remember. Attach. File.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-09:

#83

Ok two more things:
(a) strike the comments about garbling: this just happens when I switch to a text console too quickly after gdm starts (so the video is probably still initializing?). It also happens when X is nicely on vt7 and it doesn't happen with X on vt2, but with enough delay before switching to vt1. So, I'm guessing: unrelated

(b) established that when triggering the crash using the Enter key instead of '2' I get the other backtrace that I remembered from back in the beginning of my analyzing this bug (I think start/mid july). I reproduced this twice, and a third time from a perfectly normal vt7 session by manually going to vt1 and forcing isig on (stty -F /dev/tty7 isig):

(II) Sep 10 01:11:00 NVIDIA(0): Setting mode "CRT:1280x1024_75+0+0,DFP:2048x1152+1280+0"
(II) Power Button: Device reopened after 1 attempts.
(II) Power Button: Device reopened after 1 attempts.
(II) Logitech Optical USB Mouse: Device reopened after 1 attempts.
(II) AT Translated Set 2 keyboard: Device reopened after 1 attempts.
(II) Macintosh mouse button emulation: Device reopened after 1 attempts.

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x3b) [0x80e938b]
1: /usr/bin/X (0x8048000+0x61c8d) [0x80a9c8d]
2: (vdso) (__kernel_rt_sigreturn+0x0) [0xb784f410]
3: /usr/bin/X (0x8048000+0x2a1b0) [0x80721b0]
4: /usr/bin/X (0x8048000+0x1ed7a) [0x8066d7a]
5: /lib/tls/i686/cmov/libc.so.6 (__libc_start_main+0xe6) [0xb7579bd6]
6: /usr/bin/X (0x8048000+0x1e961) [0x8066961]

Caught signal 3 (Quit). Server aborting

Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
Please also check the log file at "/var/log/Xorg.0.log" for additional information.

(II) Power Button: Close
(II) UnloadModule: "evdev"
(II) Power Button: Close
(II) UnloadModule: "evdev"
(II) Logitech Optical USB Mouse: Close
(II) UnloadModule: "evdev"
(II) AT Translated Set 2 keyboard: Close
(II) UnloadModule: "evdev"
(II) Macintosh mouse button emulation: Close
(II) UnloadModule: "evdev"
ddxSigGiveUp: Closing log

Ok two more things: 
(a) strike the comments about garbling: this just happens when I switch to a text console too quickly after gdm starts (so the video is probably still initializing?). It also happens when X is nicely on vt7 and it doesn't happen with X on vt2, but with enough delay before switching to vt1. So, I'm guessing: unrelated

(b) established that when triggering the crash using the Enter key instead of '2' I get the other backtrace that I remembered from back in the beginning of my analyzing this bug (I think start/mid july). I reproduced this twice, and a third time from a perfectly normal vt7 session by manually going to vt1 and forcing isig on (stty -F /dev/tty7 isig):

(II) Sep 10 01:11:00 NVIDIA(0): Setting mode "CRT:1280x1024_75+0+0,DFP:2048x1152+1280+0"
(II) Power Button: Device reopened after 1 attempts.
(II) Power Button: Device reopened after 1 attempts.
(II) Logitech Optical USB Mouse: Device reopened after 1 attempts.
(II) AT Translated Set 2 keyboard: Device reopened after 1 attempts.
(II) Macintosh mouse button emulation: Device reopened after 1 attempts.

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x3b) [0x80e938b]
1: /usr/bin/X (0x8048000+0x61c8d) [0x80a9c8d]
2: (vdso) (__kernel_rt_sigreturn+0x0) [0xb784f410]
3: /usr/bin/X (0x8048000+0x2a1b0) [0x80721b0]
4: /usr/bin/X (0x8048000+0x1ed7a) [0x8066d7a]
5: /lib/tls/i686/cmov/libc.so.6 (__libc_start_main+0xe6) [0xb7579bd6]
6: /usr/bin/X (0x8048000+0x1e961) [0x8066961]

Caught signal 3 (Quit). Server aborting

Please consult the The X.Org Foundation support 
         at http://wiki.x.org
 for help. 
Please also check the log file at "/var/log/Xorg.0.log" for additional information.

(II) Power Button: Close
(II) UnloadModule: "evdev"
(II) Power Button: Close
(II) UnloadModule: "evdev"
(II) Logitech Optical USB Mouse: Close
(II) UnloadModule: "evdev"
(II) AT Translated Set 2 keyboard: Close
(II) UnloadModule: "evdev"
(II) Macintosh mouse button emulation: Close
(II) UnloadModule: "evdev"
 ddxSigGiveUp: Closing log

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-09:

#84

xorg.conf Edit (1.5 KiB, text/plain)

For completeness, my xorg.conf so you can see details on loaded modules, devices and monitor setup. Just in case you wonder whether I misconfigured my keyboard (which is PS/2) as a Power Button (LOL)

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-09:

#85

Seth:
That is simply wonderful.

One of Jamie's logs from comment 62 says:
(**) Sep 07 22:16:47 NVIDIA(0): Enabling RENDER acceleration
(II) Sep 07 22:16:47 NVIDIA(0): Support for GLX with the Damage and Composite X extensions is
(II) Sep 07 22:16:47 NVIDIA(0): enabled.
NVIDIA: could not open the device file /dev/nvidiactl (No such device or address).
(EE) Sep 07 22:16:47 NVIDIA(0): Failed to initialize the NVIDIA kernel module. Please see the
(EE) Sep 07 22:16:47 NVIDIA(0): system's kernel log for additional error messages and
(EE) Sep 07 22:16:47 NVIDIA(0): consult the NVIDIA README for details.
(EE) NVIDIA(0): *** Aborting ***
(EE) Screen(s) found, but none have a usable configuration.

Fatal server error:
no screens found

Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
Please also check the log file at "/var/log/Xorg.0.log" for additional information.

ddxSigGiveUp: Closing log

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#86

So here is a theory for Seth and Jamie's case:

gdm starts its race with the tty's before the nvidia software is really ready for X.

Plymouth is running and gdm starts up, hand shakes with plymouth. Starts X.

Gdm in the handshake tells plymouth to quit and it does.

X crashes because it is too early for the nvidia driver.

Gdm starts X again.

Plymouth is no longer running so there is no handshake. Since this is no longer the first time an X server has been started, the hack to stop the use of tty1-6 no longer applies. Therefore gdm starts X with no VTx parameter. (How that could happen is what had been bothering me--this explains it, Seth).

@Seth: for more confimation: when you are on tty1 does 'ps aux' show X without the last parameter being VTx?

Since X hasn't been told where to go, and tty1 still hasn't started X starts there, or if it has started and tty2 hasn't X goes the first free one.

X may crash some more, but gdm would limit how many times. There are examples of that on the "interwebs".

tty1 starts and, as the dev said, hears Klingon. Some of that sounds like a login to it, but a wrong one. After several attempts, and usually some minutes, it restarts and sets the ISIG flag. There may be more going on here, too.

An enter sends a QUIT signal to X, or a 2 sends the INT signal (keycode 3, control-c) because ISIG is incorrectly set and raw keyscan codes are being sent.

X crashes.

By now all the tty's are started, so when gdm restarts X, even though there is no VTn parameter at the end of the start command, it goes to VT7 or VT8 (8, I guess) and that isn't shared, so all is well.

My working system gets this message once: "WARNING: GdmDisplay: display lasted 0.262511 seconds", but in his log, Jamie gets two in quick succession at boot time. I found an item on the internet relating messages like that to X crashes.
-------------------
What part of the theory can be contradicted? Confirmed?

Other theories?

@Jamie, can you find more Xorg.0.log's (or, more probably, :m.log.n's in /var/log/gdm) that can help confirm that Seth's find is present in your case? Another ps aux that confirms the absence of the VTn parameter at the end of the started X command? Xorg.0.log.old is at least one generation too old after the Enter crash, so the gdm :n.log.n's are probably where to look for another X log with the No Screens error and Radeon error messages.

Whether not my theory is correct is not, that's a great find, Seth!

So here is a theory for Seth and Jamie's case:

gdm starts its race with the tty's before the nvidia software is really ready for X.

Plymouth is running and gdm starts up, hand shakes with plymouth. Starts X.

Gdm in the handshake tells plymouth to quit and it does.

X crashes because it is too early for the nvidia driver.

Gdm starts X again.

Plymouth is no longer running so there is no handshake. Since this is no longer the first time an X server has been started, the hack to stop the use of tty1-6 no longer applies. Therefore gdm starts X with no VTx parameter. (How that could happen is what had been bothering me--this explains it, Seth).

@Seth: for more confimation: when you are on tty1 does 'ps aux' show X without the last parameter being VTx?

Since X hasn't been told where to go, and tty1 still hasn't started X starts there, or if it has started and tty2 hasn't X goes the first free one.

X may crash some more, but gdm would limit how many times. There are examples of that on the "interwebs".

tty1 starts and, as the dev said, hears Klingon. Some of that sounds like a login to it, but a wrong one. After several attempts, and usually some minutes, it restarts and sets the ISIG flag. There may be more going on here, too.

An enter sends a QUIT signal to X, or a 2 sends the INT signal (keycode 3, control-c) because ISIG is incorrectly set and raw keyscan codes are being sent.

X crashes.

By now all the tty's are started, so when gdm restarts X, even though there is no VTn parameter at the end of the start command, it goes to VT7 or VT8 (8, I guess) and that isn't shared, so all is well.

My working system gets this message once: "WARNING: GdmDisplay: display lasted 0.262511 seconds", but in his log, Jamie gets two in quick succession at boot time. I found an item on the internet relating messages like that to X crashes.
-------------------
What part of the theory can be contradicted? Confirmed?

Other theories?

@Jamie, can you find more Xorg.0.log's (or, more probably, :m.log.n's in /var/log/gdm) that can help confirm that Seth's find is present in your case? Another ps aux that confirms the absence of the VTn parameter at the end of the started X command? Xorg.0.log.old is at least one generation too old after the Enter crash, so the gdm :n.log.n's are probably where to look for another X log with the No Screens error and Radeon error messages.

Whether not my theory is correct is not, that's a great find, Seth!

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-10:

#87

Xorg-logs-no-screens-found.tgz Edit (3.4 KiB, application/x-tar)

@John,

Thanks again for keeping us moving. To the best of my ability/understanding, your theories are sound to me, FWIW! Here's some more info, as requested:

(a) My /var/run/gdm/firstserver.stamp does have a time matching my most recent boot.

(b) I searched all of my /var/log/Xorg.* files for "no screens found" and found 7 files with this string--attached.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#88

@Jamie,

It's my pleasure to work with you all.

Would you mind posting your /var/log/udev file and /var/log/dmesg file? It probably doesn't matter when they are collected relative to when the bug occurs.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-10:

#89

udev Edit (258.4 KiB, text/plain)

@John, my /var/log/udev file...

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-10:

#90

dmesg Edit (58.4 KiB, text/plain)

@John, my /var/log/dmesg file...

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#91

I don't see any errors in the dmesg log and it looks like the nvidia kernel module is ready rather late in that log. Looking at the udev log I would guess that the module isn't loaded any sooner than about .31 seconds before the last entry in the udev log (if I'm reading it correctly).

@Seth:
*IF* the theory hold up, I think your basic approach in https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/532047/comments/91 might make the most sense as a fix to propose. gdm should wait until the graphics system is ready--it's attempting to do that now. What would you think about the gdm job also waiting for udev to finish or udevtrigger to start rather than for the tty's? It might happen sooner than the tty's and it's probably no later than 1/3 second from the earliest it could happen without error, and actually the module might not be ready until even closer to the end of the udev run. I don't notice any other udev event that would necessarily match when the nvidia module is done initializing, much less one that would be emitted with the use of any graphics driver (and modeset/nomodeset). Do you see anything in the udev log that looks like it would work or know of any other trigger? Any other approach? Is this something you would like to experiment with?

Anyone--any other ideas?

Revision history for this message

dino99 (9d9) wrote on 2010-09-10:

#92

hello dear bug fighters,

dont want to make you sad, but all is fine on my end.

@Seth,
i'm very surprised to see such xorg.conf on your system: ok you use a crt and need to add refresh rates (maybe, not sure by the way as i'm also using a crt and nvidia-current detect it without additional settings). Keep in mind that actual kernel deal directly with X, so your xorg.conf looks like the very oldish one and your xorg.0.log might complaint about misconfig, as it fights against kernel timing and logic. You might renamed this xorg.conf and see what happen ( take care if you use non genuine packages)

Revision history for this message

Brian Rogers (brian-rogers) wrote on 2010-09-10:

#93

I'd like to point out that this isn't just an nvidia bug. It was originally filed for an ati card, and I'm seeing the problem with intel graphics. It's most likely a kernel/VT bug.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-10:

#94

Okay, I've tested Seth's patch to /etc/init/gdm.conf from https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/532047/comments/91

I first changed my /etc/default/grub back to basics and ran update-grub (so, a state that would normally cause this bug to occur after boot). I did not notice a faster boot--in fact, the splash image itself seems slightly slower. I watched each of the 5 dots turn red, in succession. In the past, I think this happened in milliseconds, rather than ~2 seconds. After the splash screen, instead of a very brief flash, I saw that same old "GLib-WARNING **: getpwuid_r(): failed due to unknown user id (0)" message on the screen for a full second or so. When I logged in, X was properly on tty7. So, this does appear to be a successful workaround for me as well.

It also confirms the precise time in which, under bug conditions, X would get bounced from tty7 to tty1 or tty2--it's immediately after the splash for me.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#95

Hi all. I'm glad my findings helped.

For me, I learned from your comments: I wasn't even aware of the fact that initctl supplies a VTx parameter to gdm on first launch. That certainly explains how things go haywire _after_ on quick relaunch(es). I will post later with more information about the actual command line params to X.

I'm not fond of my workaround, myself:
(a) it doesn't work well with deviant numbers of vts (especially fewer)
(b) it will on average make boot to desktop take longer than necessary
(c) it doesn't actually address the issue: a pre-start scriptlet containing a stupid little 'sleep 1' would likely do exactly as well. Worst thing is, even with vt's completely initializing, things might still turn out a bit too fast for the nvidia/ATI module to be ready.

I'd say the real fix ought to be based on one of the following ideas:
(A) On X video driver initialization detect when nvidia module is ready (probably doable from with xorg by waiting for it some limited amount of time?). This will make us dependent on upstream for a fix (Xorg or worse: nvidia)
(B) pass the VTx parameter in some other way - a (more) persistent way. We might set a config in /var/run/{plymouth|gdm} - or and read it. There could be a smart algo that checks to see whether that vt is still available and use it if so. Otherwise, all bets are off and best we can do is resort to current behaviour (e.g. when starting secondary X servers: it will find the boot-time configuration parameter (VTx=/dev/tty7) in a file, and detects it is already taken. Then the best thing to do is to fall back to the next available vt, like it does right now).

This system (B) is probably not completely foolproof but it would be my preference:
- it is the smallest possible change to current setup.
- it will work way better and 'explainable' than the current situation
- it should be foolproof as long as upstart passes the right VTx param in the first place (not a vt that might be (accidentally) shared with a getty

A minimal hack would be to keep the upstart scripts largely unchanged, except for _also_ storing the optional VTx param in a file. I think I'll do just that tonight (have to start making dinner now :))

$0.02

Hi all. I'm glad my findings helped.

For me, I learned from your comments: I wasn't even aware of the fact that initctl supplies a VTx parameter to gdm on first launch. That certainly explains how things go haywire _after_ on quick relaunch(es). I will post later with more information about the actual command line params to X.

I'm not fond of my workaround, myself:
(a) it doesn't work well with deviant numbers of vts (especially fewer)
(b) it will on average make boot to desktop take longer than necessary
(c) it doesn't actually address the issue: a pre-start scriptlet containing a stupid little 'sleep 1' would likely do exactly as well. Worst thing is, even with vt's completely initializing, things might still turn out a bit too fast for the nvidia/ATI module to be ready.

I'd say the real fix ought to be based on one of the following ideas:
(A) On X video driver initialization detect when nvidia module is ready (probably doable from with xorg by waiting for it some limited amount of time?). This will make us dependent on upstream for a fix (Xorg or worse: nvidia)
(B) pass the VTx parameter in some other way - a (more) persistent way. We might set a config in /var/run/{plymouth|gdm} - or and read it. There could be a smart algo that checks to see whether that vt is still available and use it if so. Otherwise, all bets are off and best we can do is resort to current behaviour (e.g. when starting secondary X servers: it will find the boot-time configuration parameter (VTx=/dev/tty7) in a file, and detects it is already taken. Then the best thing to do is to fall back to the next available vt, like it does right now).

This system (B) is probably not completely foolproof but it would be my preference:
- it is the smallest possible change to current setup.
- it will work way better and 'explainable' than the current situation
- it should be foolproof as long as upstart passes the right VTx param in the first place (not a vt that might be (accidentally) shared with a getty

A minimal hack would be to keep the upstart scripts largely unchanged, except for _also_ storing the optional VTx param in a file. I think I'll do just that tonight (have to start making dinner now :))

$0.02

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#96

@Brian
This bug was split at about comment 30. We have been working on the case where the X crash happens some time after login because it has been started on tty1. Bilal's case start on tty7 is now being discussed in bug #626723. I feel a little bad that we aren't discussing his situation but that's how others split things up. I hope Bilal's not unhappy.

I don't think the current theory for Jamie and Seth's case of this bug necessarily depends on the nvidia driver being used. It is that the driver, whichever one it is, isn't ready to work with X as soon as the current gdm upstart job's preconditions are met and that breaks the assumptions made in the methods used to prevent tty1 from being fought over. I think a good case could be made for delaying gdm's start until X can be successfully run.

If you have the "start on tty1" situation would you mind attaching the output of ps aux gathered while you are still on tty1 (before the crash)? That would be a good start toward figuring out if you match the pattern we are currently discussing or if there is another case to be addressed.

By the way, we know there are two ways the X crash on Enter or 2 key can happen--X started on tty1 or tty7. There may again be several ways either of those might happen. In particular we don't know here for sure that Jamie and Seth are really the same and don't know the root cause of Dino problems.

------------------------------------------

@Dino
I can't speak for all of your fellow bug fighters, but I'm very happy that your problem is gone. I'm glad you are still with us. Since a race condition is involved, I still really don't know whether the problem you were experiencing was the same thing as Jamie's or not. I can't challenge your assumptions in comment 75. The way this is supposed to work is quite complicated and the config files must be correct to avoid problems.

I hope you will again speak up about your situation if the problem comes back, but I hope it doesn't. I was having the tty1 problem with pre-release lucid last Spring until the Plymouth megafix and it's a real nuisance (I had the tty7 problem sometime during testing, too) my tty1 cases were intermittent, where the problem would happen during one boot and everything would work the next--with no change at all. I spent more than a week researching what was happening until the devs were able to fix it.

Thanks for sticking with us and for continuing to contribute (and for your patient contributions above).
-----------------------------------------------

@Seth
Do you think your xorg.conf file could be involved in the problem?

@Brian
This bug was split at about comment 30. We have been working on the case where the X crash happens some time after login because it has been started on tty1. Bilal's case start on tty7 is now being discussed in bug #626723. I feel a little bad that we aren't discussing his situation but that's how others split things up. I hope Bilal's not unhappy.

I don't think the current theory for Jamie and Seth's case of this bug necessarily depends on the nvidia driver being used. It is that the driver, whichever one it is, isn't ready to work with X as soon as the current gdm upstart job's preconditions are met and that breaks the assumptions made in the methods used to prevent tty1 from being fought over. I think a good case could be made for delaying gdm's start until X can be successfully run.

If you have the "start on tty1" situation would you mind attaching the output of ps aux gathered while you are still on tty1 (before the crash)? That would be a good start toward figuring out if you match the pattern we are currently discussing or if there is another case to be addressed.

By the way, we know there are two ways the X crash on Enter or 2 key can happen--X started on tty1 or tty7. There may again be several ways either of those might happen. In particular we don't know here for sure that Jamie and Seth are really the same and don't know the root cause of Dino problems.

------------------------------------------

@Dino
I can't speak for all of your fellow bug fighters, but I'm very happy that your problem is gone. I'm glad you are still with us. Since a race condition is involved, I still really don't know whether the problem you were experiencing was the same thing as Jamie's or not. I can't challenge your assumptions in comment 75. The way this is supposed to work is quite complicated and the config files must be correct to avoid problems.

I hope you will again speak up about your situation if the problem comes back, but I hope it doesn't. I was having the tty1 problem with pre-release lucid last Spring until the Plymouth megafix and it's a real nuisance (I had the tty7 problem sometime during testing, too) my tty1 cases were intermittent, where the problem would happen during one boot and everything would work the next--with no change at all. I spent more than a week researching what was happening until the devs were able to fix it.

Thanks for sticking with us and for continuing to contribute (and for your patient contributions above).
-----------------------------------------------

@Seth
Do you think your xorg.conf file could be involved in the problem?

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#97

@Gruber
No I don't think my xorg.conf is involved.

Now on the ps -aux output: I can confirm that when X startup is 'sane' (works first time around) then the vt7 param is indeed passed. If X ends up on vt2, it is not:

/root$ ps -ef | grep X | tee -a vts.log
root 1274 1252 0 16:57 tty2 00:00:02 /usr/bin/X :1 -br -verbose -auth /var/run/gdm/auth-for-gdm-nPB6b1/database -nolisten tcp

However, I'd be very interested to found out exactly _where_ the necessary vty param comes from, because I can't find a reference to it from upstart (/etc/init), nor can I see it being passed directly or indirectly to gdm:

`- gdm-binary
| `- gdm-binary
| `- /usr/lib/gdm/gdm-simple-slave --display-id /org/gnome/DisplayManager/Display1
| `- /usr/lib/gdm/gdm-session-worker
| | `- /usr/lib/gdm/gdm-session-worker
| `- /usr/lib/gdm/gdm-simple-slave --display-id /org/gnome/DisplayManager/Display1
| `- /usr/bin/X :0 -br -verbose -auth /var/run/gdm/auth-for-gdm-GED924/database -nolisten tcp vt7

Hope someone can point me in the right direction so I can have a stab at a fix

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#98

ne'er mind I found it in gdm-server.c (05_initial_server_on_vt7.patch)

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#99

Wouldn't it simply appear that /var/run/gdm/firstserver.stamp is simply created prematurely...?

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#100

@Seth:
I don't believe initctl passes a VT parameter. gdm is either starting X with a VT parameter, or not. It determines whether plymouth is running and if it is, and if it has its screen before the user, gdm tries to determine what VT that is and specifies that to X in a VT parameter. If that is ever VT1, etc. we will know that went wrong. If plymouth isn't running or doesn't think it is showing the splash screen to the user, but the /var/run/gdm file is missing, gdm figures this is the first X to start and hardcodes a start to VT7. If none of that is true, it starts X without a VT parameter, assuming that since it isn't the first launch of X all of the tty jobs will have started by then and there won't be a conflict. There's our current problem.

Does that make sense?

Please see the Ubuntu/debian patched gdm files daemon/gdm-simple-slave.c and daemon/gdm-server.c. After getting the source you need to apply the patches in debian/patches by doing a "debian/rules patch" (I always forget).

If gdm isn't restarting it might be able to remember that the /var/run/gdm file was missing when it looked and keep the hardcoded VT7. I don't know whether gdm is restarting after X halts with a non-ready kernel module. That possible fix wouldn't fix gdm and X spinning until the the kernel is ready, however.

The devs understand that the /var/run/gdm/firstserver-stamp approach is a hack. See the comments in the daemon/gdm-server.c gdm_server_start() function. That's where the vt7 hardcode is that's supposed to stop this from happening if plymouth hasn't already led gdm to vt7.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#101

@Jamie:
Great! That's confirmation that gdm is starting too early relative to the tty's. and helps confirm similarity of your situation with Seth's.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-10:

#102

crash-from-X-on-tty7.tgz Edit (6.5 KiB, application/x-tar)

New crash for me, possibly a fluke. I reverted my /etc/default/grub to simply use the plymouth:debug on the boot line as a workaround, and also reverted my /etc/init/gdm.conf (after trying Seth's workaround). My next reboot looked normal, and X was on tty7, but I crashed very quickly, and not when pressing Enter.

Xorg and plymouth-debug.log logs attached.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-10:

#103

plymouth-debug.log Edit (22.0 KiB, text/plain)

Okay, the crash from tty7 happened twice in a row, after reboots, for me. I now see that I'd changed two different variables when I thought the simple "plymouth:debug" was acting as a workaround. I'd also changed these lines in /etc/default/grub, to allow me to catch grub and test boot parameters (rather than having it immediately boot the default)...

This version prevented a crash from occurring:
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=false

This version would allow a crash to occur:
GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true

Regardless of which of the above settings are used, this one seems to always cause X to land on tty7 for me:
GRUB_CMDLINE_LINUX_DEFAULT="plymouth:debug"

The good news is that this has allowed me to capture a plymouth-debug.log file generated at a boot that I suspect will soon result in a crash...

Revision history for this message

dino99 (9d9) wrote on 2010-09-10:

#104

hello guys,

i'm reading your posts with interest, as my system boot and log as it might now, here is grub settings:

GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10

what i see when booting and logging is that hdd activity have not ended when i validate my password login, so i think ureadahead is still working. Can it be a valuable reason of conflict in this gdm/X crash ? Does these crashes mostly happened on the fastest or the lowest systems, with pata/sata/ssd ?

mine is: sata, 8500gt, 4 go

Revision history for this message

dino99 (9d9) wrote on 2010-09-10:

#105

might be fixed due to bug 626723 solution (apache)

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-10:

#106

@dino99: These crashes definitely seem to correlate with the fastest systems. Myself and a few others with the bug have SSD, 8 GB and fast processors.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#107

Download full text (3.4 KiB)

@Jamie: One of the earlier X logs you posted had an error that ended like that--without giving any backtrace.

I see you are running apache. Does Dino's have your fix?

Sounds like our fraternal twin bug may have been fixed since Bilal confirmed it is gone. Great catch, Dino. It also sounds like vt7 is as fragile a place to start X as vt1 or vt2, unfortunately.

I have a suggestion for helping us determine how different solutions to both of our currently known problems (X crashes when kernel module is not ready, and early X crashes cause X to start on tty1-6) might delay the login screen, but I'm not sure Jamie's computer is in a position to run them. I hope the apache fix does the trick.

My suggestion would be to add init jobs that would run the logger command when current gdm start conditions are met, when tty starting conditions are met, when udev ends, when nvidia is modprobed, and other events others think might be significant. I don't know how long we need to delay gdm's start, or delay gdm's restart of X, which might be another approach.

@Seth:
I'm still trying to understand your thoughts in your solution (B). If I understand, you thought init was telling gdm that it was boot time and start on vt7. init runs mountall which mounts a tempfs (like on /tmp) on /var/run. That wipes out the last boot stamp file and gdm looks for it when it runs without direction from plymouth. So init doesn't explictly tell gdm to run on vt7, or even that it is boot time, but that is the point of a file in the /var/run file system. (The file doesn't really get rm'd, it only existed in RAM during the last boot, but the tmpfs at /var/run keeps it from being kept).

But it doesn't have to work that way, of course. Your approach sounds good to me. init jobs could explicitly create a booting-done file and gdm could look for that booting-done file in /var/run. I don't know whether is should specify a vt or whether that should be local to plymouth, gdm, and X. Is that the kind of thing you are talking about? I agree with you (A) sounds good, but is probably impossible to do, particularly between now and 10/10.

Because of the 10/10 Maverick release, small, contained, fixes are probably best. And I assume the release team won't want to slow down boot more than necessary and avoid blinking the display unnecessarily.

Another, minimal approach, might be to find where in gdm X is ready and create the stamp file there rather than on initial start so retries go to the same place. To address the looping problem, maybe after a failure gdm could sleep for a few seconds before trying X again. Just that change might address both problems and the first part of it is very like what is in place now. In effect gdm would be polling X and it would be polling the kernel.

I like both the upstart job and the gdm modification ideas.

The perfect fix where upstart is being used to minimize start time would be for nvidia (and intel and ati) to emit a signal to init so it could delay gdm until the necessary kernel driver is ready. But either every driver would need to do that, or the gdm jobs would have to change with the driver, or there would have to be several gdm scripts, each tes...

@Jamie: One of the earlier X logs you posted had an error that ended like that--without giving any backtrace.

I see you are running apache. Does Dino's have your fix?

Sounds like our fraternal twin bug may have been fixed since Bilal confirmed it is gone. Great catch, Dino. It also sounds like vt7 is as fragile a place to start X as vt1 or vt2, unfortunately.

I have a suggestion for helping us determine how different solutions to both of our currently known problems (X crashes when kernel module is not ready, and early X crashes cause X to start on tty1-6) might delay the login screen, but I'm not sure Jamie's computer is in a position to run them. I hope the apache fix does the trick.

My suggestion would be to add init jobs that would run the logger command when current gdm start conditions are met, when tty starting conditions are met, when udev ends, when nvidia is modprobed, and other events others think might be significant. I don't know how long we need to delay gdm's start, or delay gdm's restart of X, which might be another approach.

@Seth:
I'm still trying to understand your thoughts in your solution (B). If I understand, you thought init was telling gdm that it was boot time and start on vt7. init runs mountall which mounts a tempfs (like on /tmp) on /var/run. That wipes out the last boot stamp file and gdm looks for it when it runs without direction from plymouth. So init doesn't explictly tell gdm to run on vt7, or even that it is boot time, but that is the point of a file in the /var/run file system. (The file doesn't really get rm'd, it only existed in RAM during the last boot, but the tmpfs at /var/run keeps it from being kept).

But it doesn't have to work that way, of course. Your approach sounds good to me. init jobs could explicitly create a booting-done file and gdm could look for that booting-done file in /var/run. I don't know whether is should specify a vt or whether that should be local to plymouth, gdm, and X. Is that the kind of thing you are talking about? I agree with you (A) sounds good, but is probably impossible to do, particularly between now and 10/10.

Because of the 10/10 Maverick release, small, contained, fixes are probably best. And I assume the release team won't want to slow down boot more than necessary and avoid blinking the display unnecessarily.

Another, minimal approach, might be to find where in gdm X is ready and create the stamp file there rather than on initial start so retries go to the same place. To address the looping problem, maybe after a failure gdm could sleep for a few seconds before trying X again. Just that change might address both problems and the first part of it is very like what is in place now. In effect gdm would be polling X and it would be polling the kernel.

I like both the upstart job and the gdm modification ideas.

The perfect fix where upstart is being used to minimize start time would be for nvidia (and intel and ati) to emit a signal to init so it could delay gdm until the necessary kernel driver is ready. But either every driver would need to do that, or the gdm jobs would have to change with the driver, or there would have to be several gdm scripts, each testing to see if it is the one responsible job to start gdm for the graphics driver configuration in use. Maybe the multiple jobs could be done, but I wouldn't like to see that tried this late in the release cycle.

Just joining in brainstorming. What do you think about temporary init jobs to measure time between events?

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#108

@Bilal:
Congratulations.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#109

proposed test patch Edit (1.3 KiB, text/plain)

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#110

@Gruber: duh... our posts crossed. Consider my patch a response :)

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#111

Dammit! Launchpad ate my comment with #109... Ok, typing it up again

I think I have a fix that I like. I added a debian patch (40_initial_server_on_vt7-improve-timing.patch see #109)
What happens is that I moved the creation of the signal file (/var/run/gdm/firstserver.stamp) to the emit_ready_idle handler. This appears to work on my system:

Having seen some of the fastest boots, X always landed on vt7 safely. Also, vt7 is on the commandline parameters.
I have seen evidence in Xorg.*.log that this appeared to work even though I had four X session launch attempts within about a second (the first failing because of nvidia not being around).

I have uploaded a snapshot package to a temp ppa for any interested testers: https://launchpad.net/~bugs-sehe/+archive/gdm625239. Please allow some time for the build server to complete the binaries

Note: I'm not a debian dev, I don't know about their codeing standards. I leave it to you to review/incorporate any of this as you like. I haven't actually even studied the code mouch to explain whether/why this should work, I just tried and it seemed to do the trick :)

Cheers
Seth

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-10:

#112

@John: Yes, I'm running apache. I don't see a way for me to test Dino's bug 626723 on my Lucid machine. That bug appears to have a proposed fix for Maverick. Also, the suggested workaround doesn't make sense on Lucid--there is no "stty sane" line in my /etc/init.d/apache2.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#113

@Jamie: Rats.

We don't need the plymouth:debug anymore. Would deleting it help? Can you simply go back to where you were last night when you were getting logs (albeit with the tty1 bug). If you can, maybe you can try Seth's new gdm fix.

modeset and nomodeset are other possible things to try (I can't tell from the log what it is doing, and I don't even know if the nvidia driver supports kernel mode setting).

In grub the linux cmdline parameters matter, but nothing else should. Grub should be loading the same kernel and initramfs into memory and passing control to linux. The linux command line is the only way to tell the kernel it just loaded to do anything different. The other parameters just tell grub how to behave before it loads the kernel.

The xorg.0.log doesn't help. I can't tell if it is the same kind of problem as in this and bug #626723 or not since it just says it's going to give a backtrace and then doesn't.

A dmesg or syslog from when the crash occurs may contain more information, you might look there. Attach them if you want us to take a look. Is there an X log in something like :n.log.n that shows the error with more output? It's one thing after another, I'm afraid.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#114

@Seth:
I just saw that I missed several of your messages above. In particular 98 and 99. I'm sorry. Before I say it you coded it and had it building on launchpad. Great.

I see no problem with your code. I built your change here and am running it successfully. Since I didn't see the bug I can't say any more.

You said: > I have seen evidence in Xorg.*.log that this appeared to work even though I had four X session launch attempts within about a second (the first failing because of nvidia not being around).

daemon/gdm-local-display-factory.c:#define MAX_DISPLAY_FAILURES 5

Sounds like you are close to the limit. Do you think there was another reason for the other 4? I do get one message about a session not lasting long, but not two or more like you and Jamie. I assume that's normal but I don't know.

Even if we go no farther here I want to note in the record that driver failures are happening because X is being started prematurely.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#115

> Sounds like you are close to the limit
Yeah in retrospect I can't be to sure I haven't been counting logs from across reboots there; I had been rebooting like hell before it dawned on me that I would need to look at the logfiles to find out whether the 'early X failure due to nvidia not ready' was happening at all, because, well obviously my X desktop kept coming up on vt7 all the time...

I attach a later run that I _did_ save so you can see how that looked; this too ended up happily on vt7

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-10:

#116

In syslog you might find messages like:
gdm_binary[2542]WARNING: GdmDisplay: display lasted 0.945450 seconds
Those are the ones Jamie has coming out quickly. I only get one when I start. It may be easier to look for those than to look at the log files.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#117

Thanks for the hint. Very timely.

I just learned that some things are best not kept on tmpfs, especially when doing frequent reboots <whistle/>

I sort of forced the issue just to prove that the mechanism works as advertised by doing

chvt 1
stop gdm; sleep 1; rm -fv /var/log/Xorg.*

# emulate a fresh boot...
rm /var/run/gdm/firstserver.stamp
rmmod nvidia
lsmod | grep nv # verified empty
modprobe nvidia-current & start gdm

resulting X session is on vt7 _and_ has that specified on the commandline explicitely, showing that the patch was effective by preventing the firstserver.stamp to be created before successful launch of X

Attached are the logs from this particular run

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#118

"synthetic" retest of gdm X respawn on nvidia not ready Edit (3.9 KiB, application/x-tar)

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#119

The timestamps from those logfiles (#118) were

root@lucid:/tmp# ls -ltr --full-time var/log/
total 32
-rw-r--r-- 1 root root 6399 2010-09-11 00:14:15.000000000 +0200 Xorg.1.log
-rw-r--r-- 1 root root 6399 2010-09-11 00:14:15.000000000 +0200 Xorg.0.log
-rw-r--r-- 1 root root 13404 2010-09-11 00:14:52.000000000 +0200 Xorg.2.log

Indeed syslog shows the message you hinted me to. Note that only the messages from past midnight (9/11) include my synthetic regression test. I was running out of patience rebooting...

Sep 10 00:10:42 localhost gdm-binary[1129]: WARNING: GdmDisplay: display lasted 0.539277 seconds
Sep 10 00:10:42 localhost gdm-binary[1129]: WARNING: GdmDisplay: display lasted 0.157422 seconds
Sep 10 00:14:39 localhost gdm-binary[1134]: WARNING: GdmDisplay: display lasted 0.518092 seconds
Sep 10 00:14:39 localhost gdm-binary[1134]: WARNING: GdmDisplay: display lasted 0.152144 seconds
Sep 10 00:40:01 localhost gdm-binary[1049]: WARNING: GdmDisplay: display lasted 0.544773 seconds
Sep 10 00:40:01 localhost gdm-binary[1049]: WARNING: GdmDisplay: display lasted 0.158739 seconds
Sep 10 01:14:29 localhost gdm-binary[1116]: WARNING: GdmDisplay: display lasted 0.520776 seconds
Sep 10 10:06:29 localhost gdm-binary[1125]: WARNING: GdmDisplay: display lasted 0.517127 seconds
Sep 10 16:57:49 localhost gdm-binary[1056]: WARNING: GdmDisplay: display lasted 0.519798 seconds
Sep 10 20:24:25 localhost gdm-binary[973]: WARNING: GdmDisplay: display lasted 0.550640 seconds
Sep 10 20:24:25 localhost gdm-binary[973]: WARNING: GdmDisplay: display lasted 0.169691 seconds
Sep 10 22:05:09 localhost gdm-binary[891]: WARNING: GdmDisplay: display lasted 0.775847 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.128627 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.129041 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.128683 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.128944 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.128686 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.127977 seconds
Sep 11 00:13:34 localhost gdm-binary[2782]: WARNING: GdmDisplay: display lasted 0.654594 seconds
Sep 11 00:13:34 localhost gdm-binary[2782]: WARNING: GdmDisplay: display lasted 0.137125 seconds
Sep 11 00:14:15 localhost gdm-binary[2887]: WARNING: GdmDisplay: display lasted 0.652483 seconds
Sep 11 00:14:15 localhost gdm-binary[2887]: WARNING: GdmDisplay: display lasted 0.135985 seconds

The timestamps from those logfiles (#118) were

root@lucid:/tmp# ls -ltr --full-time var/log/
total 32
-rw-r--r-- 1 root root  6399 2010-09-11 00:14:15.000000000 +0200 Xorg.1.log
-rw-r--r-- 1 root root  6399 2010-09-11 00:14:15.000000000 +0200 Xorg.0.log
-rw-r--r-- 1 root root 13404 2010-09-11 00:14:52.000000000 +0200 Xorg.2.log

Indeed syslog shows the message you hinted me to. Note that only the messages from past midnight (9/11) include my synthetic regression test. I was running out of patience rebooting...

Sep 10 00:10:42 localhost gdm-binary[1129]: WARNING: GdmDisplay: display lasted 0.539277 seconds
Sep 10 00:10:42 localhost gdm-binary[1129]: WARNING: GdmDisplay: display lasted 0.157422 seconds
Sep 10 00:14:39 localhost gdm-binary[1134]: WARNING: GdmDisplay: display lasted 0.518092 seconds
Sep 10 00:14:39 localhost gdm-binary[1134]: WARNING: GdmDisplay: display lasted 0.152144 seconds
Sep 10 00:40:01 localhost gdm-binary[1049]: WARNING: GdmDisplay: display lasted 0.544773 seconds
Sep 10 00:40:01 localhost gdm-binary[1049]: WARNING: GdmDisplay: display lasted 0.158739 seconds
Sep 10 01:14:29 localhost gdm-binary[1116]: WARNING: GdmDisplay: display lasted 0.520776 seconds
Sep 10 10:06:29 localhost gdm-binary[1125]: WARNING: GdmDisplay: display lasted 0.517127 seconds
Sep 10 16:57:49 localhost gdm-binary[1056]: WARNING: GdmDisplay: display lasted 0.519798 seconds
Sep 10 20:24:25 localhost gdm-binary[973]: WARNING: GdmDisplay: display lasted 0.550640 seconds
Sep 10 20:24:25 localhost gdm-binary[973]: WARNING: GdmDisplay: display lasted 0.169691 seconds
Sep 10 22:05:09 localhost gdm-binary[891]: WARNING: GdmDisplay: display lasted 0.775847 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.128627 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.129041 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.128683 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.128944 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.128686 seconds
Sep 11 00:13:02 localhost gdm-binary[2676]: WARNING: GdmDisplay: display lasted 0.127977 seconds
Sep 11 00:13:34 localhost gdm-binary[2782]: WARNING: GdmDisplay: display lasted 0.654594 seconds
Sep 11 00:13:34 localhost gdm-binary[2782]: WARNING: GdmDisplay: display lasted 0.137125 seconds
Sep 11 00:14:15 localhost gdm-binary[2887]: WARNING: GdmDisplay: display lasted 0.652483 seconds
Sep 11 00:14:15 localhost gdm-binary[2887]: WARNING: GdmDisplay: display lasted 0.135985 seconds

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-10:

#120

Oh PS. all, because I'm starting to feel this workaround is looking pretty good[1] in my own (humble) opinion - functionally - I wish to point out that this is probably best integrated directly into an altered 05_initial_server_on_vt7.patch instead of adding another patch.

My patch (40_initial_server_on_vt7-improve-timing.patch) was merely intended as a /quick/ test of things and does _nothing_ more than move roughly half of the code blocks inserted by 05_initial_server_on_vt7.patch into a different location. I don't know who is entitled to do such (Sebastian Bacher?), also I probably have no commit access anywhere anyway, so manual integration of the patch into official ubuntu (debian? I'm such a noob!) is probably the best way anyway...

Cheers, I'm off to bed now!

[1] strange how a synthesized test can build more confidence in a programmer, even though a reallife test is obviously worth more...; I think it has to do with (a) having control (b) checking your thoughts doubly along the way - but I digress

Brian Murray (brian-murray) on 2010-09-11

tags:

added: patch

John S. Gruber (jsjgruber) on 2010-09-11

description:	updated
summary:	- X starts on wrong tty: pressing enter after 5 minutes crashes X + X starts on wrong tty: pressing enter after 5 minutes crashes X; gdm + restart of X involved

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-14: Re: X starts on wrong tty: pressing enter after 5 minutes crashes X; gdm restart of X involved

#121

Jamie,

> Okay, I've done some experimenting with my linux command options. If I
> have vga=799 on there, I boot to tty7 every time. With or without the "quite"
> and/or "splash" options, I see the same results. However, when I tried
> vga=771, I end up on tty1 or tty2 again.

Please show the contents of a plymouth:debug log, and /proc/fb, when booting with vga=799 (and without grub_gfxmode set).

Seth, Jamie, can one of you also show the output of 'grep fb /var/log/kern.log' on a system showing this problem?

Kudos to John and Seth for the great debugging work on this. It took me quite a while to read through the full bug, but I think there's ample evidence here to support reassigning this to gdm and the nvidia drivers.

I would like to know why the gdm upstart job's start condition is insufficient here; for reference, the start condition is:

start on (filesystem
          and started dbus
          and (graphics-device-added fb0 PRIMARY_DEVICE_FOR_DISPLAY=1
               or drm-device-added card0 PRIMARY_DEVICE_FOR_DISPLAY=1
               or stopped udevtrigger))

It's possible this is wrong for nvidia because /dev/fb0 is a vga (or vesa) framebuffer instead of the nvidia one, so gdm goes ahead and starts before the nvidia driver is ready. Alternatively, it's possible that the udev event happens before the card is actually ready (which would be a bug in the nvidia driver). Jamie, is the udev log file attached to comment 89 from a boot with or without the vga=799 workaround, and with or without an X crash on startup?

You might also be able to confirm this is the problem by commenting out the "graphics-device-added fb0" part of the start condition.

summary:

- X starts on wrong tty: pressing enter after 5 minutes crashes X; gdm
- restart of X involved
+ X starts on wrong tty because gdm starts before nvidia driver is ready

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-14:

#122

has anyone reproduced this nvidia+gdm crash on maverick? Marking incomplete for maverick pending confirmation.

description:	updated
affects:	plymouth (Ubuntu Maverick) → gdm (Ubuntu Maverick)
affects:	xserver-xorg-video-ati (Ubuntu Maverick) → nvidia-graphics-drivers (Ubuntu Maverick)
Changed in nvidia-graphics-drivers (Ubuntu Maverick):
status:	Invalid → Confirmed
description:	updated
tags:	added: lucid removed: maverick
Changed in nvidia-graphics-drivers (Ubuntu Lucid):
status:	New → Confirmed
Changed in gdm (Ubuntu Maverick):
status:	Confirmed → New
Changed in nvidia-graphics-drivers (Ubuntu Maverick):
status:	Confirmed → Incomplete
Changed in gdm (Ubuntu Maverick):
status:	New → Incomplete
Changed in gdm (Ubuntu Lucid):
status:	New → Confirmed
Changed in nvidia-graphics-drivers (Ubuntu Maverick):
importance:	Medium → High

Revision history for this message

Bilal Akhtar (bilalakhtar) wrote on 2010-09-14:

#123

BTW, the maverick problem that I reported was actually affecting the apache2 package (bug #626723), which was setting incorrect stty flags for the tty* which would make X crash on pressing the enter key.

I just wanted to tell you people to see if it helps in some way.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#124

@Steve

> Seth, Jamie, can one of you also show the output of 'grep fb /var/log/kern.log' on a system showing this problem?

can try, but I'll have to downgrade from my own ppa version first so be patient :) (tomorrow night perhaps)

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#125

@Steve

PS. the description no longer reflects that people with ATI cards have reported this issue (have I misread the comments or did you miss that?)

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#126

vga-799-plymouth-debug-no-grub_gfxmode.tar.gz Edit (35.4 KiB, application/x-tar)

Steve,

> Please show the contents of a plymouth:debug log, and /proc/fb, when
> booting with vga=799 (and without grub_gfxmode set).

Attaching /var/log/udev, /var/log/plymouth-debug.log and /proc/fb after boot with no GRUB_GFXMODE set and:
GRUB_CMDLINE_LINUX_DEFAULT="vga=799 plymouth:debug"
GRUB_CMDLINE_LINUX=""

Keep in mind that the above generally behaves as a workaround for me. I did boot with X on VT7 this time.

> Jamie, is the udev log file
> attached to comment 89 from a boot with or without the vga=799
> workaround, and with or without an X crash on startup?

I believe the log attached to 89 was without the vga=799, but I can't be absolutely certain, sorry. Instead, I probably had the plymouth:debug in there, which also turned out to be a workaround. When I had vga=799 in there, I don't believe I ever had a crash.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#127

kern.log-grep-fb.out Edit (3.4 KiB, text/plain)

Steve,

> Seth, Jamie, can one of you also show the output of 'grep fb
> /var/log/kern.log' on a system showing this problem?

Attached, after boot with standard configuration (no workarounds or related custom config at all).

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#128

Steve,
I meant to confirm that the boot that produced the output file attached to comment #127 did indeed land me with X on vt2, and I did crash, as expected, after a few minutes, when pressing enter in Terminal.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#129

kern_snap.log Edit (75.6 KiB, text/plain)

Here is my complete kernel.log since last reboot, just after downgrading to the regular gdm package for lucid (no workarounds).
You can do your own greps if you wish :)

-------------------------------------------------------------------------------
Now if someone is ever interested in more details, what follows is a prosaic account of my last session 'with the issue'. This is hardly new information just more detail. It also serves as evidence that the attached log file is relevant as the symptoms are seen.
-------------------------------------------------------------------------------

The boot went fairly quickly, and thus, expectedly, X and gdm landed on stty2. I then went for coffee - nothing happens. I subsequently read my mail, nothing happened, I harvested this log and started typing. That's when the signs start to appear!

I had the following watch running:

Every 2.0s: date; date -r /var/run/gdm/firstserver.stamp; stty -F /dev/tty2 Tue Sep 14 16:36:25 2010

   Tue Sep 14 16:36:25 CEST 2010
   Tue Sep 14 15:55:16 CEST 2010
   speed 38400 baud; line = 0;
   -brkint ixoff -imaxbel
   -isig -iexten -echo

After about 50 keystrokes in succession (approx!) the scroll lock light came on. I froze, because I knew this was the signal that the vt is about to crash X on reception of certain scan-codes. I was sly enough quietly slip into vt1 and do:

root@lucid:~# date; while true; do sleep .5; stty -F /dev/tty2 -isig; done
Tue Sep 14 16:28:05 CEST 2010

You can see how that would provide me with (racy) immunization against terminating the X session.
I'm still on the same terminal/X session now at 16:53. Every so often, the scrolllock light will toggle on or off (keys must be pressed for it to happen but there doesn't seem to be a pattern in timing, sometimes quite rapidly in succession).

The terminal 'flags' do change every now and then, but not in synch with the scrolllock light. Note again how I do not normally have a functioning scrolllock key/light (not even when X is on a separate vt7 without any other confusion).

Approx. halft the time the output of 'stty -F /dev/tty2 'is

    speed 38400 baud; line = 0;
    -brkint ixoff -imaxbel
    -isig -iexten -echo

Once in a while the output changes to the following for some time:

    speed 38400 baud; line = 0;
    min = 1; time = 0;
    -brkint -icrnl -imaxbel
    -opost -onlcr
    -isig -icanon -iexten -echo -echoe -echok -echoctl -echoke

Note that this is when I'm keeping -isig artificially 'low' by using the background loop running on vt1. I wouldn't be typing this otherwise :)
The symptoms went away when I issued 'stop tty2'
The time is 16:59 now

Here is my complete kernel.log since last reboot, just after downgrading to the regular gdm package for lucid (no workarounds). 
You can do your own greps if you wish :)

-------------------------------------------------------------------------------
Now if someone is ever interested in more details, what follows is a prosaic account of my last session 'with the issue'. This is hardly new information just more detail. It also serves as evidence that the attached log file is relevant as the symptoms are seen.
-------------------------------------------------------------------------------

The boot went fairly quickly, and thus, expectedly, X and gdm landed on stty2. I then went for coffee - nothing happens. I subsequently read my mail, nothing happened, I harvested this log and started typing. That's when the signs start to appear!

I had the following watch running:

Every 2.0s: date; date -r /var/run/gdm/firstserver.stamp; stty -F /dev/tty2                            Tue Sep 14 16:36:25 2010

Tue Sep 14 16:36:25 CEST 2010
   Tue Sep 14 15:55:16 CEST 2010
   speed 38400 baud; line = 0;
   -brkint ixoff -imaxbel
   -isig -iexten -echo

After about 50 keystrokes in succession (approx!) the scroll lock light came on. I froze, because I knew this was the signal that the vt is about to crash X on reception of certain scan-codes. I was sly enough quietly slip into vt1 and do:

root@lucid:~# date; while true; do sleep .5; stty -F /dev/tty2 -isig; done
Tue Sep 14 16:28:05 CEST 2010

You can see how that would provide me with (racy) immunization against terminating the X session.
I'm still on the same terminal/X session now at 16:53. Every so often, the scrolllock light will toggle on or off (keys must be pressed for it to happen but there doesn't seem to be a pattern in timing, sometimes quite rapidly in succession).

The terminal 'flags' do change every now and then, but not in synch with the scrolllock light. Note again how I do not normally have a functioning scrolllock key/light (not even when X is on a separate vt7 without any other confusion).

Approx. halft the time the output of 'stty -F /dev/tty2 'is

speed 38400 baud; line = 0;
    -brkint ixoff -imaxbel
    -isig -iexten -echo

Once in a while the output changes to the following for some time:

speed 38400 baud; line = 0;
    min = 1; time = 0;
    -brkint -icrnl -imaxbel
    -opost -onlcr
    -isig -icanon -iexten -echo -echoe -echok -echoctl -echoke

Note that this is when I'm keeping -isig artificially 'low' by using the background loop running on vt1. I wouldn't be typing this otherwise :)
The symptoms went away when I issued 'stop tty2' 
The time is 16:59 now

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#130

Steve,

> You might also be able to confirm this is the problem by commenting out
> the "graphics-device-added fb0" part of the start condition.

Okay, testing with standard disto version of /etc/default/grub and the following in my /etc/init/gdm.conf:

start on (filesystem
          and started dbus
          and (drm-device-added card0 PRIMARY_DEVICE_FOR_DISPLAY=1
               or stopped udevtrigger))
stop on runlevel [016]

Boot seemed to be fairly quick and I did NOT notice the flicker I used to see after the splash screen. It was a pretty smooth transition to the GDM login screen, and I'm on tty7. I've also been working for 40 minutes now without a crash, which seems promising. Are there any logs you'd like me to report with this configuration/boot?

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14: Re: [Bug 625239] Re: X starts on wrong tty because gdm starts before nvidia driver is ready

#131

On 09/14/2010 05:09 PM, Jamie Krug wrote:
> Steve,
>
>
>> You might also be able to confirm this is the problem by commenting out
>> the "graphics-device-added fb0" part of the start condition.
>>
> Okay, testing with standard disto version of /etc/default/grub and the
> following in my /etc/init/gdm.conf:
>
> start on (filesystem
> and started dbus
> and (drm-device-added card0 PRIMARY_DEVICE_FOR_DISPLAY=1
> or stopped udevtrigger))
> stop on runlevel [016]
>
> Boot seemed to be fairly quick and I did NOT notice the flicker I used
> to see after the splash screen. It was a pretty smooth transition to the
> GDM login screen, and I'm on tty7. I've also been working for 40 minutes
> now without a crash, which seems promising. Are there any logs you'd
> like me to report with this configuration/boot?
>
>
So let me get this straight, you just removed the "graphics-device-added
fb0 PRIMARY_DEVICE_FOR_DISPLAY=1" disjunction?
I might try that too. This could be brittle, because all this could do
is reduce the time window for the race condition to happen. I will
however give that a shot so I can confirm whether it would help at my
system too.

Seth

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#132

Seth,

> So let me get this straight, you just removed the "graphics-device-added
> fb0 PRIMARY_DEVICE_FOR_DISPLAY=1" disjunction?

Yes, that is exactly what I did, and nothing else.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#133

Steve, Seth,

The "graphics-device-added fb0 PRIMARY_DEVICE_FOR_DISPLAY=1" removal tweak is not a guaranteed workaround after all. I just booted up again and crashed within a minute of use. I was just waiting for a Chrome tab to open--no keyboard activity this time.

Revision history for this message

Claudio Moretti (flyingstar16) wrote on 2010-09-14:

#134

On Tue, Sep 14, 2010 at 07:58, Steve Langasek
<email address hidden>wrote:

> has anyone reproduced this nvidia+gdm crash on maverick? Marking
> incomplete for maverick pending confirmation.
>
On my system, this bug was fixed.

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-14:

#135

On Tue, Sep 14, 2010 at 08:21:09AM -0000, Seth wrote:
> PS. the description no longer reflects that people with ATI cards have
> reported this issue (have I misread the comments or did you miss that?)

The only comments in this report from someone seeing issues on ATI are from
the original submitter, who was experiencing the apache2 stty bug. Brian
claims in comment #93 that he's havng "the problem" on intel graphics, but
offers no concrete information in support of this.

All the evidence points to an nvidia-specific race condition.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#136

On 09/14/2010 06:47 PM, Steve Langasek wrote:
> On Tue, Sep 14, 2010 at 08:21:09AM -0000, Seth wrote:
>
>> PS. the description no longer reflects that people with ATI cards have
>> reported this issue (have I misread the comments or did you miss that?)
>>
> The only comments in this report from someone seeing issues on ATI are from
> the original submitter, who was experiencing the apache2 stty bug. Brian
> claims in comment #93 that he's havng "the problem" on intel graphics, but
> offers no concrete information in support of this.
>
> All the evidence points to an nvidia-specific race condition.
>
>

Ok thanks for straightening that out

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#137

On 09/14/2010 05:49 PM, Jamie Krug wrote:
> no keyboard activity this time.
>
Erm... perhaps this is a different bug? Did you inspect the log(s) to
see what aborted X?

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#138

I seem to have discovered a slightly new style of crashes, which I believe have only occurred when I have a modified /etc/init/gdm.conf file. Yesterday and today I've witnessed this new crash behavior with both:
* Seth's /etc/init/gdm.conf workaround from https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/532047/comments/91
* Steve's suggested test that I note in comment #130 (which didn't crash immediately, but did upon a second boot)

These are the characteristics of my "new" crash experience:
* Boots with X on tty7 and seems fine for a while.
* Crashes without keyboard cause (sometimes using mouse/click, one time doing nothing but waiting for IntelliJ IDEA to open).
* Crash to login screen, then X on tty8.
* After crash and relogin to tty8, usually another crash very quickly (again, not related to keyboard activity), which then moves X to tty9.
* After a crash, I'm unable to switch to any other tty--all show completely blank screen.

My original issue, which I can reproduce any time by simply keeping the original /etc/init/gdm.conf and /etc/default/grub in place, was unique to the above in the following ways:
* First boot always to tty1 or tty2 (usually tty2).
* Always crashed within an hour of working, when pressing Enter.
* Always on tty8 after second login, and never crashed again until a reboot.
* Both before and after a crash, I was still able to switch to other ttys with no problem.

The one workaround that seems to always work, to the best of my knowledge, is when I only add "vga=799" to the linux boot line, and change nothing else. I'm pretty sure I went days without a crash with Seth's gdm.conf tweak, so it was odd when it suddenly started crashing like crazy yesterday.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#139

On 09/14/2010 07:50 PM, Jamie Krug wrote:
> I seem to have discovered a slightly new style of crashes, which I believe have only occurred when I have a modified /etc/init/gdm.conf file.

May I suggest _not_ tweaking your gdm.conf in that particular way then?

> change nothing else. I'm pretty sure I went days without a crash with
> Seth's gdm.conf tweak, so it was odd when it suddenly started crashing
> like crazy yesterday.
>
>
Perhaps you did not notice, but I have posted a fix that works for me.
It is a patch against gdm and the package is available from
https://launchpad.net/~bugs-sehe/+archive/gdm625239

To test: simply

    sudo add-apt-repository "deb
    http://ppa.launchpad.net/bugs-sehe/gdm625239/ubuntu lucid main"
    sudo apt-get update
    sudo apt-get install --reinstall gdm

To revert:

    sudo sed -i '/bugs-sehe\/gdm625239/d' /etc/apt/sources.list
    sudo apt-get update
    sudo apt-get install --reinstall gdm

If apt-get baulks at the reinstall step (there seems to be a bug hidden
there), you might use synaptic 'Package/Force Version... (Ctrl+E)' menu
item to select the desired version. Of course you can always simply
'apt-get remove gdm; apt-get install ubuntu-desktop' but I personally
hate the big-gun approach :)

I'm very curious to know whether it'll fix your situation as well

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#140

I just noticed there is a slightly cleaner approach to the source manipulation fu:

sudo apt-add-repository ppa:bugs-sehe/gdm625239

and vice-versa:

sudo rm -fv /etc/apt/sources.list.d/bugs-sehe-gdm625239-lucid.list

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#141

Seth,

> May I suggest _not_ tweaking your gdm.conf in that particular way then?

Yes :) I did so only as a workaround, but obviously that was not stable. I also just discovered that the same crashes can now occur even with a untouched gdm.conf. Possibly a different bug, sure, but I hope not!?!

> Perhaps you did not notice, but I have posted a fix that works for me.
> It is a patch against gdm and the package is available from
> https://launchpad.net/~bugs-sehe/+archive/gdm625239

I had not noticed you mention a fix, but not realize it was made available in your PPA. Thank you very much for the detailed instructions--that's what I needed.

BTW, I just disabled my nVidia drivers, so I could at least post an update without crashing, and I noticed that my boot was notably MUCH faster. I'll be curious to see if I notice the same with your fix. Off to apply your patch. Thanks!

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#142

force.sh Edit (350 bytes, application/x-sh; name="force.sh")

On 09/14/2010 05:19 PM, Seth wrote:
> This could be brittle, because all this could do
> is reduce the time window for the race condition to happen
I can confirm my suspicion to be right because after - granted: many-
some reboots I saw the log message again in successive reboots:

    /var/log/syslog:Sep 14 20:38:39 localhost gdm-binary[984]: WARNING:
    GdmDisplay: display lasted 0.695752 seconds
    /var/log/syslog:Sep 14 20:42:04 localhost gdm-binary[1060]: WARNING:
    GdmDisplay: display lasted 0.198205 seconds

This was running _with_ my fixed gdm package, but from the analysis we
know that in this case the X desktop would have ended up shared with a
getty. If you are interested, you can easily verify for yourself how the
"nvidia not ready" condition will generate this message by using the
attached reproduction script:

    #!/bin/bash
    stop gdm
    sleep 2
    pgrep -l X && pkill X
    rm -fv /var/run/gdm/firstserver.stamp
    while lsmod | grep -q nvidia
    do
        echo unloading removing nvidia module
        rmmod nvidia
        sleep 1
    done
    lsmod | grep nv | nl
    {
        modprobe nvidia-current &
        start gdm &
        wait
    }
    sleep 5; chvt 1
    grep "display lasted" /var/log/syslog | tail
    date +"Time is now: %D %T"

Such a script, unfortunately, is not suitable for testing upstart
mutations (because the other subsystems interacting are not being
restarted; only a boot will give the right order and timing for these...)

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#143

On 09/14/2010 09:28 PM, Jamie Krug wrote:
> BTW, I just disabled my nVidia drivers, so I could at least post an
> update without crashing, and I noticed that my boot was notably MUCH
> faster. I'll be curious to see if I notice the same with your fix. Off
> to apply your patch. Thanks!
>
If you are desparate for a workaround, the following should work, just
append it to rc.local (before exit 0...)

    gdmtty="/dev/$(ps --no-heading -o tty -p $(pgrep X))"
    while true
    do
        sleep .1
        stty -F "$gdmtty" -isig
    done

As I've documented before, I could be running 'stably' with X on a tty
shared with getty for more than an hour, seeing all the other signs
(scroll lock light toggling slowly etc.)

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#144

Seth, thanks for the additional workaround. I've applied your patch from ppa:bugs-sehe/gdm625239 and all has been well for about an hour. My boot time seemed a little slow again, after re-enabling my nVidia drivers, but I can live with that if I have a stable X session! I will report back if I see another crash, but so far your fix appears to be successful for me as well. Thanks.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-14:

#145

bootcharts.zip Edit (449.5 KiB, application/zip; name="bootcharts.zip")

On 09/14/2010 10:16 PM, Jamie Krug wrote:
> Seth, thanks for the additional workaround. I've applied your patch from
> ppa:bugs-sehe/gdm625239 and all has been well for about an hour. My boot
> time seemed a little slow again, after re-enabling my nVidia drivers,
> but I can live with that if I have a stable X session! I will report
> back if I see another crash, but so far your fix appears to be
> successful for me as well. Thanks.
>

Well, I can't say my boot has slowed down. My last boots were about 8
seconds (with 'my' gdm fix, attached bootchart[1]) so it would seems
that any slowness will not be attributable to nvidia init timing.
However, if you were to tell me that your boot time used to be _less_
than that ... I'm all ears !

I'm spoiled either way!

[1] I attach an older one too for comparison; the comparison is not
completely fair because I've since loaded my workstation with apache2
mod-mono for my summer tech workshop :) this, and perhaps a few other
'gadgets' may have slowed down the boot a tiny bit.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-14:

#146

Seth,

I just installed bootchart, rebooted and had 2 quick crashes (with your patch installed). I only see black screen when switching tty, so unable to check out logs so far. I'll report back soon.

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-15:

#147

bootcharts.tar.gz Edit (634.0 KiB, application/x-tar)

Seth,

With just a couple bootcharts, it appears I take 4 seconds longer to boot with nVidia drivers enabled. Also, despite having an extremely fast machine, your boots were about 400% faster than mine! I've attached the two, in case they're of any interest.

Is there any reason installing bootchart would reintroduce this bug, or does this just mean that your fix simply doesn't work for me? I could remove bootchart and see... Thanks.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-15:

#148

maverick.log Edit (1.1 KiB, text/plain)

@Steve:

I've taken the trouble to install maverick beta (64bit). I installed on the same iron, same disk configuration (2x SSD in raid0, lvm2 root). After doing the server install I installed ubuntu-desktop and bootchart.

My few boots took max 5.71s in bootchart (woah) and all my X desktops were 'first time right' (with the explicit vt7 param visible in ps -ef).

So that seems to imply that the problem does not manifest itself on maverick (yet). However, since we have such good understanding of the triggering events I decided to check with my force.sh (forced reproduction script from comment #142).

Surprisingly, not only is the new nvidia/xorg combination vulnerable to the same start up race, what's more: the timing has gotten faster by just enough that gdm-simple-slave (gdm-server.c) runs out of launching attempts. The end result in that scenario is _not pretty_: the upstart job ("task"?) is in status 'running' while there is no X session at all!

All in all, it appear that the potential that this race occurs on Maverick is present just like in Ubuntu. Timings are different, and I haven't yet spontaneously encountered the situation on boot. But on manual triggering the behaviour is far worse.

Attached is the output of running force.sh on maverick (completely uptodate as of now)

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-15:

#149

Hi Seth,

On Wed, Sep 15, 2010 at 12:25:13AM -0000, Seth wrote:
> So that seems to imply that the problem does not manifest itself on
> maverick (yet). However, since we have such good understanding of the
> triggering events I decided to check with my force.sh (forced
> reproduction script from comment #142).

> Surprisingly, not only is the new nvidia/xorg combination vulnerable to
> the same start up race,

That's not surprising to me at all. Your force.sh does not accurately model
the conditions on startup; instead of a trigger for gdm starting based on
udev events, you are forcing gdm to start in parallel to the modprobe of
nvidia-current, which is always going to be racy.

> what's more: the timing has gotten faster by just enough that
> gdm-simple-slave (gdm-server.c) runs out of launching attempts. The end
> result in that scenario is _not pretty_: the upstart job ("task"?) is in
> status 'running' while there is no X session at all!

Oh. The intended behavior in this situation is that the gdm process will
exit non-zero, so that the /etc/init/failsafe-x.conf job will start up and
present the user with appropriate options. It sounds like this is a
separate regression in maverick; would you mind filing a bug report against
the gdm package for this?

> All in all, it appear that the potential that this race occurs on
> Maverick is present just like in Ubuntu. Timings are different, and I
> haven't yet spontaneously encountered the situation on boot. But on
> manual triggering the behaviour is far worse.

force.sh doesn't establish this at all. There *could* still be a race -
successful boots don't rule this out - but if so, we need to find it based
on looking at /var/log/udev from maverick and/or by reproducing it with an
actual boot.

Actually, there might be an easier way. Could you boot with '--verbose'
added to the boot options, and attach the resulting /var/log/syslog? This
should show us what events upstart is seeing related to video cards, and
may be enough to confirm/deny that the requisite event is being sent in
response to the nvidia driver and not an unrelated framebuffer driver.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Hi Seth,

On Wed, Sep 15, 2010 at 12:25:13AM -0000, Seth wrote:
> So that seems to imply that the problem does not manifest itself on
> maverick (yet). However, since we have such good understanding of the
> triggering events I decided to check with my force.sh (forced
> reproduction script from comment #142).

> Surprisingly, not only is the new nvidia/xorg combination vulnerable to
> the same start up race,

That's not surprising to me at all.  Your force.sh does not accurately model
the conditions on startup; instead of a trigger for gdm starting based on
udev events, you are forcing gdm to start in parallel to the modprobe of
nvidia-current, which is always going to be racy.

> what's more: the timing has gotten faster by just enough that
> gdm-simple-slave (gdm-server.c) runs out of launching attempts.  The end
> result in that scenario is _not pretty_: the upstart job ("task"?) is in
> status 'running' while there is no X session at all!

Oh.  The intended behavior in this situation is that the gdm process will
exit non-zero, so that the /etc/init/failsafe-x.conf job will start up and
present the user with appropriate options.  It sounds like this is a
separate regression in maverick; would you mind filing a bug report against
the gdm package for this?

> All in all, it appear that the potential that this race occurs on
> Maverick is present just like in Ubuntu. Timings are different, and I
> haven't yet spontaneously encountered the situation on boot. But on
> manual triggering the behaviour is far worse.

force.sh doesn't establish this at all.  There *could* still be a race -
successful boots don't rule this out - but if so, we need to find it based
on looking at /var/log/udev from maverick and/or by reproducing it with an
actual boot.

Actually, there might be an easier way.  Could you boot with '--verbose'
added to the boot options, and attach the resulting /var/log/syslog?  This
should show us what events upstart is seeing related to video cards, and
may be enough to confirm/deny that the requisite event is being sent in
response to the nvidia driver and not an unrelated framebuffer driver.

-- 
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
Ubuntu Developer                                    http://www.debian.org/
slangasek@ubuntu.com                                     vorlon@debian.org

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-15:

#150

On Tue, Sep 14, 2010 at 02:23:31PM -0000, Jamie Krug wrote:
> Attaching /var/log/udev, /var/log/plymouth-debug.log and /proc/fb after boot with no GRUB_GFXMODE set and:
> GRUB_CMDLINE_LINUX_DEFAULT="vga=799 plymouth:debug"
> GRUB_CMDLINE_LINUX=""

Thanks. Could you now provide the same files when *not* booting with this
workaround?

One curious thing I notice in the plymouth log is that none of the renderers
load:

[./plugin.c] open_device:could not open '/dev/fb0': No such file or directory

Can you confirm that when booting this way, you get a text-only plymouth
splash screen?

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-15:

#151

On Tue, Sep 14, 2010 at 02:24:22PM -0000, Jamie Krug wrote:
> > Seth, Jamie, can one of you also show the output of 'grep fb
> > /var/log/kern.log' on a system showing this problem?

> Attached, after boot with standard configuration (no workarounds or
> related custom config at all).

> ** Attachment added: "kern.log-grep-fb.out"
> https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers/+bug/625239/+attachment/1583163/+files/kern.log-grep-fb.out

Great - it happens that this captures the output both when running with
grub_gfxmode (efifb runs, vga16fb tryies to start and fails), and without
(vga16fb starts and claims fb0). In both cases, the only framebuffer we see
here is unrelated to the nvidia kernel module, which explains how gdm is
starting before nvidia is ready.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-15:

#152

On Tue, Sep 14, 2010 at 07:38:49PM -0000, Seth wrote:
> > This could be brittle, because all this could do
> > is reduce the time window for the race condition to happen
> I can confirm my suspicion to be right because after - granted: many-
> some reboots I saw the log message again in successive reboots:

> /var/log/syslog:Sep 14 20:38:39 localhost gdm-binary[984]: WARNING:
> GdmDisplay: display lasted 0.695752 seconds
> /var/log/syslog:Sep 14 20:42:04 localhost gdm-binary[1060]: WARNING:
> GdmDisplay: display lasted 0.198205 seconds

Can you include the gdm/X logs from these sessions, showing why it reset so
quickly - just to be absolutely certain of what we're seeing?

> This was running _with_ my fixed gdm package, but from the analysis we
> know that in this case the X desktop would have ended up shared with a
> getty.

<nod>

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message

Martin Pitt (pitti) wrote on 2010-09-15:

#153

This sounds a lot like bug 615549 which was fixed yesterday. Is this still an issue with gdm 2.30.5-0ubuntu4?

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-15:

#154

On Tue, Sep 14, 2010 at 03:49:25PM -0000, Jamie Krug wrote:
> The "graphics-device-added fb0 PRIMARY_DEVICE_FOR_DISPLAY=1" removal
> tweak is not a guaranteed workaround after all. I just booted up again
> and crashed within a minute of use. I was just waiting for a Chrome tab
> to open--no keyboard activity this time.

That sounds like a different sort of crash, then. Please show the gdm/X
logs from the crash in question - as well as for any previous X server start
attempts as part of this boot cycle.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-15:

#155

On Wed, Sep 15, 2010 at 06:55:49AM -0000, Martin Pitt wrote:
> This sounds a lot like bug 615549 which was fixed yesterday.

Yep, sounds like exactly the same bug.

> Is this still an issue with gdm 2.30.5-0ubuntu4?

So far the only test on maverick from someone who's confirmed to see this
bug on lucid has failed to reproduce the problem there, even /without/ the
new gdm package. Nevertheless, I think there's enough information here to
mark this as a duplicate of 615549 and target that bug for SRU.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-15:

#156

On 09/15/2010 07:23 AM, Steve Langasek wrote:
> Can you include the gdm/X logs from these sessions, showing why it reset so
> quickly - just to be absolutely certain of what we're seeing?
>
Good call. I've reviewed all leftover Xorg logs and there were no (EE)
messages, particularly note related to nvidia. I have no reason to
believe that logs have gone missing, so I need to be more sure to make
that call. I will keep running with the modded upstart condition (which
incidentally matches the one from maverick, I believe?):

    start on (filesystem
              and started dbus
              and (drm-device-added card0 PRIMARY_DEVICE_FOR_DISPLAY=1
                   or stopped udevtrigger))
    stop on runlevel [016]

Let you know how I fare

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-15:

#157

On 09/15/2010 03:02 AM, Steve Langasek wrote:
> Actually, there might be an easier way. Could you boot with '--verbose'
> added to the boot options, and attach the resulting /var/log/syslog?
Will do later tonight

Revision history for this message

Martin Pitt (pitti) wrote on 2010-09-15:

#158

Seth [2010-09-15 7:53 -0000]:
> I will keep running with the modded upstart condition (which
> incidentally matches the one from maverick, I believe?):

Correct.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-15:

#159

gdm_debug.sh Edit (687 bytes, application/x-sh; name="gdm_debug.sh")

Download full text (3.4 KiB)

On 09/15/2010 09:17 AM, Steve Langasek wrote:
> So far the only test on maverick from someone who's confirmed to see this
> bug on lucid has failed to reproduce the problem there, even /without/ the
> new gdm package. Nevertheless, I think there's enough information here to
> mark this as a duplicate of 615549 and target that bug for SRU.
>
Count me in. I just went the 'IKEA way' and added a reboot loop section
to my rc.local that logs all conceivable related info per boot[1]. The
used script has been attached. Per boot, a number of logfiles is
collected, various ps listings, package versions, upstart config is
saved in a tarball.

This resulted in 79 tarballs, which I intend to attach later in sofar
Martin/Steve have been asking for the info contained.

I tested 79 boots (_real_ boots) in all, in three batches:

--------------------------------------------------
(A) with plain vanilla lucid config and package: 40 boots. 30 of these
boots landed on tty2 rather than tty7:

All tty2 occurences coincide with multiple (1 _or_ 2) messages in the
logs about re-attempting to launch X:

    root@lucid:~/first# for a in gdmdebug_1*; do tar xOf $a
    var/log/daemon.log | grep 'display lasted' | wc -l; done | sort |
    uniq -c
         25 0
          6 1
          9 2

Interestingly, of the failing cases (75% of total boots), two-thirds
required 3 attempts, against one-third requiring two attempts to launch
X. Ah, well.

--------------------------------------------------
(B) with plain vanilla packages but maverick-style startup condition
(from bug 615549). 23 out of 23 land on tty7 just fine

--------------------------------------------------
(C) in an effort to really torture test this when really reducing the
boot time to the absolute minimum ), i switched of as many services as I
could. Resulting boot time (1 sample) was around 7.20s. Obviously the
maverick fresh install was even faster than that.

Happily, 16 out of 16 boots landed X on tty7

All three batches were running the standard lucid package of gdm (manual
modifications done to gdm.conf in B and C):

    root@lucid:~# for a in */gdmdebug_*; do tar xOf $a | grep ^Version;
    done | uniq -c
         79 Version: 2.30.2.is.2.30.0-0ubuntu3

Hence, I would consider the fix in #615549 valid for this bug. Note that
it needs to be back-ported to lucid!
Finally, I think I will report the premature creation of
/var/run/gdm/firstserver.stamp as well as broken handling of gdm start
failure as bugs with gdm.

Thanks for all the thinking along, and especially Martin(?) for coming
up with the root cause.

[1] quote:

#!/bin/bash
#Not: set -e

...

    sleep 10
    chvt 1; echo "INFO: auto logging debug info"
   ...

On 09/15/2010 09:17 AM, Steve Langasek wrote:
> So far the only test on maverick from someone who's confirmed to see this
> bug on lucid has failed to reproduce the problem there, even /without/ the
> new gdm package.  Nevertheless, I think there's enough information here to
> mark this as a duplicate of 615549 and target that bug for SRU.
>   
Count me in. I just went the 'IKEA way' and added a reboot loop section
to my rc.local that logs all conceivable related info per boot[1]. The
used script has been attached. Per boot, a number of logfiles is
collected, various ps listings, package versions, upstart config is
saved in a tarball.

This resulted in 79 tarballs, which I intend to attach later in sofar
Martin/Steve have been asking for the info contained.

I tested 79 boots (_real_ boots) in all, in three batches:

--------------------------------------------------
(A) with plain vanilla lucid config and package: 40 boots. 30 of these
boots landed on tty2 rather than tty7:

All tty2 occurences coincide with multiple (1 _or_ 2) messages in the
logs about re-attempting to launch X:

root@lucid:~/first# for a in gdmdebug_1*; do tar xOf $a
    var/log/daemon.log | grep 'display lasted' | wc -l; done | sort |
    uniq -c
         25 0
          6 1
          9 2

Interestingly, of the failing cases (75% of total boots), two-thirds
required 3 attempts, against one-third requiring two attempts to launch
X. Ah, well.

--------------------------------------------------
(B) with plain vanilla packages but maverick-style startup condition
(from bug 615549). 23 out of 23 land on tty7 just fine

--------------------------------------------------
 (C) in an effort to really torture test this when really reducing the
boot time to the absolute minimum ), i switched of as many services as I
could. Resulting boot time (1 sample) was around 7.20s. Obviously the
maverick fresh install was even faster than that.

Happily, 16 out of 16 boots landed X on tty7

All three batches were running the standard lucid package of gdm (manual
modifications done to gdm.conf in B and C):

root@lucid:~# for a in */gdmdebug_*; do tar xOf $a | grep ^Version;
    done | uniq -c
         79 Version: 2.30.2.is.2.30.0-0ubuntu3

Hence, I would consider the fix in #615549 valid for this bug. Note that
it needs to be back-ported to lucid!
Finally, I think I will report the premature creation of
/var/run/gdm/firstserver.stamp as well as broken handling of gdm start
failure as bugs with gdm.

Thanks for all the thinking along, and especially Martin(?) for coming
up with the root cause.

[1] quote:

#!/bin/bash
    #Not: set -e

...

sleep 10
    chvt 1; echo "INFO: auto logging debug info"
    bash /etc/gdm_debug.sh

stop gdm
    echo "WARNING: purging logs and rebooting in 4 seconds"
    sleep 4
    rm -fv /var/log/Xorg.* /var/log/gdm/* /var/run/gdm/firstserver.stamp
    rm    /var/log/{syslog,user.log,kern.log,daemon.log}
    touch /var/log/{syslog,user.log,kern.log,daemon.log}
    reboot
    exit 0

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-15:

#160

On 09/15/2010 03:02 AM, Steve Langasek wrote:
> Actually, there might be an easier way. Could you boot with '--verbose'
> added to the boot options
Would that go into grub config? I'm not used to seeing --flags on the
kernel param line?

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-15:

#161

bug-625239-logs.tar.gz Edit (82.8 KiB, application/x-tar)

Steve Langasek wrote in #150:
> Thanks. Could you now provide the same files when *not* booting with this
> workaround?

Yes, the attached archive contains two sets of logs, both without vga=799 or any other workaround in place. One has "quiet splash" appended to the boot line, which landed X on tty2 (as expected, and would likely crash eventually). The other has "plymouth:debug" appended to the boot line, which landed X on tty7 (and is generally more stable, rarely crashing--another near workaround). I've included udev, /proc/fb and any relevant X/gdm logs. HTH!

> One curious thing I notice in the plymouth log is that none of the renderers
> load:
>
> [./plugin.c] open_device:could not open '/dev/fb0': No such file or directory
>
> Can you confirm that when booting this way, you get a text-only plymouth
> splash screen?

Yes, with vga=799 I see a text-only splash.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-15:

#162

for info, I have deleted ppa:bugs-sehe/gdm625239 because I don't think it is relevant anymore. The patch is still available for reference in comment #109

Revision history for this message

Mohamed Amine Ilidrissi (ilidrissi.amine) wrote on 2010-09-15:

#163

Confirming on Maverick, but before changing the status, is there any other information you would like me to report?

Revision history for this message

Steve Langasek (vorlon) wrote on 2010-09-15:

#164

On Wed, Sep 15, 2010 at 06:32:41PM -0000, Mohamed Amine IL Idrissi wrote:
> Confirming on Maverick, but before changing the status, is there any
> other information you would like me to report?

Confirming with what version of gdm? The latest version of gdm in maverick,
2.30.5-0ubuntu4, is believed to correct this issue for systems with the
nvidia binary drivers.

We would also want to see gdm/X logs showing that you're experiencing the
same crash, before considering this a confirmation.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message

Jamie Krug (jamie-thekrugs) wrote on 2010-09-16:

#165

Forgive me if I'm missing something, but based on comments from Martin and Steve, and Seth's thorough testing reported in comment #159, my understanding is that I should simply wait for the bug #615549 fix to be back-ported to Lucid. Is this correct? Should I be updating bug #615549 for Lucid in some way, or will this bug continue as a way to get that back-port? I don't want to step on toes of developers or other who know better, but I didn't want to sit here doing nothing if I could help move this along :) Thanks.

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2010-09-20:

#166

@Jamie:
Looks like Martin is taking care of it in #615549. If I recall correctly, it will be (has been?) uploaded to the proposed queue and, after testing, it will be released to Lucid.

Revision history for this message

Seth (bugs-sehe) wrote on 2010-09-20:

#167

On 09/20/2010 03:39 PM, John S. Gruber wrote:
> *** This bug is a duplicate of bug 615549 ***
> https://bugs.launchpad.net/bugs/615549
>
> @Jamie:
> Looks like Martin is taking care of it in #615549. If I recall correctly, it will be (has been?) uploaded to the proposed queue and, after testing, it will be released to Lucid.
>
Yep and I vouch for it to work (having done extensive automated boot
testing on both lucid and maverick. On maverick my boot time can be as
low as 5s according to bootchart.)

Ubuntu
nvidia-graphics-drivers package

X starts on wrong tty because gdm starts before nvidia driver is ready

Bug Description

Other bug subscribers

Patches

Bug attachments

Remote bug watches

	Status	Importance	Assigned to
gdm (Ubuntu)	Incomplete	High	Unassigned
Lucid	Confirmed	Undecided	Unassigned
Maverick	Incomplete	High	Unassigned
nvidia-graphics-drivers (Ubuntu)	Incomplete	High	Unassigned
Lucid	Confirmed	Undecided	Unassigned
Maverick	Incomplete	High	Unassigned

Ubuntunvidia-graphics-drivers package

X starts on wrong tty because gdm starts before nvidia driver is ready

Bug Description

Other bug subscribers

Patches

Bug attachments

Remote bug watches

Ubuntu
nvidia-graphics-drivers package