Comment 433 for bug 88746

David Becker (becker-david) wrote:

I seem to have solved (i.e., worked around) my previous issues. Note that the "lost page write" errors I was getting are likely due to the LVM copy-on-write store I'm using. When the store overflows, all kinds of strange things happen, but this is probably unrelated to the issues being discussed here. "df" output also doesn't reflect the actual usage of the store, so this error may occur when you're not expecting it, i.e. when df suggests you have enough storage space but the copy-on-write store has in fact overflowed.
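For anyone who wants to check the store directly instead of trusting "df", here is a minimal Python sketch. It assumes the copy-on-write store is a device-mapper snapshot and that "dmsetup status" reports it in the "<allocated>/<total> <metadata>" sector form; older kernels may print something different, so treat the parsing as illustrative. The function name is my own.

```python
def parse_snapshot_status(line):
    """Parse one line of `dmsetup status` output for a snapshot target.

    Returns (state, percent_used). percent_used is None unless state is "ok".
    Assumes the "<start> <length> snapshot <allocated>/<total> ..." layout.
    """
    fields = line.split()
    # When no device is named, dmsetup prefixes each line with "name:"; drop it.
    if fields and fields[0].endswith(":"):
        fields = fields[1:]
    if len(fields) < 4 or fields[2] != "snapshot":
        raise ValueError("not a snapshot status line")
    usage = fields[3]
    if "/" not in usage:
        # A word such as "Invalid" or "Overflow" replaces the ratio
        # once the store can no longer be used.
        return usage.lower(), None
    allocated, total = (int(x) for x in usage.split("/"))
    return "ok", 100.0 * allocated / total
```

Feeding it the snapshot line from "dmsetup status" gives the fill level of the store itself, which is the number "df" does not show.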

Anyway, I've installed Linux to a USB drive without the copy-on-write store and I no longer have any disconnect, lost page write or other instability problems. Note that the USB drive and the computer(s) are the same pieces of hardware that were previously producing the errors. All AMD/ATI hardware.

But I also connected the same USB drive to a ProLiant, which has (mostly) Intel hardware. Low-speed USB is uhci-based while high-speed is ehci-based.

Now I also got several errors with the ProLiant (ehci), which leads me to believe that the errors could have something to do with deficiencies in either hald or dbus communication.

I was preparing another live-cd-on-a-usb-stick. This process involves mounting the USB drive, then mounting additional (tmpfs) filesystems onto the USB drive, then preparing the drive (formatting), then populating the drive. The copy-on-write store is also involved at this stage.

Now during the preparation stage, with the USB drive mounted, I decided to abort the process. Here's where the errors started occurring. I hit ctrl-c on the process which was creating the livecd-usb and started receiving disconnect errors. Note that a ctrl-c could be analogous to a USB disconnect, although there are also significant differences (since the signals originate from different sources). I manually unmounted the filesystems involved in the preparation process. One would expect that I would then be able to (physically) remove the drive, reinsert it and start the process over, but that wasn't the case. When I reinserted the drive, I couldn't access it anymore. No real error messages; it seemed as if the port was unavailable. I did what I normally never have to do (with this machine): I rebooted.

After the reboot, I couldn't use that port anymore. I kept getting disconnect errors. Things were going from bad to worse and I rebooted the machine again. I then tried the same process on the front-side ports (I was using a port on the back previously). I had no problem performing the aforementioned process to completion on the front-side port.

I now have this funny feeling that something is going wrong with the mount state of the filesystems involved. This really reminds me of removing a floppy disk from a Sun workstation without having invoked the "eject" command (which unmounts the floppy prior to physically ejecting the disk).
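To check the mount state before pulling the drive, something like the following Python sketch could help. It takes the text of /proc/mounts and lists every mount point that belongs to the USB device, plus anything mounted beneath those points (such as the tmpfs filesystems), deepest first, which is the order they would have to be unmounted in. The function name and the device prefix in the usage note are hypothetical.

```python
def mounts_to_release(proc_mounts_text, device_prefix):
    """List mount points tied to a device, deepest first.

    proc_mounts_text: contents of /proc/mounts.
    device_prefix: device name prefix, e.g. "/dev/sdb" (hypothetical).
    Note: /proc/mounts escapes spaces in paths as \\040; not decoded here.
    """
    entries = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))  # (device, mount point)

    # Mount points of the device's own partitions.
    roots = [mp for dev, mp in entries if dev.startswith(device_prefix)]

    found = set()
    for _dev, mp in entries:
        for root in roots:
            # Either the root itself or something mounted below it.
            if mp == root or mp.startswith(root.rstrip("/") + "/"):
                found.add(mp)

    # More path separators means deeper nesting, so unmount those first.
    return sorted(found, key=lambda mp: (-mp.count("/"), mp))
```

Calling it as mounts_to_release(open("/proc/mounts").read(), "/dev/sdb") and getting an empty list back would mean the kernel no longer considers anything on that device mounted. It says nothing about what hald or other processes believe.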

It would seem that "disconnects" are quite normal on a USB bus. It does however get very tricky with the dependencies once a disconnect occurs (is it intermittent or permanent?). This likely differs between device types (a printer daemon and a filesystem may respond differently), but it seems as if a discrepancy arises between the device connection state and how the dependent processes/modules (i.e. the mounting/filesystem or printer subsystem) perceive that state. It wouldn't surprise me if the connection state doesn't correspond with the actual device state. From that moment on it's fubar until by chance the actual device state corresponds with the perceived state (within the relevant modules/processes).

FWIW, and possibly a long shot, I'd ask the people who are receiving errors during large transfers to disable hald prior to initiating the transfer. That is, have the device (auto) mounted, then disable hald, then start the transfer. The same thing could be causing printer errors and even USB wireless devices to reach a no-go state.

Disabling hald may obviously defeat your (other) purposes, but it could isolate the problem (or possibly just rule out hald's involvement in this ordeal).
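If you try this, it may be worth confirming that hald is really gone before starting the transfer. Here is a rough Python sketch that scans /proc for processes with a given command name. It assumes the daemon's process name is exactly "hald" (helper processes with other names are not matched), and the function name is my own.

```python
import os

def pids_named(name, proc_root="/proc"):
    """Return the PIDs whose command name in <proc_root>/<pid>/stat equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name sits between the first "(" and the last ")".
        comm = stat[stat.find("(") + 1:stat.rfind(")")]
        if comm == name:
            pids.append(int(entry))
    return sorted(pids)
```

An empty result from pids_named("hald") after disabling the daemon means it is no longer running, so any errors seen during the transfer can't be blamed on it.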

Hope this helps,

David