Comment 11 for bug 191119

Alvin Thompson (alvint-deactivatedaccount) wrote :

I'm adding this message from the thread to provide as much information as possible for people trying to recreate this bug:

Top-posting for (my) convenience on this one...

It's nice to see someone actually trying to recreate the bug instead of
just flapping their gums trying to sound smart. If you have the time,
could you try recreating it again? I have some suggestions to make it
more like my scenario (in fact, number 3 below is required for the
problem to occur, and number 4 below is likely to be required). I know
these suggestions are long, but it would be appreciated. In for a
penny, in for a pound, eh? You'll have to do the first install again,
but you won't have to actually go through with the second install. You
can cancel after getting past the partitioning screen. I've noticed
that when things go awry there are two tell-tale signs:

1. On the partitioning screen, the original (now defunct) file system(s)
will be detected and show up.

2. Once you select "finish partitioning", the installer will show a list
of partition tables that will be modified. One or more RAID partitions
will show up on the list, regardless of the fact that you didn't
select them for anything.

If those signs are present, the RAID array will be hosed. If the signs
are not there, the install will go fine and there's no need to continue.
Additionally, you don't have to worry about test data or even mounting
the RAID array.
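
If you want to double-check whether the array actually got hosed,
something along these lines should tell you (a sketch; md1 and the
sd[abcd]2 members are the device names from my setup, and xfs_repair's
-n flag only checks, it doesn't write anything):

    # look at the RAID superblocks on the member partitions
    sudo mdadm --examine /dev/sd[abcd]2
    # check the assembled array and the XFS file system read-only
    sudo mdadm --detail /dev/md1
    sudo xfs_repair -n /dev/md1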

When doing the second install, if you post exactly which file systems
were detected on the manual partitioning screen and which partitions
were shown on the "to be modified" list once you hit "finish
partitioning", I'd appreciate it. Now on to the suggestions:

1. It sounds like in your setup you installed for the second time after
setting up the RAID array, but before the array finished resyncing for
the first time. In my setup, the array had been around for a while and
was fully resynced. In fact, I (likely) waited for the array to be
fully resynced before even installing XFS on it. If you *did* wait for
the drive to finish resyncing before the second install, please let me
know, because that would mean your array was indeed corrupted but, since
only one drive was affected, it was somehow able to resync and recover.
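
In case it helps, this is roughly how to check that an array has finished
its initial resync (md1 is just the device name from my setup):

    # shows resync/recovery progress while the initial sync is running
    cat /proc/mdstat
    # "State : clean" with no rebuild line means the sync is done
    sudo mdadm --detail /dev/md1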

2. I was using the Ubuntu 10.04 beta 2 server 64-bit disk for the
(second) install when things went south. Could you try that one?

3. REQUIRED. It sounds like when doing the second install, you just
installed to an existing partition. In order for the problem to occur,
you have to remove/create a partition (even though you're leaving the
RAID partitions alone). If you recreate the partitions I used (6,
below), this will be taken care of.

4. POSSIBLY REQUIRED. When you create the RAID array with the default
options, as you did in your first test, the array is created in degraded
mode, with a drive added later and resynced. This makes the
initial sync faster. Since I'm a neurotic perfectionist, I always
create my arrays with the much more manly and macho "--force" option to
create them "properly". Its very possible that doing the initial resync
with a degraded array will overwrite the defunct file system, while
doing the initial sync in the more "proper" way will not. Please use
the "--force" option when creating the array to take this possibility
into account.
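
For reference, the kind of create command I have in mind looks like this
(a sketch; md1 and the sd[abcd]2 partitions are from my layout in number
6 below):

    # --force starts with all four members active and does a full resync,
    # instead of the default shortcut of creating the array degraded and
    # then recovering onto the last drive
    sudo mdadm --create /dev/md1 --level=5 --raid-devices=4 --force \
        /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2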

5. My array has 4 drives. If possible, could you scare up a fourth
drive? If not, don't worry about it, especially if you do number 7.

6. Prior to/during the infamous install, my partitions were as follows.
If feasible, recreating as much of this as possible would be appreciated:

    sd[abcd]1 25GB
    sd[abcd]2 475GB

    My RAID5 array was sd[abcd]2 set up as md1, and my file systems were:

    sda1: ext4 /
    md1: xfs /data
    sd[bcd]1: (partitioned, but not used)

    Note I had no swap partition originally.

    On the manual partition screen of the ill-fated install, I left the
sd[abcd]2 partitions alone (RAID array), deleted all the sd[abcd]1
partitions, and created the following partitions:

    sd[abcd]1: 22GB RAID5 md2 /
    sd[abcd]3*: 3GB RAID5 md3 (swap)

    * POSSIBLY REQUIRED: Note that partition 2 of the drives' partition
tables was already taken by the RAID, so I created the 3GB partitions as
partition 3, even though the sectors in partition 3 physically resided
before the sectors in partition 2. This is perfectly "legal", if not
"normal" (fdisk will warn you about this). Please try to recreate this
condition if you can, because it's very possible that was the source of
the problems.

    BTW, all partitions were primary partitions.

    If you don't have that much space, you can likely get away with
making sd[abcd]2 as small as needed.
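
    To sanity-check the out-of-order layout after repartitioning, dumping
the partition table is enough (a sketch; sda stands in for each of the
four drives):

        # the start sector of partition 3 should be lower than partition 2's
        sudo sfdisk -d /dev/sda
        # fdisk will also warn that the partition table entries are not in
        # disk order, which is exactly the condition I'm describing
        sudo fdisk -l /dev/sda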

7. I simplified when recreating this bug, but in my original scenario I
had 2 defunct file systems detected by the installer: one on sda2 and
one on sdd2 (both ext4). That's why I couldn't just fail and remove the
corrupted drive even if I had known to do so at that point. I figure
the more defunct file systems there are, the more chances you have of
recreating the bug. So how about creating file systems on all four
partitions (sd[abcd]2) before creating the RAID array?
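
Something like this would do it (a sketch; ext4 to match what I had,
device names from number 6 above):

    # leave stale ext4 superblocks for the installer to "detect" later
    for part in /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2; do
        sudo mkfs.ext4 "$part"
    done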

8. My original setup left the RAID partitions' type as "linux" instead
of "RAID autodetect". It's no longer necessary to set the partition
type for RAID members, as the presence of the RAID superblock is enough.
When recreating the problem I did set the type to "RAID autodetect", but
to be thorough, try leaving the type as "linux".
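
A quick way to see what the types currently are (83 is plain "Linux",
fd is "Linux raid autodetect"; sda stands in for each drive):

    # the type column shows the partition type for each member
    sudo fdisk -l /dev/sda
    # the type can be flipped with fdisk's interactive 't' command if needed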

9. If you *really* have too much time on your hands, my original Ubuntu
install, used for creating the original file systems, was 8.10 desktop
64-bit. I created the non-RAID file systems during the install and the
RAID array
after the install, after apt-getting mdadm. I seriously doubt this
makes a difference though.

10. I was using an external USB DVD-ROM drive to do the install. It's
very remotely possible that, since the drive has to be re-detected
during the install process, the device letters could wind up getting
reshuffled. If
you have an external CD or DVD drive, could you try installing with it?
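
If you want to rule out letter reshuffling, comparing the persistent
names against the sdX letters before and after the install should show
it (a sketch; the grep just hides the per-partition entries):

    # maps stable by-id names to whatever sdX letters were assigned this boot
    ls -l /dev/disk/by-id/ | grep -v part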

If you (or anybody) can try recreating the problem with this new
information I'd very much appreciate it.

Thanks,
Alvin
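
P.S. For anyone following along, the reassembly in steps 19-22 of J's
message below boils down to roughly this (a sketch; md0, /data, and xfs
are taken from that message, and the exact fstab line is illustrative):

    # assemble whatever arrays are described in mdadm.conf
    sudo mdadm --assemble --scan
    # wait here until any rebuild shown in /proc/mdstat finishes
    cat /proc/mdstat
    # illustrative fstab entry so md0 mounts at /data on boot
    echo '/dev/md0  /data  xfs  defaults  0  2' | sudo tee -a /etc/fstab
    sudo mount /data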

On 04/23/2010 02:22 PM, J wrote:
> FWIW, this is what I just went through, step by step to try to
> recreate a loss of data on an existing software raid array:
>
> 1: Installed a fresh Karmic system on a single disk with three partitions:
> /dev/sda1 = /
> /dev/sda2 = /data
> /dev/sda3 = swap
>
> all were primary partitions.
>
> 2: After installing 9.10, I created some test "important data" by
> copying the contents of /etc into /data.
> 3: For science, rebooted and verified that /data automounted and the
> "important data" was still there.
> 4: Shut the system down and added two disks. Rebooted the system.
> 5: Moved the contents of /data to /home/myuser/holding/
> 6: created partitions on /dev/sdb and /dev/sdc (the two new disks, one
> partition each)
> 7: installed mdadm and xfsprogs, xfsdump
> 8: created /dev/md0 with mdadm using /dev/sda2, /dev/sdb1 and
> /dev/sdc1 in a RAID5 array
> 9: formatted the new raid device as xfs
> 10: configured mdadm.conf and fstab to start and automount the new
> array to /data at boot time.
> 11: mounted /data (my new RAID5 array) and moved the contents of
> /home/myuser/holding to /data (essentially moving the "important data"
> that used to reside on /dev/sda2 to the new R5 ARRAY).
> 12: rebooted the system and verified that A: RAID started, B: /data
> (md0) mounted, and C: my data was there.
> 13: rebooted the system using Lucid
> 14: installed Lucid, choosing manual partitioning as you described.
> **Note: the partitioner showed all partitions, but did NOT show the
> RAID partitions as ext4
> 15: configured the partitioner so that / was installed to /dev/sda1
> and the original swap partition was used. DID NOT DO ANYTHING with the
> RAID partitions.
> 16: installed. Installer only showed formatting /dev/sda1 as ext4,
> just as I'd specified.
> 17: booted newly installed Lucid system.
> 18: checked with fdisk -l and saw that all RAID partitions showed as
> "Linux raid autodetect"
> 19: mdadm.conf was autoconfigured and showed md0 present.
> 20: edited fstab to add the md0 entry again so it would mount to /data
> 21: did an mdadm --assemble --scan and waited for the array to rebuild
> 22: after rebuild/re-assembly was complete, mounted /data (md0)
> 23: verified that all the "important data" was still there, in my
> array, on my newly installed Lucid system.
>
> The only thing I noticed was that when I did the assembly, it started
> degraded with sda2 and sdb1 as active and sdc1 marked as a spare with
> rebuilding in progress.
>
> Once the rebuild was done was when I mounted the array and verified my
> data was still present.
>
> So... what did I miss in recreating this failure?