• [gentoo-user] Fun with mdadm (Software RAID)

    From Alan Mackenzie@21:1/5 to All on Fri Dec 20 11:50:03 2024
    Hello, Gentoo.

    After having got the syslinux boot manager working well, I lost the root partition on my newer machine. I spent the entire evening yesterday
    trying to get it back again, with various expedients for recovering ext4 partitions from backup superblocks, and so on.

    It wasn't until the middle of the night that it dawned on me what had
    happened, and I immediately got up and had it fixed within twenty
    minutes.

    The cause was me booting up the machine with a rescue disk. This
    assembled my RAID partitions /dev/md127 and /dev/md126 reversed, but
    also wrote those wrong identifiers, 126 and 127, into the "preferred
    minor" field of the partitions' super blocks. In essence, they got
    swapped.

    Hence, when I tried to boot into my normal system, /dev/md126, the root
    partition, was an unformatted empty space on the SSD.

    I don't blame the rescue disk for this occurrence. For some reason,
    when the kernel assembles /dev/md devices, it only seems to pay
    attention to the "preferred minor" fields when they are wrong. :-(

    mdadm appears to write the "preferred minor" fields at random when
    assembling the RAID arrays. I don't think it should, unless explicitly
    asked. There is an argument to mdadm which specifies the writing of
    these fields. In fact I used this to effect a repair, ironically
    enough, from the rescue disk booted with the option to suppress the
    automatic assembly of the arrays.

    Just for the record, all my RAID arrays have metadata version 0.90, the
    (old fashioned) one that allows auto-assembly by the kernel without the
    need of an initramfs.

    The moral of the story: if your system uses software RAID, be careful
    indeed before you boot up with a rescue disk.

    --
    Alan Mackenzie (Nuremberg, Germany).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From karl@aspodata.se@21:1/5 to All on Fri Dec 20 16:00:01 2024
    Alan Mackenzie:
    ...
    The cause was me booting up the machine with a rescue disk. This
    assembled my RAID partitions /dev/md127 and /dev/md126 reversed, but
    also wrote those wrong identifiers, 126 and 127, into the "preferred
    minor" field of the partitions' super blocks. In essence, they got
    swapped.
    ...
    Just for the record, all my RAID arrays have metadata version 0.90, the
    (old fashioned) one that allows auto-assembly by the kernel without the
    need of an initramfs.

    The moral of the story: if your system uses software RAID, be careful
    indeed before you boot up with a rescue disk.

    So, why don't you simply add "root=902 md=2,/dev/sda2,/dev/sdb2" or similar to your boot loader kernel command line?
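
    A note on the numeric form: as far as I know, a bare number given to
    root= is read as hexadecimal and split into major and minor the way the
    kernel's new_decode_dev() helper does it, so 902 would mean major 9 (md),
    minor 2, i.e. /dev/md2. A minimal userspace sketch of that split,
    assuming that encoding (root_major, root_minor and root_902_check are
    names made up for this illustration, not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

/* Major number: bits 8..19 of the packed value (assumed encoding). */
unsigned int root_major(uint32_t dev)
{
    return (dev & 0xfff00U) >> 8;
}

/* Minor number: low 8 bits, plus any high bits stored above bit 20. */
unsigned int root_minor(uint32_t dev)
{
    return (dev & 0xffU) | ((dev >> 12) & 0xfff00U);
}

/* "root=902" is parsed as hex 0x902: expect major 9, minor 2. */
void root_902_check(void)
{
    assert(root_major(0x902U) == 9U);
    assert(root_minor(0x902U) == 2U);
}
```

    Calling root_902_check() returns silently if the split is as described.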

    ///

    And... what is the need for dynamic minors now when dev_t is 32bits:

    $ grep dev_t /Net/git/linux-stable/include/linux/types.h
    typedef u32 __kernel_dev_t;
    typedef __kernel_dev_t dev_t;
    $

    and we have 20 bits minors:

    $ grep -A1 MINORBITS /Net/git/linux-stable/include/linux/kdev_t.h
    #define MINORBITS 20
    #define MINORMASK ((1U << MINORBITS) - 1)

    #define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
    #define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
    #define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))
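
    To see what those macros do with an md device, here is a small userspace
    sketch that copies them and round-trips /dev/md126 (major 9, minor 126).
    The function name kdev_roundtrip_check is invented for this example, and
    the macros are reproduced from the grep output above rather than
    included from kernel headers:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace copies of the macros from include/linux/kdev_t.h quoted above. */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

/* Pack major 9 / minor 126 into a 32-bit dev_t and unpack it again. */
void kdev_roundtrip_check(void)
{
    uint32_t dev = MKDEV(9U, 126U);

    assert(dev == 0x90007eU);        /* 9 << 20, plus 126 */
    assert(MAJOR(dev) == 9U);
    assert(MINOR(dev) == 126U);
    assert(MINORMASK == 1048575U);   /* largest minor that fits in 20 bits */
}
```

    With 20 bits there is room for over a million minors per major.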

    Regards,
    /Karl Hammar

  • From Alan Mackenzie@21:1/5 to karl@aspodata.se on Fri Dec 20 16:30:01 2024
    Hello, Karl.

    On Fri, Dec 20, 2024 at 15:50:53 +0100, karl@aspodata.se wrote:
    Alan Mackenzie:
    ...
    The cause was me booting up the machine with a rescue disk. This
    assembled my RAID partitions /dev/md127 and /dev/md126 reversed, but
    also wrote those wrong identifiers, 126 and 127, into the "preferred
    minor" field of the partitions' super blocks. In essence, they got swapped.
    ...
    Just for the record, all my RAID arrays have metadata version 0.90, the (old fashioned) one that allows auto-assembly by the kernel without the need of an initramfs.

    The moral of the story: if your system uses software RAID, be careful indeed before you boot up with a rescue disk.

    So, why don't you simply add "root=902 md=2,/dev/sda2,/dev/sdb2" or similar to
    your boot loader kernel command line?

    Because I didn't know about it. I found out about it this morning, and immediately tested it by setting up an
    "md=126,/dev/nvme0n1p4,/dev/nvme1n1p4" on the kernel command line, using
    the rescue disk to make the "preferred minor"s wrong, and testing it.
    It worked!

    If I understand things correctly, with this mechanism one can have the
    kernel assemble the RAID arrays at boot up time with a modern metadata,
    but still without needing the initramfs. My arrays are still at
    metadata 0.90.

    ///

    And... what is the need for dynamic minors now when dev_t is 32bits:

    Dynamic minors? I don't think I follow you, here.

    $ grep dev_t /Net/git/linux-stable/include/linux/types.h
    typedef u32 __kernel_dev_t;
    typedef __kernel_dev_t dev_t;
    $

    and we have 20 bits minors:

    $ grep -A1 MINORBITS /Net/git/linux-stable/include/linux/kdev_t.h
    #define MINORBITS 20
    #define MINORMASK ((1U << MINORBITS) - 1)

    #define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS))
    #define MINOR(dev) ((unsigned int) ((dev) & MINORMASK))
    #define MKDEV(ma,mi) (((ma) << MINORBITS) | (mi))

    Regards,
    /Karl Hammar

    --
    Alan Mackenzie (Nuremberg, Germany).

  • From karl@aspodata.se@21:1/5 to All on Fri Dec 20 18:50:01 2024
    Alan Mackenzie:
    On Fri, Dec 20, 2024 at 15:50:53 +0100, karl@aspodata.se wrote:
    ...
    Because I didn't know about it. I found out about it this morning, and immediately tested it by setting up an
    "md=126,/dev/nvme0n1p4,/dev/nvme1n1p4" on the kernel command line, using
    the rescue disk to make the "preferred minor"s wrong, and testing it.
    It worked!

    If I understand things correctly, with this mechanism one can have the
    kernel assemble the RAID arrays at boot up time with a modern metadata,
    but still without needing the initramfs. My arrays are still at
    metadata 0.90.

    Please tell if you make booting with metadata 1.2 work.
    I haven't tested that.

    ///

    ...
    And... what is the need for dynamic minors now when dev_t is 32bits:
    Dynamic minors? I don't think I follow you, here.

    If you partition the md device, the partitions will get a device with a
    dynamic minor.

    # mdadm -C /dev/md11 -n 1 -l 1 --force /dev/sdc2
    # mdadm -C /dev/md10 -n 1 -l 1 -e 0 --force /dev/sdc1
    ... create partitions
    # fdisk -l /dev/md10
    ...
    Device Boot Start End Sectors Size Id Type
    /dev/md10p1 2048 22527 20480 10M 83 Linux
    /dev/md10p2 22528 192383 169856 82.9M 83 Linux
    # fdisk -l /dev/md11
    ...
    Device Boot Start End Sectors Size Id Type
    /dev/md11p1 2048 206847 204800 100M 83 Linux
    /dev/md11p2 206848 1757183 1550336 757M 83 Linux
    # cat /sys/block/md10/md10p1/dev
    259:0
    # cat /sys/block/md10/md10p2/dev
    259:1
    # cat /sys/block/md11/md11p1/dev
    259:2
    # cat /sys/block/md11/md11p2/dev
    259:3

    $ grep -A2 '259 block' /Net/git/linux-stable/Documentation/admin-guide/devices.txt
    259 block Block Extended Major
    Used dynamically to hold additional partition minor
    numbers and allow large numbers of partitions per device

    So, to boot to a md device partition (as /) might be a hit and miss
    unless you use some initramfs magic.

    Regards,
    /Karl Hammar

  • From Hoël Bézier@21:1/5 to All on Fri Dec 20 21:40:01 2024
    On Fri, Dec 20, 2024 at 08:19:55 +0000, Alan Mackenzie wrote:
    By the way, do you know an easy way for copying an entire filesystem,
    such as the root system, but without copying other systems mounted in
    it? I tried for some while with rsync and various combinations of
    find's and xargs's, and in the end booted up into the rescue disc to do
    it. I shouldn't have to do that.

    rsync -x / /some-other-place

    From man rsync:
    --one-file-system, -x don’t cross filesystem boundaries

  • From Alan Mackenzie@21:1/5 to karl@aspodata.se on Fri Dec 20 21:30:01 2024
    Hello, Karl.

    On Fri, Dec 20, 2024 at 18:44:53 +0100, karl@aspodata.se wrote:
    Alan Mackenzie:
    On Fri, Dec 20, 2024 at 15:50:53 +0100, karl@aspodata.se wrote:
    ...
    Because I didn't know about it. I found out about it this morning, and immediately tested it by setting up an "md=126,/dev/nvme0n1p4,/dev/nvme1n1p4" on the kernel command line, using the rescue disk to make the "preferred minor"s wrong, and testing it.
    It worked!

    If I understand things correctly, with this mechanism one can have the kernel assemble the RAID arrays at boot up time with a modern metadata,
    but still without needing the initramfs. My arrays are still at
    metadata 0.90.

    Please tell if you make booting with metadata 1.2 work.
    I haven't tested that.

    I've just tried it, with metadata 1.2, and it doesn't work. I got error messages at boot up to the effect that the component partitions were
    lacking valid version 0.0 super blocks.

    People without initramfs appear not to be in the sights of the
    maintainers of this software. They could so easily have made the
    assembly of metadata 1.2 components on the kernel command line work.
    :-(

    By the way, do you know an easy way for copying an entire filesystem,
    such as the root system, but without copying other systems mounted in
    it? I tried for some while with rsync and various combinations of
    find's and xargs's, and in the end booted up into the rescue disc to do
    it. I shouldn't have to do that.

    ///

    ...
    And... what is the need for dynamic minors now when dev_t is 32bits:
    Dynamic minors? I don't think I follow you, here.

    If you partition the md device, the partitions will get a device with a dynamic minor.

    # mdadm -C /dev/md11 -n 1 -l 1 --force /dev/sdc2
    # mdadm -C /dev/md10 -n 1 -l 1 -e 0 --force /dev/sdc1
    ... create partitions
    # fdisk -l /dev/md10
    ...
    Device Boot Start End Sectors Size Id Type
    /dev/md10p1 2048 22527 20480 10M 83 Linux
    /dev/md10p2 22528 192383 169856 82.9M 83 Linux
    # fdisk -l /dev/md11
    ...
    Device Boot Start End Sectors Size Id Type
    /dev/md11p1 2048 206847 204800 100M 83 Linux
    /dev/md11p2 206848 1757183 1550336 757M 83 Linux
    # cat /sys/block/md10/md10p1/dev
    259:0
    # cat /sys/block/md10/md10p2/dev
    259:1
    # cat /sys/block/md11/md11p1/dev
    259:2
    # cat /sys/block/md11/md11p2/dev
    259:3

    $ grep -A2 '259 block' /Net/git/linux-stable/Documentation/admin-guide/devices.txt
    259 block Block Extended Major
    Used dynamically to hold additional partition minor
    numbers and allow large numbers of partitions per device

    So, to boot to a md device partition (as /) might be a hit and miss
    unless you use some initramfs magic.

    OK, thanks for the explanation. My root partition is an entire device, /dev/md126. I've only had problems with it when accidents have
    happened, like yesterday evening.

    Regards,
    /Karl Hammar

    --
    Alan Mackenzie (Nuremberg, Germany).

  • From Alan Mackenzie@21:1/5 to All on Fri Dec 20 22:00:01 2024
    Hello, Hoël.

    On Fri, Dec 20, 2024 at 21:38:42 +0100, Hoël Bézier wrote:
    On Fri, Dec 20, 2024 at 08:19:55 +0000, Alan Mackenzie wrote:
    By the way, do you know an easy way for copying an entire filesystem,
    such as the root system, but without copying other systems mounted in
    it? I tried for some while with rsync and various combinations of
    find's and xargs's, and in the end booted up into the rescue disc to do
    it. I shouldn't have to do that.

    rsync -x / /some-other-place

    From man rsync:
    --one-file-system, -x don’t cross filesystem boundaries

    Thanks! I'll remember that. For some reason I didn't find it when
    searching the rsync man page.

    --
    Alan Mackenzie (Nuremberg, Germany).

  • From karl@aspodata.se@21:1/5 to All on Fri Dec 20 23:10:02 2024
    Alan Mackenzie:
    ...
    By the way, do you know an easy way for copying an entire filesystem,
    such as the root system, but without copying other systems mounted in
    it? I tried for some while with rsync and various combinations of
    find's and xargs's, and in the end booted up into the rescue disc to do
    it. I shouldn't have to do that.

    rsync as other people have suggested.
    There is also
    cp -x
    dump/restore
    find -xdev
    etc.

    You can also do it by accessing the /dev/-file like
    dd if=source of=dest (cp works here also but dd is more the norm).

    ///

    When something is mounted on a mount point, the files below the
    mount point are hidden and the mounted filesystem is available
    instead. Do you want to copy those hidden files as well?

    Regards,
    /Karl Hammar

  • From Alan Mackenzie@21:1/5 to karl@aspodata.se on Sat Dec 21 13:50:01 2024
    Hello, Karl.

    On Fri, Dec 20, 2024 at 23:02:58 +0100, karl@aspodata.se wrote:
    Alan Mackenzie:
    On Fri, Dec 20, 2024 at 18:44:53 +0100, karl@aspodata.se wrote:
    ...
    Please tell if you make booting with metadata 1.2 work.
    I haven't tested that.

    I've just tried it, with metadata 1.2, and it doesn't work. I got error messages at boot up to the effect that the component partitions were lacking valid version 0.0 super blocks.

    People without initramfs appear not to be in the sights of the
    maintainers of this software. They could so easily have made the
    assembly of metadata 1.2 components on the kernel command line work.
    :-(
    ...

    The command-line handling and auto mounting seem to be handled in files
    like (depending on the kernel version, I guess):
    drivers/md/md-autodetect.c
    init/do_mounts_md.c
    you can find the correct file with
    find <kernel top dir> -type f -name \*.c | xargs grep MD_AUTODETECT

    The pertinent functions are mainly in drivers/md/md-autodetect.c and
    md.c (same directory).

    It seems that nowhere does this code try the different metadata formats
    in turn, using the first valid one that it finds. Instead, it expects
    the metadata format to be passed in as an argument to whatever needs it.
    For the md kernel parameter to be able to load metadata versions
    1.[012], the parameter definition would have to be enhanced, somehow.
    Something like:

    md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6
    ^^^^

    , where the extra bit is optional. This enhancement would not be
    difficult. The trouble is more political. I think this code is
    maintained by RedHat. RedHat's customers all use initramfs, so they
    probably think everybody else should, too, hence would be unwilling to
    enhance it for a small group of Gentooers.

    The problem might be that in format 1.2, the superblock is at 4K from
    start, could format 1.1 (where the superblock is at start) work ?

    This doesn't seem to be the problem. The 0.90 superblock is right at
    the end of the partition, for example. There are two functions in md.c, super_90_load and super_1_load which read and verify the super block of
    the given metadata type.

    Despite the 0.90 format being "deprecated", it doesn't appear to be in
    any danger. It was in a deprecated state in 2010, when I started using
    RAID, and I think the maintainers realise that to phase 0.90 out would
    cause a lot of pain and protest. The main limitation with 0.90 that I
    can see is its restriction to 2^32 512-byte blocks per component device.
    This is the 2 terabyte limitation, which isn't a problem for me at the
    moment, but might be for other people with enormous drives.
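
    The arithmetic behind that limit, as a quick sketch: 2^32 blocks of 512
    bytes each is 2^41 bytes, or 2 TiB. The helper name below is
    hypothetical, and the block count is simply the figure stated above,
    not something read from the md sources:

```c
#include <stdint.h>

/* Largest component size implied by a 32-bit count of 512-byte blocks. */
uint64_t md090_max_component_bytes(void)
{
    const uint64_t max_blocks = (uint64_t)1 << 32; /* 2^32 blocks */
    const uint64_t block_size = 512;               /* bytes per block */

    return max_blocks * block_size;                /* 2^41 bytes = 2 TiB */
}
```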

    Nevertheless, I might make the above enhancement, just because.

    Regards,
    /Karl Hammar

    --
    Alan Mackenzie (Nuremberg, Germany).

  • From karl@aspodata.se@21:1/5 to All on Sat Dec 21 17:50:01 2024
    Alan Mackenzie:
    ...
    I've now got working code which assembles a metadata 1.2 RAID array at
    boot time. The syntax needed on the command line is, again,

    md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6

    .. In place of 1.2 can be any of 0.90, 1.0, 1.1, though I haven't tested
    it with anything but 1.2 as yet.
    ...

    Fun! Which kernel, can you send a patch ?

    Regards,
    /Karl Hammar

  • From Alan Mackenzie@21:1/5 to Alan Mackenzie on Sat Dec 21 17:40:01 2024
    Hello again, Karl.

    On Sat, Dec 21, 2024 at 12:43:50 +0000, Alan Mackenzie wrote:
    On Fri, Dec 20, 2024 at 23:02:58 +0100, karl@aspodata.se wrote:
    Alan Mackenzie:
    On Fri, Dec 20, 2024 at 18:44:53 +0100, karl@aspodata.se wrote:
    ...
    Please tell if you make booting with metadata 1.2 work.
    I haven't tested that.

    I've just tried it, with metadata 1.2, and it doesn't work. I got error messages at boot up to the effect that the component partitions were lacking valid version 0.0 super blocks.

    People without initramfs appear not to be in the sights of the maintainers of this software. They could so easily have made the assembly of metadata 1.2 components on the kernel command line work.
    :-(
    ...

    I've now got working code which assembles a metadata 1.2 RAID array at
    boot time. The syntax needed on the command line is, again,

    md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6

    .. In place of 1.2 can be any of 0.90, 1.0, 1.1, though I haven't tested
    it with anything but 1.2 as yet.

    The pertinent functions are mainly in drivers/md/md-autodetect.c and
    md.c (same directory).

    Actually, just in md-autodetect.c.

    [ .... ]

    Nevertheless, I might make the above enhancement, just because.

    Done.

    Regards,
    /Karl Hammar

    --
    Alan Mackenzie (Nuremberg, Germany).

  • From Alan Mackenzie@21:1/5 to karl@aspodata.se on Sat Dec 21 18:00:01 2024
    Hello, Karl.

    On Sat, Dec 21, 2024 at 17:45:13 +0100, karl@aspodata.se wrote:
    Alan Mackenzie:
    ...
    I've now got working code which assembles a metadata 1.2 RAID array at
    boot time. The syntax needed on the command line is, again,

    md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6

    .. In place of 1.2 can be any of 0.90, 1.0, 1.1, though I haven't tested it with anything but 1.2 as yet.
    ...

    Fun! Which kernel, can you send a patch ?

    6.6.62. Patch enclosed. It should apply cleanly from the directory ..../drivers/md.

    Have fun!

    Regards,
    /Karl Hammar

    --
    Alan Mackenzie (Nuremberg, Germany).


    diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
    index b2a00f213c2c..2cd347108284 100644
    --- a/drivers/md/md-autodetect.c
    +++ b/drivers/md/md-autodetect.c
    @@ -124,6 +124,17 @@ static void __init md_setup_drive(struct md_setup_args *args)
    struct mddev *mddev;
    int err = 0, i;
    char name[16];
    + int major_version = 0, minor_version = 90;
    + char *pp;
    + static struct {
    + char *metadata;
    + int major_version;
    + int minor_version;
    + } metadata_table[] =
    + {{"0.90", 0, 90},
    + {"1.0", 1, 0},
    + {"1.1", 1, 1},
    + {"1.2", 1, 2}};

    if (args->partitioned) {
    mdev = MKDEV(mdp_major, args->minor << MdpMinorShift);
    @@ -133,6 +144,21 @@ static void __init md_setup_drive(struct md_setup_args *args)
    sprintf(name, "md%d", args->minor);
    }

    + pp = strchr(devname, ',');
    + if (pp)
    + {
    + *pp = 0;
    + for (i = 1; i < ARRAY_SIZE(metadata_table); i++)
    + if (!strcmp(devname, metadata_table[i].metadata))
    + {
    + major_version = metadata_table[i].major_version;
    + minor_version = metadata_table[i].minor_version;
    + devname = pp + 1;
    + break;
    + }
    + *pp = ',';
    +
  • From Wols Lists@21:1/5 to Alan Mackenzie on Sun Dec 22 13:10:01 2024
    On 20/12/2024 20:19, Alan Mackenzie wrote:
    I've just tried it, with metadata 1.2, and it doesn't work. I got error messages at boot up to the effect that the component partitions were
    lacking valid version 0.0 super blocks.

    People without initramfs appear not to be in the sights of the
    maintainers of this software. They could so easily have made the
    assembly of metadata 1.2 components on the kernel command line work.
    :-(
    No they couldn't. Not if they wanted (at the time) a kernel small enough
    to boot successfully ...

    Making the disk write identically to two disks (your basic 0.9
    mirror) is pretty simple, and also extremely error prone. Making mdraid
    robust with all the other features of an enterprise "protect your data"
    system is a lot more work.

    mdraid has probably just protected my data - dunno what triggered it,
    but I lost a disk and it just got rebuilt in the background without me
    doing a thing ...

    By the way, do you know an easy way for copying an entire filesystem,
    such as the root system, but without copying other systems mounted in
    it? I tried for some while with rsync and various combinations of
    find's and xargs's, and in the end booted up into the rescue disc to do
    it. I shouldn't have to do that.

    Provided it's read-only (so yes if it's the root I might well use a
    rescue disk) I'd use dd. That's assuming a fairly small root that's
    fairly full, it's rather wasteful if it's not ...

    Cheers,
    Wol

  • From Wols Lists@21:1/5 to karl@aspodata.se on Sun Dec 22 13:10:01 2024
    On 20/12/2024 17:44, karl@aspodata.se wrote:
    If I understand things correctly, with this mechanism one can have the
    kernel assemble the RAID arrays at boot up time with a modern metadata,
    but still without needing the initramfs. My arrays are still at
    metadata 0.90.

    Please tell if you make booting with metadata 1.2 work.
    I haven't tested that.

    It is NOT supported. The kernel has no code to do so, you need an
    initramfs. That said, nowadays I believe you can actually load the
    initramfs into the kernel so it's one monolithic blob ...

    By the way, as to the other point of putting /dev/sda etc on the kernel
    command line, it's the kernel that's messing up and scrambling which
    physical disk is which logical sda sdb et al device, so explicitly
    specifying that will have exactly NO effect when your hardware/software
    combo changes again. I guess it was the fact your rescue disk booted
    from CDROM or whatever made THAT sda, and pushed the other disks out of
    the way.

    sda, sdb, sdc et al are allocated AT RANDOM by the kernel. It just so
    happens that the "seed" rarely changes, so in normal use the same values
    happen to get chosen every time - until something DOES change, and then
    you wonder why everything falls over. The same is also true of md127,
    md126 et al. If your raid counts up from md1, md2 etc then those I
    believe are stable, but I haven't seen them for pretty much the entire
    time I've been involved in mdraid (maybe a decade or so?)

    You need to use those UUID/GUID things. I know it's a hassle finding out whether it's a guid or a uuid, and what it is, and all that crud, but
    trust me they don't change, you can shuffle your disks, stick in another
    SATA card, move it from SATA to USB (BAD move - don't even think of it
    !!!), and the system will still find the correct disk.

    Cheers,
    Wol

  • From Alan Mackenzie@21:1/5 to Alan Mackenzie on Sun Dec 22 14:10:01 2024
    Hello again!

    On Sat, Dec 21, 2024 at 16:58:59 +0000, Alan Mackenzie wrote:
    Hello, Karl.

    On Sat, Dec 21, 2024 at 17:45:13 +0100, karl@aspodata.se wrote:
    Alan Mackenzie:
    ...
    I've now got working code which assembles a metadata 1.2 RAID array at boot time. The syntax needed on the command line is, again,

    md=124,1.2,/dev/nvme0n1p6,/dev/nvme1n1p6

    .. In place of 1.2 can be any of 0.90, 1.0, 1.1, though I haven't tested it with anything but 1.2 as yet.
    ...

    Fun! Which kernel, can you send a patch ?

    6.6.62. Patch enclosed. It should apply cleanly from the directory ..../drivers/md.

    There was an error in yesterday's patch. For some reason I can't
    fathom, I'd started a loop with

    for (i = 1; ....)

    in place of the correct

    for (i = 0; ....)

    .. The consequence was that the driver would not recognise "0.90" when
    given explicitly in the kernel command line, for example as

    md=126,0.90,/dev/nvme0n1p4,/dev/nvme1n1p4

    .. Please use the enclosed patch in place of that patch from yesterday.
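
    For anyone who wants to try the parsing logic outside the kernel, here
    is a minimal userspace sketch of the same table lookup. It is not the
    patch itself: parse_md_metadata is a made-up name, and unlike the patch
    it compares lengths instead of temporarily overwriting the comma. The
    loop starts at index 0, so an explicit "0.90" is recognised too:

```c
#include <stddef.h>
#include <string.h>

struct md_metadata {
    const char *name;
    int major;
    int minor;
};

static const struct md_metadata metadata_table[] = {
    {"0.90", 0, 90},
    {"1.0", 1, 0},
    {"1.1", 1, 1},
    {"1.2", 1, 2},
};

/*
 * If the text before the first comma names a known metadata version,
 * store it in *major / *minor and return a pointer to the device list
 * that follows. Otherwise fall back to 0.90 and return the input as is.
 */
const char *parse_md_metadata(const char *devlist, int *major, int *minor)
{
    const char *comma = strchr(devlist, ',');
    size_t i, len;

    *major = 0;
    *minor = 90;
    if (!comma)
        return devlist;

    len = (size_t)(comma - devlist);
    for (i = 0; i < sizeof(metadata_table) / sizeof(metadata_table[0]); i++) {
        if (strlen(metadata_table[i].name) == len &&
            strncmp(devlist, metadata_table[i].name, len) == 0) {
            *major = metadata_table[i].major;
            *minor = metadata_table[i].minor;
            return comma + 1;
        }
    }
    return devlist;
}
```

    Given "1.2,/dev/nvme0n1p6,/dev/nvme1n1p6" this yields version 1.2 and
    the two device names; given a plain device list it leaves the default
    of 0.90 in place.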

    Thanks!

    Have fun!

    Regards,
    /Karl Hammar

    --
    Alan Mackenzie (Nuremberg, Germany).


    diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
    index b2a00f213c2c..6bd6e9177969 100644
    --- a/drivers/md/md-autodetect.c
    +++ b/drivers/md/md-autodetect.c
    @@ -124,6 +124,17 @@ static void __init md_setup_drive(struct md_setup_args *args)
    struct mddev *mddev;
    int err = 0, i;
    char name[16];
    + int major_version = 0, minor_version = 90;
    + char *pp;
    + static struct {
    + char *metadata;
    + int major_version;
    + int minor_version;
    + } metadata_table[] =
    + {{"0.90", 0, 90},
    + {"1.0", 1, 0},
    + {"1.1", 1, 1},
    + {"1.2", 1, 2}};

    if (args->partitioned) {
    mdev = MKDEV(mdp_major, args->minor << MdpMinorShift);
    @@ -133,6 +144,21 @@ static void __init md_setup_drive(struct md_setup_args *args)
    sprintf(name, "md%d", args->minor);
    }

    + pp = strchr(devname, ',');
    + if (pp)
    + {
    + *pp = 0;
    + for (i = 0; i < ARRAY_SIZE(metadata_table); i++)
    + if (!strcmp(devname, metadata_table[i].metadata))
    + {
    + major_version = metadata_table[i].major_version;
    + minor_version = metadata_table[i].minor_version;
    + devname = pp + 1;
    + break;
    + }
    + *pp = ',';
  • From Wols Lists@21:1/5 to Alan Mackenzie on Sun Dec 22 13:20:01 2024
    On 21/12/2024 12:43, Alan Mackenzie wrote:
    , where the extra bit is optional. This enhancement would not be
    difficult. The trouble is more political. I think this code is
    maintained by RedHat. RedHat's customers all use initramfs, so they
    probably think everybody else should, too, hence would be unwilling to enhance it for a small group of Gentooers.

    Let's blame RedHat again ... I think you're wrong.

    There's a fair few SUSE people in there. The person who did nearly all
    of the heavy lifting before he stepped down was SuSE. A lot of the
    "senior" team (just a couple of people, as per normal) are Far Eastern,
    I'm not sure of their company affiliation.

    About the only person I'm confident IS RedHat is the guy maintaining
    mdadm, which is not mdraid (it's the management tool, not the "do the
    work" tool).

    Cheers,
    Wol

  • From Alan Mackenzie@21:1/5 to Wols Lists on Sun Dec 22 14:50:01 2024
    Hello, Wol.

    On Sun, Dec 22, 2024 at 12:02:49 +0000, Wols Lists wrote:
    On 20/12/2024 17:44, karl@aspodata.se wrote:
    If I understand things correctly, with this mechanism one can have the
    kernel assemble the RAID arrays at boot up time with a modern metadata,
    but still without needing the initramfs. My arrays are still at
    metadata 0.90.

    Please tell if you make booting with metadata 1.2 work.
    I haven't tested that.

    It is NOT supported. The kernel has no code to do so, you need an
    initramfs. That said, nowadays I believe you can actually load the
    initramfs into the kernel so it's one monolithic blob ...

    With my patch from yesterday (corrected today), you can indeed instruct
    the kernel to assemble RAID devices with metadata 1.2. It wasn't a
    difficult patch by any means. One wonders why the md kernel team hadn't
    done it a long time ago.

    initramfs's are ugly ungainly things, often many times larger than the
    kernel itself, and appear not to have been well thought out. They are
    surely a source of complication and error, and are best avoided, if
    possible. I've never actually built one myself, and will go to some
    lengths, like hacking the kernel, to avoid it.

    By the way, as to the other point of putting /dev/sda etc on the kernel command line, it's the kernel that's messing up and scrambling which
    physical disk is which logical sda sdb et al device, so explicitly
    specifying that will have exactly NO effect when your hardware/software
    combo changes again.

    /dev/sda (or, in my case, /dev/nvme0n1), etc. don't, in my experience,
    get scrambled by the kernel. They're plugged into the same sockets on
    the motherboard from day to day, so unless you're physically inserting
    or removing them, you won't have trouble.

    I guess it was the fact your rescue disk booted from CDROM or whatever
    made THAT sda, and pushed the other disks out of the way.

    No, you've misunderstood my situation. What got scrambled by the rescue
    disc was the assignment of /dev/md127 and /dev/md126. This has been
    solved by explicitly specifying the assignment with md parameters in the
    kernel command line. So now my system boots just fine, even after the assignment of the devices (the "preferred-minor" field in the MD
    superblock) has been scrambled by the rescue disk.

    sda, sdb, sdc et al are allocated AT RANDOM by the kernel.

    Only in the sense that it may be difficult on a new machine to predict
    in advance which physical HDD becomes which sdx. As I said, the
    assignment of physical drives to logical devices is repeatable, and
    doesn't change from day to day.

    It just so happens that the "seed" rarely changes, so in normal use
    the same values happen to get chosen every time - until something DOES change, and then you wonder why everything falls over. The same is
    also true of md127, md126 et al. If your raid counts up from md1, md2
    etc then those I believe are stable, but I haven't seen them for
    pretty much the entire time I've been involved in mdraid (maybe a
    decade or so?)

    You need to use those UUID/GUID things. I know it's a hassle finding out whether it's a guid or a uuid, and what it is, and all that crud, but
    trust me they don't change, you can shuffle your disks, stick in another
    SATA card, move it from SATA to USB (BAD move - don't even think of it
    !!!), and the system will still find the correct disk.

    The trouble being that a kernel command line, or /etc/fstab, using lots
    of these is not human readable, and hence is at the edge of
    unmaintainability. This maintenance difficulty surely outweighs the
    rare situation where the physical->logical assignment changes due to a
    broken drive. That's what we've got rescue disks for.

    Cheers,
    Wol

    --
    Alan Mackenzie (Nuremberg, Germany).

  • From Peter Humphrey@21:1/5 to All on Sun Dec 22 16:30:01 2024
    On Sunday 22 December 2024 13:43:08 GMT Alan Mackenzie wrote:

    The trouble [is] that a kernel command line, or /etc/fstab, using lots
    of these is not human readable, and hence is at the edge of
    unmaintainability. This maintenance difficulty surely outweighs the
    rare situation where the physical->logical assignment changes due to a
    broken drive. That's what we've got rescue disks for.

    Hear, hear! I never could understand why everyone seems to want to jump
    onto that band-wagon.

    --
    Regards,
    Peter.

  • From Wols Lists@21:1/5 to Peter Humphrey on Sun Dec 22 18:00:07 2024
    On 22/12/2024 15:29, Peter Humphrey wrote:
    On Sunday 22 December 2024 13:43:08 GMT Alan Mackenzie wrote:

    The trouble [is] that a kernel command line, or /etc/fstab, using lots
    of these is not human readable, and hence is at the edge of
    unmaintainability. This maintenance difficulty surely outweighs the
    rare situation where the physical->logical assignment changes due to a
    broken drive. That's what we've got rescue disks for.

    Hear, hear! I never could understand why everyone seems to want to jump
    onto that band-wagon.

    I have no problem with you saying all this long guid crap makes stuff
    unreadable (and yes, I agree, unreadable and unmaintainable aren't that
    far different) BUT

    surely outweighs the rare situation where the physical->logical
    assignment changes

    THAT DEPENDS ON YOUR HARDWARE!

    For normal consumer grade hardware, I agree. I've never known it change
    unless I've been mucking about with add-in SATA, PATA, whatever cards.

    BUT. Especially on big server-grade hardware, where there's lots of trip
    switches so stuff doesn't all power up in one huge spike (and I've
    worked with such), different parts of the system come up in a completely
    random order, and drives re-order themselves pretty much every single
    boot!

    So yes, with our consumer hardware I'd agree with you. But the people
    paying big bills for reliable top-range hardware would wonder what
    you're smoking!

    Cheers,
    Wol

  • From Alan Mackenzie@21:1/5 to Wols Lists on Sun Dec 22 21:10:01 2024
    Hello, Wol.

    On Sun, Dec 22, 2024 at 16:53:17 +0000, Wols Lists wrote:
    On 22/12/2024 15:29, Peter Humphrey wrote:
    On Sunday 22 December 2024 13:43:08 GMT Alan Mackenzie wrote:

    The trouble [is] that a kernel command line, or /etc/fstab, using lots
    of these is not human readable, and hence is at the edge of
    unmaintainability. This maintenance difficulty surely outweighs the
    rare situation where the physical->logical assignment changes due to a
    broken drive. That's what we've got rescue disks for.

    Hear, hear! I never could understand why everyone seems to want to jump onto
    that band-wagon.

    I have no problem with you saying all this long guid crap makes stuff
    unreadable (and yes, I agree, unreadable and unmaintainable aren't that
    far different) BUT

    surely outweighs the rare situation where the physical->logical
    assignment changes

    THAT DEPENDS ON YOUR HARDWARE!

    For normal consumer grade hardware, I agree. I've never known it change
    unless I've been mucking about with add-in SATA, PATA, whatever cards.

    This is the desirable state of affairs.

    BUT. Especially on big server-grade hardware, where there's lots of trip
    switches so stuff doesn't all power up in one huge spike (and I've
    worked with such), different parts of the system come up in a completely
    random order, and drives re-order themselves pretty much every single
    boot!

    So all this 32 hex digit UUID stuff is a workaround for the
    unpredictability of server hardware. What seems to be missing is a way
    of associating a given disk socket on the motherboard with /dev/sda.
    Instead we have to put up with "content addressing".

    So yes, with our consumer hardware I'd agree with you. But the people
    paying big bills for reliable top-range hardware would wonder what
    you're smoking!

    I think any system admins reading this would long for the predictability
    of "consumer hardware", having too often been confronted with
    indistinguishable 32 hex digit identifiers. I would imagine it quite
    likely that the said admins have written scripts to make this more
    manageable.

    Cheers,
    Wol

    --
    Alan Mackenzie (Nuremberg, Germany).

  • From Frank Steinmetzger@21:1/5 to All on Mon Dec 30 05:20:01 2024
    On Fri, Dec 20, 2024 at 11:02:55PM +0100, karl@aspodata.se wrote:
    Alan Mackenzie:
    ...
    By the way, do you know an easy way for copying an entire filesystem,
    such as the root system, but without copying other systems mounted in
    it? I tried for some while with rsync and various combinations of
    find's and xargs's, and in the end booted up into the rescue disc to do
    it. I shouldn't have to do that.

    rsync as other people have suggested.
    There is also
    cp -x
    dump/restore
    find -xdev
    etc.

    You can also do it by accessing the /dev/-file like
    dd if=source of=dest (cp works here also but dd is more the norm).

    ///

    When something is mounted on a mount point, the files below the
    mount point are hidden and the mounted filesystem will be available
    instead. Do you want to copy those hidden files also?

    To get around this, I usually bind-mount the filesystem to another
    directory first. I usually only do this when I’m dealing with /, as my
    FS structure is not complex:

    mount --bind / /mnt/bind
    rsync -axAHX /mnt/bind/ /path/to/destination/
    (-x is not needed then, but it’s part of muscle memory)
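
    For comparison, the cp -x route mentioned above can be sketched as a
    small helper. This is only an illustration, assuming GNU cp; the
    function name copy_one_fs is hypothetical. Unlike the bind-mount
    approach, it does not pick up files hidden underneath mount points:
    with GNU cp the mount point directories themselves are copied, but
    come out empty.

```shell
#!/bin/sh
# copy_one_fs SRC DEST  (hypothetical helper, assumes GNU cp)
# Copies the tree rooted at SRC into DEST without descending into
# other filesystems mounted below SRC.
copy_one_fs() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -a: archive mode (recursive; keeps modes, ownership, timestamps, links)
    # -x: stay on the filesystem that SRC itself lives on
    # "SRC/." copies the contents of SRC, including dot files
    cp -ax "$src/." "$dest/"
}
```

    A typical call, run as root, might be copy_one_fs / /mnt/newroot,
    where /mnt/newroot is an assumed, already mounted target.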

    --
    Grüße | Greetings | Salut | Qapla’
    Please do not share anything from, with or about me on any social network.

    Keyboard not connected, press F1 to continue.

  • From Steven Lembark@21:1/5 to All on Wed Dec 25 22:20:01 2024
    I think any system admins reading this would long for the
    predictability of "consumer hardware", having too often been
    confronted with indistinguishable 32 hex digit identifiers. I would
    imagine it quite likely that the said admins have written scripts to
    make this more manageable.

    Simple fix: use LVM, let it deal with the UUID. At that point
    the PV's get UUID's, the VG's get UUID's, the LV's get UUID's
    and you never have to type or see or use them.

    Snippet from my /etc/fstab:

    /dev/vg00/root / xfs ...
    /dev/vg00/var /var xfs ...
    /dev/vg00/var-tmp /var/tmp xfs ...

    This is basically the same fstab on my server and notebook; it hasn't
    changed in the transitions from ATA to SATA to SCSI to SAS to
    NVMe.

    If you want mirroring then either create a mirror with mdadm
    and use it as a PV -- kernel will auto-start the mirror, vgscan
    will find it, and voilà!, it's up -- or use -m2 and
    mirror/stripe/RAID5/whatever using lvcreate to spread the data across
    whatever you like.
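
    The first route (an mdadm mirror used as a PV) might look roughly
    like the following. This is a dry-run sketch, not the poster's actual
    setup: /dev/md0, the 20G size and the helper names are assumptions,
    and the commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# run CMD...  (hypothetical helper)
# Prints the command by default; executes it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# setup_lvm_on_md MD_DEVICE VG_NAME  (hypothetical helper)
# MD_DEVICE is an already assembled mdadm mirror.
setup_lvm_on_md() {
    md=$1
    vg=$2
    run pvcreate "$md"                  # make the mirror a physical volume
    run vgcreate "$vg" "$md"            # create the volume group on it
    run lvcreate -L 20G -n root "$vg"   # carve out a logical volume (size assumed)
    run mkfs.xfs "/dev/$vg/root"        # matches the xfs entries in the fstab above
}

setup_lvm_on_md /dev/md0 vg00
```

    After that, /etc/fstab can refer to /dev/vg00/root by name, as in the
    snippet above, without any UUID appearing in it.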

    Here I have two NVMe drives (used to be SCSI, then SAS) which are
    mirrored for vg00 with the root, var and home filesystems, and another
    that's striped for /var/tmp and other scratch spaces.

    This gives an overview:

    https://speakerdeck.com/lembark/its-only-logical-lvm-for-linux

    --
    Steven Lembark
    Workhorse Computing
    lembark@wrkhors.com
    +1 888 359 3508
