Bad initrd generation for non-default snapshot when using systemd-boot and dracut modules: mdraid, dracut-sshd #136

Open
prawilny opened this issue Dec 29, 2024 · 16 comments · May be fixed by openSUSE/sdbootutil#183

Comments


prawilny commented Dec 29, 2024

Hello,
I seem to have encountered a peculiar problem: a system upgrade caused by timer-triggered transactional-update.service generated an initrd that was missing some dracut modules (in particular: sshd and mdraid).

After the problematic update, I booted the system, entered the password using the console rather than remotely, found the currently used initrd, and dumped output of lsinitrd when specifying it as an argument.
The (bad) result:
lsinitrd.bad.txt

Then I just ran transactional-update initrd and it generated the initrd that contained the missing modules:
The (good) result:
lsinitrd.good.txt

Note that both initrds seem to have been generated with the same dracut command (Arguments: --quiet --reproducible --force --tmpdir '/var/tmp' in the log files).

After a reboot, the module was present and I managed to successfully use SSH to decrypt the drive.
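A minimal sketch of how the comparison between the two lsinitrd dumps could be automated. The helper name missing_modules is hypothetical; it only assumes that the module list arrives one name per line, which is roughly what lsinitrd -m prints after its header lines.

```shell
#!/bin/sh
# missing_modules (hypothetical helper): read a dracut module list on stdin,
# one name per line, and print every required module (given as arguments)
# that does not appear in it.
missing_modules() {
    present=$(cat)
    for want in "$@"; do
        printf '%s\n' "$present" | grep -qxF -- "$want" || printf '%s\n' "$want"
    done
}

# Possible usage on a real system (the image path is a placeholder):
#   lsinitrd -m /path/to/initrd | missing_modules mdraid sshd
```

An empty result means every required module is listed. A module being listed does not by itself guarantee that its files were installed.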

My setup:

  • openSUSE MicroOS 20241220 (it's the update that seems to have broken the system) x86_64, UEFI
  • systemd-boot as the bootloader
  • partition layout:
    • /: BTRFS RAID1 setup on two LUKS-encrypted partitions of two SSDs
    • /boot: ext4 mdadm-RAID1 setup on the same (but unencrypted) drives
    • /boot/efi: single drive vfat on one of the drives
  • https://github.com/gsauthof/dracut-sshd for remote decryption of LUKS-encrypted drives:

Is the whole issue caused by some misconfiguration I did?
How can I check it?
I already checked that when using transactional-update shell, bash sees /usr/lib/dracut/modules.d which is where sshd is stored within the system.

Where should I look for the documentation that could help me puzzle it out?
Mainly I'd appreciate pointing me to the component that is likely to be the one calling dracut and/or ideally some documentation/explanation what parts of the filesystem that caller should see.

Please point me to a better place for such a request for help if here isn't an appropriate one.
Of course, I can provide some more logs if they are needed.

edit: I messed up attaching logs, fixed it now.

Author

prawilny commented Jan 2, 2025

I found some time to look at the sources of this project, related sdbootutil hooks, and suse-module-tools and couldn't find any suspicious code that would seem likely to cause the strange behavior.
For now, I'll just live without restarting my homeserver, hoping to catch the problem the next time it happens.

Still, I'd greatly appreciate some pointers where to look.

Collaborator

aplanas commented Jan 2, 2025

Hi @prawilny, sorry for the delay. These dates are always a bit more complicated.

You are right, the one that generates the dracut call is sdbootutil. Are you always able to reproduce the issue via the transactional-update.service? My suspicion is that sdbootutil is somehow creating the initrd from the wrong place (even though there is code that should do the right thing), so I think this is the first place to look.

Something that we can check is to disable the service and call transactional-update dup and check if there is a new snapshot created. This will also create a new initrd, that I would assume will be wrong (please check). The idea is to call manually bash -x sdbootutil mkinitrd to generate a new initrd directly and get the trace (logs) of the output. Ideally it should be calling dracut from a chroot situated in the new snapshot.

Do you know if sshd and mdraid are usually included, or do you have a specific configuration in dracut.d to add them?

Author

prawilny commented Jan 2, 2025

@aplanas, thank you for your response. Of course I know these days are free for many - that is the very reason why I found some time to tinker.

To answer your questions:

  • as I wrote in the post, I didn't manage to reproduce the problem reliably - it happened about two times (always after an automatic update) and I don't remember what I did the first time, but the second time just running transactional-update mkinitrd fixed the problem (it generated an appropriate entry in /boot/efi//loader/entries/opensuse-microos-6.12.6-1-default-35.conf (the path has two slashes in bootctl output, I copied it verbatim) that pointed to an initrd file with both previously missing modules present; note that before running the fix command, the same file pointed to the wrong initrd file with the modules missing).
  • as for the modules in initrd:
    • I don't know what exactly causes mdraid to be included, since it's not mentioned explicitly in either /etc/dracut or /etc/dracut.conf.d/* (but it makes sense - I'm running mdraid for /boot after all); my best guess would be probing the modules of the running system
    • sshd-dracut is packaged like this: https://build.opensuse.org/projects/openSUSE:Factory/packages/dracut-sshd/files/dracut-sshd.spec. I have /usr/lib/dracut/modules.d/46sshd directory present and filled on my machine. The documentation claims that "once present under /usr/lib/dracut/modules.d it's enabled, by default".
    • apart from that I'm adding systemd-networkd and its config in drop-in configs
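A small sketch for checking which initrd a given loader entry points to. The helper name entry_initrd is hypothetical; it relies only on the Boot Loader Specification entry format, where each line is a key followed by a value and the key for the initrd is initrd.

```shell
#!/bin/sh
# entry_initrd (hypothetical helper): print the path(s) named by the
# "initrd" key(s) of a Boot Loader Specification entry file.
entry_initrd() {
    awk '$1 == "initrd" { print $2 }' "$1"
}

# Possible usage, assuming the printed path is relative to the ESP
# mounted at /boot/efi:
#   lsinitrd "/boot/efi$(entry_initrd /boot/efi/loader/entries/opensuse-microos-6.12.6-1-default-35.conf)"
```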

I have some questions of my own:

  • What do you suspect a difference between running sdbootutil mkinitrd and in transactional-update could be? It seemed to me that the call wasn't wrapped in any kind of transaction in transactional-update script. Or did I read it wrong and missed something?
  • Do I reason sensibly that adding an ExecStart override like this: ExecStart=bash -x /usr/sbin/transactional-update cleanup ${UPDATE_METHOD} reboot would be a good idea for debugging since it's the script that calls sdbootutil? Or is my inexperience in writing unit files showing and that is a wrong move?
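A sketch of how such an override could be written as a systemd drop-in. The function name and the drop-in file name override.conf are assumptions; the command line is the one quoted above. An empty ExecStart= line comes first because systemd otherwise appends to the inherited command list (or rejects a second command for unit types other than oneshot), and a daemon-reload is needed before the change takes effect.

```shell
#!/bin/sh
# write_debug_override (hypothetical helper): write a drop-in that replaces
# the service command with a traced one. The target directory can be passed
# as the first argument, which makes a dry run in a scratch directory easy.
write_debug_override() {
    dropin_dir=${1:-/etc/systemd/system/transactional-update.service.d}
    mkdir -p "$dropin_dir" || return 1
    # Quoted heredoc delimiter: ${UPDATE_METHOD} must reach the unit file
    # literally so that systemd, not this shell, expands it.
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# An empty ExecStart= clears the command inherited from the packaged unit.
ExecStart=
ExecStart=bash -x /usr/sbin/transactional-update cleanup ${UPDATE_METHOD} reboot
EOF
}

# Possible usage on the real system (as root):
#   write_debug_override && systemctl daemon-reload
```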

Also, I'll try to reproduce the problem on the next kernel upgrade (it should trigger initrd regeneration, right?) since I'm a bit afraid of rolling back to a snapshot older than today, assuming that I might've forgotten something I did in the period in between.

Collaborator

aplanas commented Jan 2, 2025

What do you suspect a difference between running sdbootutil mkinitrd and in transactional-update could be?

None that I can think of. But there is a difference of sdbootutil mkinitrd when the default snapshot is the current one or is a different one. This happens when there is an update and a pending reboot. When the active snapshot is different from the default one (so the system has been updated and the reboot did not happen yet) sdbootutil mkinitrd will call dracut from a chroot. If the active and the default one are the same, this chroot is not happening.
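To make the distinction concrete, here is a rough sketch of how the active and default snapshot numbers could be compared. The helper names are hypothetical and this is not sdbootutil's actual code; it only assumes the usual snapper layout, where the subvolume path contains /.snapshots/<number>/snapshot.

```shell
#!/bin/sh
# snapshot_id (hypothetical helper): extract the snapshot number from a line
# containing a path such as "@/.snapshots/37/snapshot".
snapshot_id() {
    sed -n 's|.*/\.snapshots/\([0-9][0-9]*\)/snapshot.*|\1|p'
}

# needs_chroot (hypothetical helper): succeed when the two snapshot numbers
# differ, which is the situation where dracut would be run from a chroot.
needs_chroot() {
    [ "$1" != "$2" ]
}

# Possible usage on a real system:
#   active=$(findmnt -n -o FSROOT / | snapshot_id)
#   default=$(btrfs subvolume get-default / | snapshot_id)
#   needs_chroot "$active" "$default" && echo "dracut would run in a chroot"
```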

Do I reason sensibly that adding an ExecStart override like this: ...

Yes, this will work too. But to control the situation better, I think it is better to disable the service and do the update manually. After the update we will be in the situation where we need to reboot to activate the new snapshot. This is the same situation as when the service is running and a new initrd is created from the old snapshot.

We can try to simulate this too, booting from an old snapshot but keeping the default one as such, and trying to create the initrd for the new one from the old one.

Author

prawilny commented Jan 3, 2025

What you said makes a lot of sense, so I went ahead and tried to reproduce using the new 20250102 snapshot.
I think I did succeed:

  • Yesterday I disabled transactional-update.timer
  • Today I ran manually sudo systemctl start transactional-update.service
    • I ran it from snapshot 35 and it created snapshot 37 (in between I mistakenly ran transactional-update directly instead of through the service and didn't let the command finish, which left behind snapshot 36, interrupted during the update)
    • the bootloader entry of the new snapshot points to an initrd image missing sshd and mdraid.
    • no logs of this step since I did add the bash -x systemd unit override, but I forgot to run daemon-reload.
  • Then I ran two commands still in the 35 snapshot (the old one).
    • sudo bash -x /usr/bin/sdbootutil mkinitrd 2>&1 | tee -a sudo.sdbootutil. Logs
    • sudo bash -x /usr/sbin/transactional-update initrd 2>&1 | tee -a sudo.transactional.update.initrd. Logs.
    • both the commands seem to have only rewritten the initrd of the 35 (old) snapshot.

So I think the solution to the mystery of the different results is simple - I didn't realize I was running the commands from different snapshots.
Note that this explanation may be wrong - I only skimmed the logs and I don't understand the whole program.

Still, there remains a problem - why is a wrong initrd image generated in the first place? Can you give me some pointers how to debug it? I think the most important hint would be to point me to the place in code that generates initrd for the new snapshot (I still feel that the sdbootutil call is done in the old snapshot context, but probably I'm missing something).

In the short term, do you have any idea for a workaround? I think a way to regenerate initrd for a new snapshot from the old one would suffice for now.

PS I also took a look at the initrds from previous snapshots and it looks like some of them do have sshd and mdraid and some don't - it looks random to me.

Collaborator

aplanas commented Jan 7, 2025

@prawilny I had more time to dig into this issue:

  • As I see from the lsinitrd output that you attached in the first comment, only mdraid is missing, and both have sshd. Can you confirm that the issue is only in mdraid?
  • Regarding sudo bash -x /usr/bin/sdbootutil mkinitrd 2>&1 | tee -a sudo.sdbootutil: you can create the initrd for 37 from 35 using sudo bash -x /usr/bin/sdbootutil mkinitrd 37 2>&1 | tee -a sudo.sdbootutil, or in the more general case sudo bash -x /usr/bin/sdbootutil --default-snapshot mkinitrd 2>&1 | tee -a sudo.sdbootutil. This is what the plugins do when calling sdbootutil mkinitrd.
  • Regarding sudo bash -x /usr/sbin/transactional-update initrd 2>&1 | tee -a sudo.transactional.update.initrd: this command creates an initrd for the current snapshot. The command behaves differently when grub2-efi is used (a new snapshot is created and the new initrd is placed there) and when systemd-boot is used (/boot/efi is now outside the snapshot). The difference is subtle, but in this case it explains why the initrd is created for 35.

Can you confirm that sshd is missing in other initrds? As commented in the attached logs the module is present in both. I am trying to reproduce this issue with dracut-sshd but so far I am not able. I understand that you have a RAID configuration?

Author

prawilny commented Jan 7, 2025

@aplanas, responding to you a point at a time:

  • sshd is listed as a module, but its files (for example, usr/sbin/sshd or etc/systemd/system/sysinit.target.wants/sshd.service) are missing - thus the fact that it's listed as a module matters little since it just doesn't work in practice.
    • reading it after writing the rest of the message - it makes sense since it's included because its check() returned 0, but then wasn't really included since its install() returned 1 before installing anything
  • yes, I have a BTRFS RAID1 for / (system), BTRFS RAID1 for /hdd (data), mdraid RAID1 (ext4) for /boot. My /boot/efi is not in any RAID.
  • thank you for your response with regard to the sdbootutil command that will regenerate initrd. I should've found it myself. Still, I hope that trace of my execution command will help you find the bug (or maybe even better, my misconfiguration).

I ran sudo bash -x /usr/bin/sdbootutil mkinitrd 37 2>&1 | tee -a sudo.sdbootutil. Here are the logs.
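Since a module can be listed without its files being installed, checking for the files themselves is more reliable. A sketch, with a hypothetical helper name, that scans an lsinitrd file listing for the paths mentioned above; it assumes the listing resembles ls -l output, with the path as a whitespace-separated field.

```shell
#!/bin/sh
# check_initrd_files (hypothetical helper): read an lsinitrd file listing on
# stdin and report every expected path (given as arguments) that does not
# appear as a whole field. Returns non-zero if anything is missing.
check_initrd_files() {
    listing=$(cat)
    status=0
    for path in "$@"; do
        if ! printf '%s\n' "$listing" |
            awk -v p="$path" '{ for (i = 1; i <= NF; i++) if ($i == p) found = 1 }
                              END { exit !found }'
        then
            printf 'missing: %s\n' "$path"
            status=1
        fi
    done
    return "$status"
}

# Possible usage (the image path is a placeholder):
#   lsinitrd /path/to/initrd | check_initrd_files usr/sbin/sshd \
#       etc/systemd/system/sysinit.target.wants/sshd.service
```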
Here's also sudo lsinitrd /boot/efi/opensuse-microos/6.12.6-1-default/initrd-dedd57640252a91e9a750b9d4063228093d40527 output. As you can see, mdraid is still missing.
I took a look at how dracut checks whether to include it:

# part of /lib/dracut/modules.d/90mdraid/module-setup.sh

check() {
    local dev holder

    # No mdadm?  No mdraid support.
    require_binaries mdadm expr || return 1

    [[ $hostonly ]] || [[ $mount_needs ]] && {
        for dev in "${!host_fs_types[@]}"; do
            [[ ${host_fs_types[$dev]} != *_raid_member ]] && continue

            DEVPATH=$(get_devpath_block "$dev")

            for holder in "$DEVPATH"/holders/*; do
                [[ -e $holder ]] || continue
                [[ -e "$holder/md" ]] && return 0
                break
            done

        done
        return 255
    }

    return 0
}

So it seems that it's broken by chrooting to the snapshot.
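The inner loop of that check can be tried in isolation. The sketch below mirrors it in a hypothetical standalone function that takes a sysfs-style device directory; it does not cover how dracut fills host_fs_types, which is the part that decides whether the loop runs for the RAID member devices at all.

```shell
#!/bin/sh
# has_md_holder (hypothetical helper): succeed if the first existing entry
# under <devpath>/holders has an "md" subdirectory, mirroring the loop in
# the check() function quoted above.
has_md_holder() {
    devpath=$1
    for holder in "$devpath"/holders/*; do
        [ -e "$holder" ] || continue
        [ -e "$holder/md" ] && return 0
        break
    done
    return 1
}

# Possible usage on a real system (the device name is a placeholder):
#   has_md_holder /sys/class/block/sda2 && echo "sda2 is held by an md array"
```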

As to dracut-sshd, sdbootutil output seems to be quite clear:

+ chroot /.snapshots/37/snapshot dracut --quiet --reproducible --force --tmpdir /var/tmp /tmp/sdbootutil.BXP6iZ/initrd-0 6.12.6-1-default
dracut[F]: No authorized_keys for root user found!

See https://github.com/gsauthof/dracut-sshd/blob/5d9d6893fda21bce99c26af52aab9985339ab63f/46sshd/module-setup.sh#L40.

Also, only when writing this message did I check the version and it turns out that the version in tumbleweed of the package is oldish (0.6.1, about 5 years old in comparison to 6 months old latest 0.6.7) and doesn't support putting keys in /etc.
(I'll probably need to go and figure out whether to contact the maintainer, try to make a patch for it in OBS (either bumping the version or just adding another authorized_keys path to be used by the installation script), or even do something else.)

The log output makes sense since the allowed locations for the keys are as per the documentation:

/root/.ssh/dracut_authorized_keys
/root/.ssh/authorized_keys
/etc/dracut-sshd/authorized_keys # Not available in version present in openSUSE repos!

and since I configured the plugin before the whole debugging started, I ended up putting them under /root, which works when using transactional-update initrd and doesn't when using sdbootutil mkinitrd --default-snapshot (since /root is another BTRFS subvolume).
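The effect can be reproduced with a small sketch that looks for the key file under a given root directory. The helper name is hypothetical and the search order simply follows the list above, which may differ from the module's real order; the /etc location is included even though the packaged version does not support it.

```shell
#!/bin/sh
# find_authorized_keys (hypothetical helper): print the first candidate key
# file that exists below the given root directory, or fail if none does.
find_authorized_keys() {
    root=${1%/}
    for candidate in \
        /root/.ssh/dracut_authorized_keys \
        /root/.ssh/authorized_keys \
        /etc/dracut-sshd/authorized_keys
    do
        if [ -e "$root$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Possible usage: compare the running system with the snapshot tree.
#   find_authorized_keys /
#   find_authorized_keys /.snapshots/37/snapshot
```

With the keys stored under /root, the first call would succeed and the second would fail unless the /root subvolume is mounted inside the snapshot tree.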

I'm going to live with the mdraid problem just by adding add_dracutmodules+=" mdraid " to /etc/dracut.conf.d/mdadm.conf.
As to ssh, I'll probably just patch /lib/dracut/modules.d/46sshd/module-setup.sh on my machine (it's just my home server, I don't rely on its availability) to use /etc path (hoping that when the new version lands in the repository it's going to contain the fix (be it from my patch or a version bump)).
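A sketch of the drop-in workaround. It assumes the dracut module is named mdraid, matching the 90mdraid directory quoted earlier, and keeps the file name mdadm.conf from the post; the function name is hypothetical. The spaces inside the quotes matter because dracut concatenates these values across configuration files.

```shell
#!/bin/sh
# write_mdraid_dropin (hypothetical helper): write a dracut drop-in that
# always adds the mdraid module. The target directory can be passed as the
# first argument for a dry run outside /etc.
write_mdraid_dropin() {
    conf_dir=${1:-/etc/dracut.conf.d}
    mkdir -p "$conf_dir" || return 1
    printf 'add_dracutmodules+=" mdraid "\n' > "$conf_dir/mdadm.conf"
}

# Possible usage on the real system (as root), followed by regenerating the
# initrd for the target snapshot:
#   write_mdraid_dropin
```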

I also read some dracut code and played a bit with preparing chroot the way it is done by sdbootutil in the logs (plus a local directory mounted at /boot in chroot) and ran bash -x /usr/bin/dracut (<OTHER_PARAMETERS_FROM_LOGS>) and compared it with output of the same command run on the host without chroot, but it rather unsurprisingly turned out that the check() functions of dracut modules run in shells that do not inherit set -x, so the logs were nearly the same. Thus, I still don't know how exactly dracut's $host_fs_types is populated (and so why the current chroot is insufficient).

Once again, thank you for your help, @aplanas.

@prawilny prawilny changed the title transactional-update.service seems to generate a different initrd than transactional-update initrd Bad initrd generation for non-default snapshot when using systemd-boot and dracut modules: mdraid, dracut-sshd Jan 7, 2025
Author

prawilny commented Jan 8, 2025

Update: I realized that when playing with chroot yesterday, I forgot to also bind mount /boot (which is the mount point that utilizes mdraid). After adding bind mounts for /boot (and for /boot/efi, since I decided to test with both) to a chroot created as in the sdbootutil trace, mdraid appeared in the output.
See the attached dracut in chroot output.
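For anyone repeating the experiment, a dry-run sketch that only prints the mount commands for such a chroot. The function name is hypothetical and plain bind mounts are a simplification; sdbootutil's real setup may differ (for example in how /etc is provided). /boot/efi is listed separately because a non-recursive bind mount of /boot does not carry the mounts below it.

```shell
#!/bin/sh
# chroot_mount_plan (hypothetical helper): print, without executing, the bind
# mounts for a chroot in the given snapshot root, including /boot and
# /boot/efi so that dracut can see the mdraid-backed mount point.
chroot_mount_plan() {
    root=${1%/}
    for dir in /proc /dev /sys /var /tmp /etc /boot /boot/efi; do
        printf 'mount --bind %s %s%s\n' "$dir" "$root" "$dir"
    done
}

# Possible usage: review the plan first, then run it as root if it looks right.
#   chroot_mount_plan /.snapshots/37/snapshot
```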

If you want me to try some fixes or answer some questions, just ask.
Apart from that, I think I received the help I needed. Should I close the issue (after your reply)?

Out of curiosity, do you have any plans to prevent others in the future from stepping into this trap of different behavior for default and nondefault snapshot?

Collaborator

aplanas commented Jan 8, 2025

I forgot to also bind mount /boot (which is the mount point that utilizes mdraid)

Oh ... seems to me that you may have a good clue here. sdbootutil is not doing the bind mount of /boot (only root and etc for the correct overlay). I can create a package for you with a version of sdbootutil that does this mount to see if it addresses the problem. I will post the address here in case you want to test it.

Collaborator

aplanas commented Jan 8, 2025

Out of curiosity, do you have any plans to prevent others in the future from stepping into this trap of different behavior for default and nondefault snapshot?

I believe that sdbootutil is doing the right thing for now. There is documentation of this behavior in the help, the plugins are doing the correct call, and it is consistent with how it was designed.

But for transactional-update, when it calls sdbootutil, yes, there is something off here. I can add a new parameter to transactional-update that indicates for which snapshot we want the initrd. But this is not consistent with how it works when the bootloader is the traditional grub2, as the initrd is still created for the current active snapshot. So I am not sure.

Author

prawilny commented Jan 8, 2025

If you want to help me, I'll ask you for your work only if my workaround with adding mdraid to dracut config with a drop-in turns out not to work.

If you want me to test the change you're going to push upstream, I'll gladly help and test it.

To be precise about mounts, in my case sdbootutil mounts all /, /proc, /dev, /sys, /var, /tmp, /etc/, not "only root and etc". Unless by "etc" you meant "et cetera" and not "/etc" :)

Collaborator

aplanas commented Jan 8, 2025

Yes, /proc and the others are mounted for sure [1], but not /boot. After your debugging session I am wondering whether it is required too.

[1] https://github.com/openSUSE/sdbootutil/blob/main/sdbootutil#L608

Author

prawilny commented Jan 8, 2025

I'm also not sure what the correct behavior is. Both for transactional-update and sdbootutil. Just wanted to document the use case.

Could you please link me the sdbootutil documentation you mentioned? Or do I just need to read the scripts?

Collaborator

aplanas commented Jan 8, 2025

Could you please link me the sdbootutil documentation you mentioned?

https://github.com/openSUSE/sdbootutil/blob/main/sdbootutil#L113-L115

Author

prawilny commented Jan 8, 2025

Could you please link me the sdbootutil documentation you mentioned?

https://github.com/openSUSE/sdbootutil/blob/main/sdbootutil#L113-L115

Ah, I thought you said that what's mounted is documented somewhere.
Yes, that is what I had in mind when saying that I should've found the command on my own.
Thanks.

Also, I'll start work soon, so I'll respond only in the evening if there's something to reply to.

aplanas added a commit to aplanas/sdbootutil that referenced this issue Jan 9, 2025
aplanas added a commit to aplanas/sdbootutil that referenced this issue Jan 9, 2025
Collaborator

aplanas commented Jan 9, 2025

@prawilny thanks for your patience. I changed how the chroot is created in sdbootutil in openSUSE/sdbootutil#183, and I packaged it in this repo: https://download.opensuse.org/repositories/home:/aplanas:/branches:/devel:/microos:/images/openSUSE_Tumbleweed/

Do you want to test it? The change will allow dracut to access directories like /root and /boot/efi, but I did not test the mdraid case.

aplanas added a commit to aplanas/sdbootutil that referenced this issue Jan 9, 2025
aplanas added a commit to aplanas/sdbootutil that referenced this issue Jan 10, 2025