
[BUG] Unable to add new disk #6531

Closed
lanfon72 opened this issue Sep 9, 2024 · 9 comments

lanfon72 commented Sep 9, 2024

Describe the bug

Unable to add new disk
(4 screenshots attached)

To Reproduce

Steps to reproduce the behavior:

  1. Install Harvester with nodes that have additional disks
  2. On the Hosts page, edit a host and add a new disk

Expected behavior

The disk should be added successfully.

Environment:

  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): qemu/KVM 3 nodes
  • Harvester ISO version: v1.4.0-dev-20240830
  • ui-source Option: Auto

Additional context

The screenshot from v1.3.2 (same configuration on qemu/KVM):
(screenshot attached)

@lanfon72 lanfon72 added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/1 Function broken (a critical incident with very high impact) area/node-disk-manager reproduce/always Reproducible 100% of the time not-require/test-plan Skip to create a e2e automation test issue labels Sep 9, 2024
@lanfon72 lanfon72 added this to the v1.4.0 milestone Sep 9, 2024
@khushboo-rancher commented:

@lanfon72 Is this happening on v1.3.2 also? Is this a regression?


tserong commented Sep 10, 2024

From the screenshot it looks like node-disk-manager has picked up a dm device, which we actually don't want it to be doing. This is probably because multipath is enabled in the initrd, which has since been fixed by harvester/harvester-installer#831, but even so, we still need to make sure NDM doesn't use dm devices. I'm working in this area ATM anyway, so I'll take this for further investigation. /cc @Vicente-Cheng @ibrokethecloud

@tserong tserong self-assigned this Sep 10, 2024
@albinsun commented:

Looks ok in v1.3.2.
(screenshot attached)


harvesterhci-io-github-bot commented Sep 10, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: this PR

  • Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
    The PR is at: Use BusPath to resolve dm device paths node-disk-manager#130

    • Does the PR include the explanation for the fix or the feature?

@Vicente-Cheng commented:

As @tserong mentioned, this regression was introduced with external root disk support (because we need multipath) and is fixed by PR harvester/harvester-installer#831.

However, on the NDM side, we could improve things by skipping dm devices; we could create another issue for this enhancement. A minimal guard might look like the sketch below.
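
For illustration only, here is a hypothetical check along those lines; `isDeviceMapper` and the kernel-name input are assumptions for this sketch, not NDM's actual filtering code:

```go
package main

import (
	"fmt"
	"strings"
)

// isDeviceMapper is a hypothetical guard (not NDM's actual code) for the
// "skip dm devices" idea above. devName is assumed to be the kernel device
// name as it appears under /dev, e.g. "sda" or "dm-0".
func isDeviceMapper(devName string) bool {
	return strings.HasPrefix(devName, "dm-")
}

func main() {
	for _, name := range []string{"sda", "nvme0n1", "dm-0"} {
		fmt.Printf("%-8s skip=%v\n", name, isDeviceMapper(name))
	}
}
```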


tserong commented Sep 10, 2024

Yeah, the issue is to do with how we map WWNs back through /dev/disk/by-id/ paths in ResolvePersistentDevPath() in NDM when multipath is active.

@lanfon72 if you've still got that system running, I'd be interested to see the output of kubectl -n longhorn-system get bds -o yaml. I suspect you'll see something like this:

- apiVersion: harvesterhci.io/v1beta1
  kind: BlockDevice
  ...
  spec:
    devPath: /dev/sda
    ...
  status:
    deviceStatus:
      ...
      devPath: /dev/dm-0

...which indicates that it picked up the device originally from /dev/sda, then later got confused by multipath being active (which is why status.deviceStatus.devPath shows /dev/dm-0).

This won't be a problem for the next sprint release (because of the fix @Vicente-Cheng mentioned), but I will still do some work to improve ResolvePersistentDevPath().
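
As a rough illustration of the failure mode described above (a standalone sketch, not NDM's `ResolvePersistentDevPath()`; the WWN link name is made up):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

func main() {
	// Hypothetical WWN-based symlink; real names depend on the hardware.
	link := "/dev/disk/by-id/wwn-0x5000c500deadbeef"

	// EvalSymlinks follows the by-id symlink to the actual device node.
	// With multipathd active, udev can repoint this link at the dm device,
	// so the result may be /dev/dm-0 instead of /dev/sda.
	dev, err := filepath.EvalSymlinks(link)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	if strings.HasPrefix(filepath.Base(dev), "dm-") {
		fmt.Println("resolved to a device-mapper node:", dev)
	} else {
		fmt.Println("resolved to the real device:", dev)
	}
}
```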


lanfon72 commented Sep 10, 2024

> Yeah, the issue is to do with how we map WWNs back through /dev/disk/by-id/ paths in ResolvePersistentDevPath() in NDM when multipath is active. [...]

I think the problem is:

  1. The disk path is not correct (users expect a path like /dev/sda).
  2. The disk is added, but its storage/path/scheduling information is not available.

The output of kubectl -n longhorn-system get bds -o yaml:
(screenshot of the output attached)

but the WWN is unknown, which is not a correct value (we did set a WWN for the additional disk).


tserong commented Sep 11, 2024

Thanks @lanfon72. The WWN would have been there originally, but it gets wiped in a subsequent update when status.deviceStatus.devPath changes to /dev/dm-0. I've traced this through the code; now I just need to fix it :-)

tserong added a commit to tserong/node-disk-manager that referenced this issue Sep 11, 2024
If multipathd is running and has taken over a device,
ResolvePersistentDevPath() will end up returning a "/dev/dm-*"
device path, which we don't want.  We want the real underlying
device (e.g. "/dev/sda").  If we take a "/dev/dm-*" path and
later update the blockdevice CR with it, we lose all the
interesting DeviceStatus information, like the WWN.

Happily, in this case, we can figure out the right path to
use by checking "/dev/disk/by-path/" + the device's bus path.
This has the added advantage of also working for block devices
that have no WWN.

Related issue: harvester/harvester#6531

Signed-off-by: Tim Serong <[email protected]>
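
A minimal sketch of the by-path fallback this commit describes; the helper name and the example bus path are assumptions, not NDM's API (see node-disk-manager#130 for the actual implementation):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveByBusPath sketches the fallback described in the commit message:
// join the device's bus path onto /dev/disk/by-path/ and follow the symlink
// to the real node (e.g. /dev/sda), sidestepping any /dev/dm-* alias.
// The helper name and example bus path below are assumptions, not NDM's API.
func resolveByBusPath(busPath string) (string, error) {
	link := filepath.Join("/dev/disk/by-path", busPath)
	return filepath.EvalSymlinks(link)
}

func main() {
	// Example bus path for a SATA disk; actual values vary per system.
	dev, err := resolveByBusPath("pci-0000:00:1f.2-ata-1")
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println("real device:", dev)
}
```

This also works for block devices that have no WWN, since the by-path symlink is derived from the device's bus topology rather than its identifier.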
tserong added a commit to tserong/node-disk-manager that referenced this issue Sep 24, 2024 (same commit message as above)
bk201 pushed a commit to harvester/node-disk-manager that referenced this issue Sep 25, 2024 (same commit message as above)
mergify bot pushed a commit to harvester/node-disk-manager that referenced this issue Sep 25, 2024 (same commit message as above; cherry picked from commit cefcc3d)
bk201 pushed a commit to harvester/node-disk-manager that referenced this issue Sep 25, 2024 (same commit message as above; cherry picked from commit cefcc3d)
@lanfon72 lanfon72 self-assigned this Oct 10, 2024
@lanfon72 commented:

Verified this bug has been fixed.

(screenshot attached)

Test Information

  • Environment: qemu/KVM 3 nodes
  • Harvester Version: v1.4.0-rc1
  • ui-source Option: Auto

Verify Steps

Vicente-Cheng pushed a commit to Vicente-Cheng/node-disk-manager that referenced this issue Dec 3, 2024 (same commit message as above; cherry picked from commit cefcc3d)

Conflicts:
	pkg/controller/blockdevice/controller.go
	  - move the corresponding logic back to blockdevice/controller
	pkg/provisioner/common.go
	  - remove provisioner because we did not introduce it on v0.6.x
WebberHuang1118 pushed a commit to harvester/node-disk-manager that referenced this issue Dec 25, 2024 (same commit message and conflict notes as above; cherry picked from commit cefcc3d)