From 17285698cf7a24689cd2dc68a0aa886fab221fee Mon Sep 17 00:00:00 2001
From: Ravikanth Nalla
Date: Thu, 16 Jan 2025 13:39:01 +0000
Subject: [PATCH 1/4] CASMTRIAGE-7509: Create workaround for USS-1.1 customers

- WAR doc for enabling iscsid and multipathd services to support iSCSI SBPS for USS-1.1 customers.
---
 .../known_issues/enable_iscsid_multipathd.md | 225 ++++++++++++++++++
 1 file changed, 225 insertions(+)
 create mode 100644 troubleshooting/known_issues/enable_iscsid_multipathd.md

diff --git a/troubleshooting/known_issues/enable_iscsid_multipathd.md b/troubleshooting/known_issues/enable_iscsid_multipathd.md
new file mode 100644
index 000000000000..50de43d90d09
--- /dev/null
+++ b/troubleshooting/known_issues/enable_iscsid_multipathd.md
@@ -0,0 +1,225 @@
+# Boot content projection services can fail if `iscsid` and `multipathd` services are not enabled on compute nodes and UANs
+
+## Issue Description
+
+iSCSI based boot content projection which is also known as "Scalable Boot Content Projection" (SBPS) for `rootfs` and `PE` images
+is supported in version CSM 1.6.0 and above. On a customer system, using `CSM-1.6.0` with `USS-1.1.x` on compute nodes/ UANs in order to support AARCH64 images,
+`iscsid` and `multipathd` services are not enabled by default. SBPS will not be resilient across worker node reboots if these services are not enabled by default on compute/ UANs.
+
+## Issue Identification
+
+This issue can be identified by the following symptoms:
+
+On a compute/UAN node (iSCSI Initiator) we can observe the following SQUASHFS error messages in the console log:
+
+```text
+nid000004:~ # dmesg -T | grep "SQUASHFS error" | head -n 1
+[Sat Nov 2 22:32:41 2024] SQUASHFS error: xz decompression failed, data probably corrupt
+```
+
+On a compute/UAN node (iSCSI Initiator) we can observe that the `iscsid` service is not active:
+
+```bash
+nid000004:~ # systemctl status iscsid
+● iscsid.service - Open-iSCSI
+     Loaded: loaded (/usr/lib/systemd/system/iscsid.service; disabled; preset: disabled)
+     Active: active (running) since Wed 2024-11-06 08:16:23 CST; 1 day 4h ago
+TriggeredBy: ● iscsid.socket
+```
+
+From the `journalctl` logs:
+
+```text
+Nov 07 10:24:23 nid000004 iscsid[22286]: iscsid: Kernel reported iSCSI connection 2:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
+Nov 07 10:25:14 nid000004 iscsid[22286]: iscsid: connect to 10.253.0.3:3260 failed (No route to host)
+...
+Nov 07 10:30:43 nid000004 iscsid[22286]: iscsid: connect to 10.253.0.3:3260 failed (Connection refused)
+...
+```
+
+## Workaround Description
+
+### 1: Get the version of `csm-packages` from the compute node BOS session template
+
+Example:
+
+Find the session template name:
+
+```bash
+ncn-m001:~ # cray bos sessiontemplates list | grep compute-
+…
+enable_cfs = true
+name = "compute-25.1.0-alpha2.x86_64-csm-160-rc4"
+…
+```
+
+Find the `configuration` name in the `sessiontemplates describe` output:
+
+```bash
+ncn-m001:~ # cray bos sessiontemplates describe compute-25.1.0-alpha2.x86_64-csm-160-rc4 --format json
+{…
+  "cfs": {
+    "configuration": "compute-25.1.0-alpha2-csm-160-rc4"
+  },
+…
+}
+```
+
+Use the `configuration` value to describe the configuration:
+
+```bash
+ncn-m001:~ # cray cfs configurations describe compute-25.1.0-alpha2-csm-160-rc4 --format json
+{
+  "lastUpdated": "2024-11-02T12:15:21Z",
+  "layers": [
+    {
+      "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git",
+      "commit": "d530f0a277c9d5dc9e3cb487d32d6b316757f00e",
+      "name": "csm-packages-1.6.0-rc.4",
+      "playbook": "csm_packages.yml"
+    },
+…
+```
+
+The `name` from the describe above identifies product catalog, the version is after `csm-packages-`
+
+### 2: Get the corresponding `csm-config` branch (@VCS) from product catalog given `csm-packages-*` version found from `Step-1`
+
+Example:
+
+```bash
+ncn-m001:~ # kubectl get cm -n services cray-product-catalog -o yaml | yq r - 'data.csm' | grep ^1.6.0-rc.4: -A 10
+1.6.0-rc.4:
+  configuration:
+    clone_url: https://vcs.cmn.fanta.hpc.amslabs.hpecorp.net/vcs/cray/csm-config-management.git
+    commit: d530f0a277c9d5dc9e3cb487d32d6b316757f00e
+    import_branch: cray/csm/1.27.2
+```
+
+The `import_branch` value from the output above is used in the following steps.
+
+### 3: Log in to VCS and clone `csm-config-management.git`
+
+```bash
+ncn-m001:/home/ # USERNAME=$( kubectl get secrets -n services vcs-user-credentials -o json | jq -r .data.vcs_username | base64 -d )
+ncn-m001:/home/ # PSWD=$( kubectl get secrets -n services vcs-user-credentials -o json | jq -r .data.vcs_password | base64 -d )
+ncn-m001:/home/ # git clone https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git
+```
+
+**Note:** Use the `$USERNAME` and `$PSWD` values above to log in to VCS.
+
+### 4: Apply the fix against the `import_branch` found in step 2
+
+Example:
+
+```bash
+ncn-m001:/home/ # cd csm-config-management
+ncn-m001:/home/csm-config-management # git checkout cray/csm/1.27.2
+ncn-m001:/home/csm-config-management # git checkout -b CASMTRIAGE-7509
+```
+
+**Note:** `cray/csm/1.27.2` is the target branch and `CASMTRIAGE-7509` is the new working branch.
+
+Add a new role to enable the `iscsid` and `multipathd` services:
+
+```bash
+mkdir -p roles/csm.enable_iscsid_multipathd/tasks && cat > roles/csm.enable_iscsid_multipathd/tasks/main.yml << EOF
+---
+- name: Ensure iscsid service is started
+  ansible.builtin.systemd:
+    name: iscsid
+    state: started
+    enabled: true
+
+- name: Ensure multipathd service is started
+  ansible.builtin.systemd:
+    name: multipathd
+    state: started
+    enabled: true
+EOF
+```
+
+Apply the following changes to `csm-config-management/csm_packages.yml`, in the Application-nodes only play and the Compute-nodes only play under `csm_services`, in order to enable the `iscsid` and `multipathd` services.
+
+```diff
+diff --git a/csm_packages.yml b/csm_packages.yml
+index e3366f8..b223aec 100755
+--- a/csm_packages.yml
++++ b/csm_packages.yml
+@@ -137,6 +137,9 @@
+       vars:
+         packages: "{{application_csm_sles_packages }}"
+       when: ansible_distribution_file_variety == "SUSE"
++    # Enable iscsid and multipathd service
++    - role: csm.enable_iscsid_multipathd
++
+   tasks:
+     - name: Enable smart service
+       systemd:
+@@ -148,3 +151,12 @@
+         name: cray-node-exporter
+         state: started
+         enabled: true
++
++# Compute-nodes only play
++- hosts: Compute:!cfs_image
++  gather_facts: no
++  any_errors_fatal: true
++  remote_user: root
++  roles:
++    # Enable iscsid and multipathd service
++    - role: csm.enable_iscsid_multipathd
+```
+
+### 5: Commit the changes and push them to VCS
+
+Example:
+
+```bash
+ncn-m001:/home/csm-config-management # git add csm_packages.yml roles/csm.enable_iscsid_multipathd
+ncn-m001:/home/csm-config-management # git commit -m "fix for CASMTRIAGE-7509"
+ncn-m001:/home/csm-config-management # git push --set-upstream origin CASMTRIAGE-7509
+
+ncn-m001:/home/csm-config-management # COMMIT=$(git log -1 --pretty='format:%H')
+ncn-m001:/home/csm-config-management # echo $COMMIT
+bf214b8a9867531a38f8ca28b6ffae1fe56724ce
+```
+
+### 6: Create a new CFS configuration with the above change to be applied
+
+Example:
+
+```bash
+ncn-m001:/home/csm-config-management # SESSIONTEMPLATE=compute-25.1.0-alpha2.x86_64-csm-160-rc4
+ncn-m001:/home/csm-config-management # CFS_CONFIG=`cray bos sessiontemplates describe $SESSIONTEMPLATE --format json | jq -r .cfs.configuration`
+ncn-m001:/home/csm-config-management # cray cfs configurations describe $CFS_CONFIG --format json | jq '. | del(.lastUpdated) | del(.name)' > $CFS_CONFIG
+```
+
+### 7: Update the commit ID for `csm_packages.yml` in `$CFS_CONFIG` (use `$COMMIT` from step 5)
+
+```bash
+ncn-m001:/home/csm-config-management # vim $CFS_CONFIG
+```
+
+```bash
+ncn-m001:/home/csm-config-management # cat $CFS_CONFIG
+{
+  "layers": [
+    {
+      "cloneUrl": "https://api-gw-service-nmn.local/vcs/cray/csm-config-management.git",
+      "commit": "bf214b8a9867531a38f8ca28b6ffae1fe56724ce",
+      "name": "csm-packages-1.6.0-rc.4",
+      "playbook": "csm_packages.yml"
+    },
+…
+```
+
+### 8: Update the CFS configuration
+
+```bash
+ncn-m001:/home/csm-config-management # cray cfs configurations update --file $CFS_CONFIG $CFS_CONFIG
+```
+
+### 9: Create a new BOS session template with the new configuration
+
+Refer to [Create BOS session template for iSCSI SBPS](../../operations/boot_orchestration/Create_a_Session_Template_to_Boot_Compute_Nodes_with_SBPS.md).

From 1697e74d6c5bf81dd488607074569103d38d1195 Mon Sep 17 00:00:00 2001
From: ravikanth-nalla-hpe <140072234+ravikanth-nalla-hpe@users.noreply.github.com>
Date: Mon, 27 Jan 2025 14:24:33 +0530
Subject: [PATCH 2/4] Update enable_iscsid_multipathd.md

Signed-off-by: ravikanth-nalla-hpe <140072234+ravikanth-nalla-hpe@users.noreply.github.com>
---
 .../known_issues/enable_iscsid_multipathd.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/troubleshooting/known_issues/enable_iscsid_multipathd.md b/troubleshooting/known_issues/enable_iscsid_multipathd.md
index 50de43d90d09..26f8351cb01f 100644
--- a/troubleshooting/known_issues/enable_iscsid_multipathd.md
+++ b/troubleshooting/known_issues/enable_iscsid_multipathd.md
@@ -3,21 +3,21 @@
 ## Issue Description
 
 iSCSI based boot content projection which is also known as "Scalable Boot Content Projection" (SBPS) for `rootfs` and `PE` images
-is supported in version CSM 1.6.0 and above. On a customer system, using `CSM-1.6.0` with `USS-1.1.x` on compute nodes/ UANs in order to support AARCH64 images,
-`iscsid` and `multipathd` services are not enabled by default. SBPS will not be resilient across worker node reboots if these services are not enabled by default on compute/ UANs.
+is supported in CSM version 1.6.0 and later. On a customer system, using `CSM-1.6.0` with `USS-1.1.x` on compute nodes or UANs in order to support AARCH64 images,
+`iscsid` and `multipathd` services are not enabled by default. SBPS will not be resilient across worker node reboots if these services are not enabled by default on compute nodes or UANs.
 
 ## Issue Identification
 
 This issue can be identified by the following symptoms:
 
-On a compute/UAN node (iSCSI Initiator) we can observe the following SQUASHFS error messages in the console log:
+On a compute node or UAN (iSCSI Initiator) we can observe the following SQUASHFS error messages in the console log:
 
 ```text
 nid000004:~ # dmesg -T | grep "SQUASHFS error" | head -n 1
 [Sat Nov 2 22:32:41 2024] SQUASHFS error: xz decompression failed, data probably corrupt
 ```
 
-On a compute/UAN node (iSCSI Initiator) we can observe that the `iscsid` service is not active:
+On a compute node or UAN (iSCSI Initiator) we can observe that the `iscsid` service is not active:
 
 ```bash
 nid000004:~ # systemctl status iscsid
@@ -81,7 +81,7 @@ ncn-m001:~ # cray cfs configurations describe compute-25.1.0-alpha2-csm-160-rc4
 …
 ```
 
-The `name` from the describe above identifies product catalog, the version is after `csm-packages-`
+The `name` from the describe above identifies the product catalog. Use the version after `csm-packages-` in the next step.
 
 ### 2: Get the corresponding `csm-config` branch (@VCS) from product catalog given `csm-packages-*` version found from `Step-1`
 

From 90735f69951b5f6c0054e54b481b17ecb008da98 Mon Sep 17 00:00:00 2001
From: ravikanth-nalla-hpe <140072234+ravikanth-nalla-hpe@users.noreply.github.com>
Date: Wed, 29 Jan 2025 23:40:52 +0530
Subject: [PATCH 3/4] Update enable_iscsid_multipathd.md

Signed-off-by: ravikanth-nalla-hpe <140072234+ravikanth-nalla-hpe@users.noreply.github.com>
---
 .../known_issues/enable_iscsid_multipathd.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/troubleshooting/known_issues/enable_iscsid_multipathd.md b/troubleshooting/known_issues/enable_iscsid_multipathd.md
index 26f8351cb01f..294d5557c7de 100644
--- a/troubleshooting/known_issues/enable_iscsid_multipathd.md
+++ b/troubleshooting/known_issues/enable_iscsid_multipathd.md
@@ -12,15 +12,20 @@ This issue can be identified by the following symptoms:
 
 On a compute node or UAN (iSCSI Initiator) we can observe the following SQUASHFS error messages in the console log:
 
-```text
-nid000004:~ # dmesg -T | grep "SQUASHFS error" | head -n 1
+```bash
+dmesg -T | grep "SQUASHFS error" | head -n 1
+```
+
+Example output:
+
+```
 [Sat Nov 2 22:32:41 2024] SQUASHFS error: xz decompression failed, data probably corrupt
 ```
 
 On a compute node or UAN (iSCSI Initiator) we can observe that the `iscsid` service is not active:
 
 ```bash
-nid000004:~ # systemctl status iscsid
+ncn-s004# systemctl status iscsid
 ● iscsid.service - Open-iSCSI
      Loaded: loaded (/usr/lib/systemd/system/iscsid.service; disabled; preset: disabled)
      Active: active (running) since Wed 2024-11-06 08:16:23 CST; 1 day 4h ago

From 079f780fcd39fb634cfd7baf01fa9f0cee83670a Mon Sep 17 00:00:00 2001
From: ravikanth-nalla-hpe <140072234+ravikanth-nalla-hpe@users.noreply.github.com>
Date: Wed, 29 Jan 2025 23:46:18 +0530
Subject: [PATCH 4/4] Update enable_iscsid_multipathd.md

Signed-off-by: ravikanth-nalla-hpe <140072234+ravikanth-nalla-hpe@users.noreply.github.com>
---
 troubleshooting/known_issues/enable_iscsid_multipathd.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/troubleshooting/known_issues/enable_iscsid_multipathd.md b/troubleshooting/known_issues/enable_iscsid_multipathd.md
index 294d5557c7de..1e5fc06c3df4 100644
--- a/troubleshooting/known_issues/enable_iscsid_multipathd.md
+++ b/troubleshooting/known_issues/enable_iscsid_multipathd.md
@@ -18,7 +18,7 @@ dmesg -T | grep "SQUASHFS error" | head -n 1
 
 Example output:
 
-```
+```text
 [Sat Nov 2 22:32:41 2024] SQUASHFS error: xz decompression failed, data probably corrupt
 ```
 
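
Review note on the series: the document stops at creating the new BOS session template and does not show how to confirm the result on a node. A possible check is sketched below, assuming only the standard `systemctl is-enabled` and `systemctl is-active` subcommands. The function names `check_service` and `check_sbps_services` and the `SYSTEMCTL` override variable are hypothetical and are not part of the patches or of CSM.

```shell
#!/usr/bin/env bash
# Sketch only: report whether the services that the workaround enables are
# both enabled (survive a reboot) and active (running now).
# SYSTEMCTL is a hypothetical override, useful for dry runs with a stub.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print one status line for a service; succeed only if enabled and active.
check_service() {
    local svc="$1" enabled active
    # Both subcommands exit non-zero for "disabled"/"inactive", so the
    # exit status is ignored here and only the printed state is compared.
    enabled="$("$SYSTEMCTL" is-enabled "$svc" 2>/dev/null)" || true
    active="$("$SYSTEMCTL" is-active "$svc" 2>/dev/null)" || true
    printf '%s enabled=%s active=%s\n' "$svc" "${enabled:-unknown}" "${active:-unknown}"
    [[ "$enabled" == "enabled" && "$active" == "active" ]]
}

# Check both services named in the workaround; fail if either check fails.
check_sbps_services() {
    local rc=0 svc
    for svc in iscsid multipathd; do
        check_service "$svc" || rc=1
    done
    return "$rc"
}
```

Run on a compute node or UAN after it has booted with the updated configuration, `check_sbps_services && echo OK` would print one status line per service and `OK` only when both are enabled and active. Checking `is-enabled` separately matters here because the `systemctl status` example in the document shows `iscsid` running while still `disabled`.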