Deploy Management Nodes

The following procedure deploys Linux and Kubernetes software to the management NCNs. Deployment of the nodes starts with booting the storage nodes, followed by the master nodes and worker nodes together.

After the operating system boots on each node, there are some configuration actions which take place. Watching the console or the console log for certain nodes can help to understand what happens and when. When the process completes for all nodes, the Ceph storage is initialized and the Kubernetes cluster is created and ready for a workload. The PIT node will join Kubernetes after it is rebooted later in Deploy Final NCN.

Timing of deployments

The timing of each set of boots varies with the hardware. Nodes from some manufacturers POST faster than others, and POST time also varies with BIOS settings. After powering on a set of nodes, an administrator can expect a healthy boot session to take about 60 minutes, depending on the number of storage and worker nodes.

Topics

  1. Prepare for management node deployment
    1. Tokens and IPMI password
    2. BIOS baseline
  2. Deploy management nodes
    1. Deploy storage NCNs
    2. Deploy Kubernetes NCNs
    3. Configure kubectl on the PIT
  3. Validate deployment
  4. Next topic

1. Prepare for management node deployment

Preparation of the environment must be done before attempting to deploy the management nodes.

1.1 Tokens and IPMI password

  1. (pit#) Define shell environment variables that will simplify later commands to deploy management nodes.

    1. Set USERNAME and IPMI_PASSWORD to the credentials for the NCN BMCs.

      read -s is used to prevent the password from being written to the screen or the shell history.

      USERNAME=root
      read -r -s -p "NCN BMC ${USERNAME} password: " IPMI_PASSWORD
    2. Set the remaining helper variables.

      These values do not need to be altered from what is shown.

      export IPMI_PASSWORD ; mtoken='ncn-m(?!001)\w+-mgmt' ; stoken='ncn-s\w+-mgmt' ; wtoken='ncn-w\w+-mgmt'
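
The three tokens are Perl-compatible regular expressions that select BMC hostnames from /etc/dnsmasq.d/statics.conf. The negative lookahead in mtoken leaves out ncn-m001-mgmt, the BMC of the node that is currently running as the PIT. The sketch below runs the same grep against a throwaway file to show which names are selected. The host-record lines and addresses are made-up sample data and are only an assumption about what statics.conf looks like.

```shell
# Sample data only: these lines are hypothetical, not copied from a real statics.conf.
sample=$(mktemp)
cat > "${sample}" <<'EOF'
host-record=ncn-m001-mgmt,10.254.1.2
host-record=ncn-m002-mgmt,10.254.1.4
host-record=ncn-s001-mgmt,10.254.1.10
host-record=ncn-w001-mgmt,10.254.1.16
host-record=ncn-w001,ncn-w001.nmn,10.252.1.7
EOF

# Same token definitions as in the step above.
mtoken='ncn-m(?!001)\w+-mgmt' ; stoken='ncn-s\w+-mgmt' ; wtoken='ncn-w\w+-mgmt'

# -o prints only the matched hostname; -P enables the lookahead used by mtoken.
matched=$(grep -oP "(${mtoken}|${stoken}|${wtoken})" "${sample}" | sort -u)
echo "${matched}"

rm -f "${sample}"
```

Only names ending in -mgmt are selected, so the node's own hostname (ncn-w001 in the last sample line) is not passed to ipmitool.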

1.2 BIOS baseline

  1. (pit#) If the NCNs are HPE hardware, then ensure that DCMI/IPMI is enabled.

    This will enable ipmitool usage with the BMCs.

    /root/bin/bios-baseline.sh
  2. (pit#) Check power status of all NCNs.

    grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
          xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power status
  3. (pit#) Power off all NCNs.

    grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
          xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power off
  4. (pit#) Clear CMOS; ensure default settings are applied to all NCNs.

    NOTE: Gigabyte Servers and Intel Servers should SKIP THIS STEP.

    Resetting the CMOS will:

    • Disable Hyper-Threading® on Intel CPUs; there is no way to enable it remotely through CSM at this time.
    • Disable VT-x, AMD-V, SVM, VT-d, and AMD IOMMU for Virtualization, on both AMD and Intel CPUs; there is no way to enable them at this time.
    grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
          xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} chassis bootdev none options=clear-cmos
  5. (pit#) Boot NCNs to BIOS to allow the CMOS to reinitialize.

    grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
          xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} chassis bootdev bios options=efiboot
    grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
          xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power on
  6. (pit#) Run bios-baseline.sh.

    NOTE: For HPE servers, this should still be done, even though it was already run earlier in the procedure.

    /root/bin/bios-baseline.sh
  7. (pit#) Power off the nodes.

    grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u |
          xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power off
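
Before moving on, it can be useful to confirm that every BMC actually reports the node as powered off. The following is a minimal sketch of such a check, not part of the CSM tooling: the function name bmcs_still_on is hypothetical, and it assumes USERNAME and the exported IPMI_PASSWORD from section 1.1 are still set. It relies on ipmitool printing "Chassis Power is off" for a node that is off.

```shell
# bmcs_still_on TOKEN [STATICS_FILE]   (hypothetical helper)
# Prints one line for every BMC that does not report "Chassis Power is off".
# Prints nothing when all selected nodes are off.
bmcs_still_on() {
    grep -oP "$1" "${2:-/etc/dnsmasq.d/statics.conf}" | sort -u |
    while read -r bmc; do
        # stdin is redirected so the command cannot consume the hostname list.
        state=$(ipmitool -I lanplus -U "${USERNAME}" -E -H "${bmc}" power status </dev/null 2>&1)
        case "${state}" in
            *"Chassis Power is off"*) ;;          # node is off, nothing to report
            *) echo "${bmc}: ${state}" ;;         # still on, or the BMC returned an error
        esac
    done
}
```

A possible invocation is `bmcs_still_on "(${mtoken}|${stoken}|${wtoken})"`; empty output means every selected node is off, and BMC errors are reported alongside nodes that are still on.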

2. Deploy management nodes

Deployment starts with booting the storage nodes. Then the master nodes and worker nodes are booted together. After the operating system boots on each node, several configuration actions take place. Watching the console or the console log for certain nodes can help to understand what happens and when. When the process is complete for all nodes, the Ceph storage will have been initialized and the Kubernetes cluster will have been created and be ready for a workload.

  1. (pit#) Customize boot scripts for any out-of-baseline NCNs if needed (see below).

    • See the Plan of Record and compare against the server's hardware.
    • If modifications are needed for the PCIe hardware, then see Customize PCIe Hardware.
    • If modifications for disk usage are necessary, then see Customize Disk Hardware.
    • If any customizations were made, back up the new boot scripts in /var/www/ncn-*/script.ipxe so that they are available for a reinstallation (e.g. tar -czvf $SYSTEM_NAME-boot-scripts.tar.gz /var/www/ncn-*/script.ipxe).
  2. (pit#) Set each node to always UEFI network boot, and ensure that they are powered off.

    grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} chassis bootdev pxe options=efiboot,persistent
    grep -oP "(${mtoken}|${stoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power off

    NOTE: The NCN boot order is further explained in NCN Boot Workflow.
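
The boot script backup mentioned in step 1 can be followed by a listing of the archive, so that a missing script.ipxe is noticed right away. The function below is a minimal sketch of that idea; backup_boot_scripts is a hypothetical name, and unlike the tar command in step 1 it stores paths relative to the web root rather than absolute paths.

```shell
# backup_boot_scripts WEBROOT ARCHIVE   (hypothetical helper, not shipped with CSM)
# WEBROOT is the directory holding the ncn-* directories (/var/www on the PIT).
# ARCHIVE must be an absolute path because the function changes directory.
backup_boot_scripts() {
    # Archive every per-node boot script, with paths relative to WEBROOT.
    ( cd "$1" && tar -czf "$2" ncn-*/script.ipxe ) || return 1
    # List the archive contents as a quick sanity check.
    tar -tzf "$2"
}
```

For example, `backup_boot_scripts /var/www "$(pwd)/${SYSTEM_NAME}-boot-scripts.tar.gz"` writes the archive to the current directory and prints one line per saved script.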

2.1 Deploy storage NCNs

  1. (pit#) Boot the storage NCNs.

    grep -oP "${stoken}" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power on 
  2. (pit#) Observe the installation through the console of ncn-s001-mgmt.

    conman -j ncn-s001-mgmt

    From there, an administrator can witness console output for the cloud-init scripts.

    NOTES:

    • Watch the storage node consoles carefully for error messages. If any are seen, consult Ceph-CSI Troubleshooting.
    • If the nodes have PXE boot issues (for example, getting PXE errors, or not pulling the ipxe.efi binary), then see PXE boot troubleshooting.
    • If the ncn-s001 console shows the message 'Sleeping for five seconds waiting ceph to be healthy...' for an extended period of time, then see Utility Storage Installation Troubleshooting.
    • During the deployment of storage NCNs, the console may show errors regarding cray-heartbeat.service. These are expected until the PIT is deployed as ncn-m001.
  3. (pit#) Wait for storage nodes to output the following before booting Kubernetes master nodes and worker nodes.

    ...sleeping 5 seconds until /etc/kubernetes/admin.conf 
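
If the console output is also being written to a log file, the wait can be scripted instead of watched. The following is a minimal sketch under that assumption; wait_for_console_text is a hypothetical helper, and the log path in the usage note is an assumption about where conman writes its logs, so it should be checked against the local conman configuration.

```shell
# wait_for_console_text LOGFILE TEXT [TIMEOUT_SECONDS]   (hypothetical helper)
# Polls LOGFILE every 5 seconds until it contains TEXT (matched literally).
# Returns 0 when the text is found and 1 when the timeout (default 3600) expires.
wait_for_console_text() {
    waited=0
    until grep -qF -- "$2" "$1" 2>/dev/null; do
        if [ "${waited}" -ge "${3:-3600}" ]; then
            echo "timed out waiting for: $2" >&2
            return 1
        fi
        sleep 5
        waited=$((waited + 5))
    done
}
```

A possible invocation, with an assumed log location, is `wait_for_console_text /var/log/conman/console.ncn-s001-mgmt 'sleeping 5 seconds until /etc/kubernetes/admin.conf'`.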
    

2.2 Deploy Kubernetes NCNs

  1. (pit#) Boot the Kubernetes NCNs.

    grep -oP "(${mtoken}|${wtoken})" /etc/dnsmasq.d/statics.conf | sort -u | xargs -t -i ipmitool -I lanplus -U "${USERNAME}" -E -H {} power on
  2. (pit#) Start watching the first Kubernetes master's console.

    Either stop watching ncn-s001-mgmt before doing this, or do it in a different window.

    NOTE: To exit a conman console, press & followed by a . (e.g. keystroke &.)

    1. Determine the first Kubernetes master.

      FM=$(jq -r '."Global"."meta-data"."first-master-hostname"' "${PITDATA}"/configs/data.json)
      echo ${FM}
    2. Open its console.

      conman -j "${FM}-mgmt"

    NOTES:

    • If the nodes have PXE boot issues (e.g. getting PXE errors, not pulling the ipxe.efi binary), then see Troubleshooting PXE Boot.
    • If one of the master nodes seems hung waiting for the storage nodes to create a secret, then check the storage node consoles for error messages. If any are found, then consult CEPH CSI Troubleshooting.
  3. (pit#) Wait for the deployment to finish.

    1. Wait for the first Kubernetes master to complete cloud-init.

      The following text should appear in the console of the first Kubernetes master:

      The system is finally up, after 995.71 seconds cloud-init has come to completion.
      

      NOTES:

      • The duration reported will vary.
      • All NCNs should report the above text when they have completed their Ceph or Kubernetes installation.
    2. Validate that all master and worker NCNs (except for ncn-m001) show up in the cluster.

      Enter the root password for the first Kubernetes master node, if prompted.

      ssh "${FM}" kubectl get nodes -o wide

      Expected output looks similar to the following:

      NAME       STATUS   ROLES                  AGE     VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                  KERNEL-VERSION                 CONTAINER-RUNTIME
      ncn-m002   Ready    control-plane,master   7m39s   v1.22.13   10.252.1.5    <none>        SUSE Linux Enterprise High Performance Computing 15 SP5   5.14.21-150500.55.12-default   containerd://1.5.16
      ncn-m003   Ready    control-plane,master   7m16s   v1.22.13   10.252.1.6    <none>        SUSE Linux Enterprise High Performance Computing 15 SP5   5.14.21-150500.55.12-default   containerd://1.5.16
      ncn-w001   Ready    <none>                 7m16s   v1.22.13   10.252.1.7    <none>        SUSE Linux Enterprise High Performance Computing 15 SP5   5.14.21-150500.55.12-default   containerd://1.5.16
      ncn-w002   Ready    <none>                 7m18s   v1.22.13   10.252.1.8    <none>        SUSE Linux Enterprise High Performance Computing 15 SP5   5.14.21-150500.55.12-default   containerd://1.5.16
      ncn-w003   Ready    <none>                 7m16s   v1.22.13   10.252.1.9    <none>        SUSE Linux Enterprise High Performance Computing 15 SP5   5.14.21-150500.55.12-default   containerd://1.5.16
      
  4. (pit#) Stop watching the consoles.

    Exit the first master's console; also exit the console for ncn-s001, if it was left open.

    NOTE: To exit a conman console, press & followed by a . (e.g. keystroke &.)
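
The node check in step 3 can also be reduced to a list of nodes that are not yet Ready, which is easier to scan on larger systems. The filter below is a minimal sketch; not_ready_nodes is a hypothetical name, and it assumes the STATUS value is the second column, as in the expected output shown above.

```shell
# not_ready_nodes   (hypothetical helper)
# Reads `kubectl get nodes --no-headers` output on stdin and prints the name of
# every node whose STATUS column is anything other than exactly "Ready".
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}
```

A possible invocation is `ssh "${FM}" kubectl get nodes --no-headers | not_ready_nodes`; empty output means every listed node is Ready. A node that is Ready but cordoned (STATUS "Ready,SchedulingDisabled") is also printed by this filter.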

2.3 Configure kubectl on the PIT

  1. (pit#) Define the first Kubernetes master again if resuming or starting the procedure here.

    FM was already set in a previous step; it only needs to be redefined in a new shell session.

    NOTE: This requires that the set reusable environment variables step was completed; PITDATA must be defined in the user's environment before continuing.

    FM=$(jq -r '."Global"."meta-data"."first-master-hostname"' "${PITDATA}"/configs/data.json)
    echo ${FM}
  2. (pit#) Copy the Kubernetes configuration file from the first master node to the LiveCD.

    This will allow kubectl to work from the PIT node.

    mkdir -pv ~/.kube
    scp "${FM}.nmn:/etc/kubernetes/admin.conf" ~/.kube/config

3. Validate deployment

  1. (pit#) Ensure that the working directory is the prep directory.

    cd "${PITDATA}/prep"
  2. (pit#) Check cabling.

    See SHCD check cabling guide.

  3. (pit#) Check the storage nodes.

    csi pit validate --ceph

    For assistance resolving failed tests, see the following pages:

  4. (pit#) Check the master and worker nodes.

    csi pit validate --k8s

Next topic

After completing the deployment of the management nodes, the next step is to install the CSM services.

See Install CSM Services.