ML Stack - Blast Breaker Machine #1

Open
4 tasks
jaylong255 opened this issue Sep 27, 2024 · 11 comments
@jaylong255
jaylong255 commented Sep 27, 2024

  1. Install CUDA Toolkit
  2. Install cuDNN
  3. Install NVIDIA Drivers
  4. Set Up Python Environment
@jaylong255

Here's a high-level overview of preparing a laptop with an NVIDIA GeForce GPU running Ubuntu to run LLaMA models:

  1. Install CUDA Toolkit:

    • Download and install the CUDA Toolkit from NVIDIA's website. This is essential for GPU acceleration. Ensure you pick a version compatible with your GPU and Ubuntu version.
  2. Install cuDNN:

    • Obtain cuDNN, which provides primitives for deep networks. You'll need to sign up on NVIDIA's developer site to download this. Make sure it matches your CUDA version.
  3. Install NVIDIA Drivers:

    • Ensure you have the latest NVIDIA drivers. They might already be installed if you're using the proprietary drivers from the Ubuntu repositories or directly from NVIDIA.
  4. Set Up Python Environment:

    • Install Python (3.9 or newer is a safe choice; check the minimum version required by the PyTorch release you plan to install).
    • Use conda or venv to create a virtual environment to manage dependencies without cluttering your system.
  5. Install Necessary Libraries:

    • pip install torch torchvision torchaudio - PyTorch with CUDA support for GPU acceleration.
    • pip install transformers - Hugging Face's Transformers library, which supports LLaMA and many other models.
  6. Download the LLaMA Model:

    • LLaMA weights are gated rather than openly downloadable: you need to request access and accept Meta's license terms before you can download them.
    • Once you have access, download the weights with the Hugging Face CLI (huggingface-cli login, then huggingface-cli download) or manually from the source you were granted access to.
  7. Model Loading and Running:

    • Use the transformers library to load the model:
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "your-llama-model-path"  # Replace with a local path or a model ID you have access to
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      # Load the weights in half precision; the float32 default needs roughly twice the memory
      model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
      model.to("cuda")  # Move model to GPU
    • Generate text with:
      inputs = tokenizer("Your input text", return_tensors="pt").to("cuda")
      outputs = model.generate(**inputs, max_new_tokens=50)  # Cap the length of the generated continuation
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  8. Optimize for GPU Usage:

    • If running into memory issues, consider using 8-bit quantization, offloading, or model parallelism techniques provided by libraries like bitsandbytes or deepspeed.
  9. Testing and Tweaking:

    • Run some tests, maybe start with smaller models or shorter sequences to ensure everything works.
    • Adjust batch sizes or sequence lengths based on your GPU's memory.
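Before loading a model, it can help to confirm that PyTorch actually sees the GPU. The following is a minimal sketch, not part of the original steps; the helper name describe_gpu is hypothetical, and the script is written so that it still runs (and says so) when PyTorch is not installed.

```python
import importlib.util


def describe_gpu() -> str:
    """Return a one-line summary of the GPU that PyTorch can see, if any."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch  # imported lazily so the script still runs without PyTorch

    if not torch.cuda.is_available():
        return "PyTorch is installed but cannot see a CUDA device"
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 2**30
    return f"{props.name}: {total_gib:.1f} GiB VRAM, CUDA {torch.version.cuda}"


if __name__ == "__main__":
    print(describe_gpu())
```

If this reports that no CUDA device is visible, the driver or the PyTorch build is the likely culprit rather than the model code.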

Remember, running large models like LLaMA, especially the bigger variants, might require significant GPU memory. If your GeForce GPU has limited VRAM, you might need to use model parallelism or offloading strategies, or consider models that are optimized for lower memory usage.
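As a rough illustration of why VRAM matters, the weights alone need about (number of parameters) × (bytes per parameter). The sketch below is back-of-the-envelope arithmetic under that assumption only; it ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name and the 7-billion-parameter example are illustrative.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"7B parameters in {dtype}: {weight_memory_gib(7e9, dtype):.1f} GiB")
# 7B parameters in float32: 26.1 GiB
# 7B parameters in float16: 13.0 GiB
# 7B parameters in int8: 6.5 GiB
```

Comparing these figures with the VRAM on your card gives a quick first answer on whether half precision is enough or 8-bit quantization or offloading is needed.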

Also, ensure your laptop's power settings are configured for performance to avoid thermal throttling during extended model runs.

@jaylong255

Enable SSH

sudo apt install -y openssh-server
sudo systemctl enable ssh

@jaylong255

sudo apt install -y build-essential libssl-dev dkms

@jaylong255

sudo sh cuda_12.6.1_560.35.03_linux.run

@jaylong255

Here's a step-by-step guide on installing the CUDA Toolkit on an Ubuntu system with an NVIDIA GeForce GPU:

Step 1: Determine Your System Specifications

  • Ubuntu Version: Check what version of Ubuntu you're running (lsb_release -a).
  • GPU Compatibility: Use nvidia-smi to check your GPU model. Ensure it's compatible with the CUDA version you plan to install.
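The GPU check above can also be scripted. This is a small sketch that assumes nvidia-smi is on the PATH and uses its --query-gpu and --format=csv,noheader options; the helper names are hypothetical, and the script returns an empty list rather than failing when no driver is installed yet.

```python
import shutil
import subprocess


def parse_gpu_query(output: str) -> list[tuple[str, str]]:
    """Parse 'name, driver_version' CSV lines into (name, driver) pairs."""
    rows = []
    for line in output.splitlines():
        if line.strip():
            # Split on the last comma so a comma inside the GPU name is kept.
            name, driver = (field.strip() for field in line.rsplit(",", 1))
            rows.append((name, driver))
    return rows


def query_gpus() -> list[tuple[str, str]]:
    """Ask nvidia-smi for each GPU's name and driver version."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools installed yet
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_query(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    print(query_gpus() or "nvidia-smi not available or no GPU reported")
```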

Step 2: Download the CUDA Toolkit

  • Visit the NVIDIA CUDA Download page.
  • Choose a version compatible with your Ubuntu version and GPU. As of 2024, CUDA 12.x is the current release series, but check the download page for the latest.
  • Download the .run file for Linux (e.g., cuda_12.6.1_560.35.03_linux.run, the installer used in this thread).

Step 3: Install Dependencies

  • Before you install CUDA, make sure your system has the necessary dependencies. Open a terminal and run:
    sudo apt update
    sudo apt install -y build-essential libssl-dev dkms

Step 4: Install CUDA Toolkit

  • Navigate to the directory where you downloaded the .run file:
    cd /path/to/downloaded/file
  • Make the file executable:
    chmod +x cuda*.run
  • Run the installation script. This step might take some time:
    sudo ./cuda*.run
  • Follow the prompts: accept the license agreement, then review the component menu (Driver, CUDA Toolkit, and so on) and choose Install.

Step 5: Set Up Environment Variables

  • You need to add CUDA's bin and lib64 directories to your system's PATH and library path. Edit your .bashrc or .bash_profile:
    nano ~/.bashrc
  • Add these lines at the end (adjust paths based on your CUDA version and installation directory):
    export PATH=/usr/local/cuda-<version>/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-<version>/lib64:$LD_LIBRARY_PATH
  • Save and exit (Ctrl+X, then Y, Enter).

Step 6: Apply Changes

  • Reload your .bashrc:
    source ~/.bashrc
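To confirm that the new variables took effect in the current shell, a quick check along these lines may help. It is only a sketch: the helper name has_entry is hypothetical, and CUDA_HOME is set to /usr/local/cuda-12.6, the install location the installer reports in this thread, so adjust it for other versions.

```python
import os


def has_entry(path_value: str, directory: str) -> bool:
    """Return True if `directory` is one of the entries in a PATH-style string."""
    wanted = os.path.normpath(directory)
    return any(
        os.path.normpath(entry) == wanted
        for entry in path_value.split(os.pathsep)
        if entry
    )


CUDA_HOME = "/usr/local/cuda-12.6"  # assumption: adjust to your installed version
for var, sub in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib64")):
    target = os.path.join(CUDA_HOME, sub)
    found = has_entry(os.environ.get(var, ""), target)
    print(f"{var}: {'ok' if found else 'missing ' + target}")
```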

Step 7: Verify Installation

  • Check if CUDA is correctly installed by running:
    nvcc --version
    This should print the version of the CUDA Toolkit you installed.

Step 8: Install NVIDIA Drivers (if not already installed)

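  • The .run installer bundles a driver and offers to install it in Step 4, so this step is only needed if you skipped that component. If you're not already using NVIDIA drivers, install them: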
    sudo apt update
    sudo apt install nvidia-driver-<version>  # Replace <version> with the appropriate driver version for your GPU
    sudo reboot
    After reboot, check if the driver is loaded with nvidia-smi.

Step 9: Test with a Simple CUDA Program

  • Create a simple CUDA program:
    cat <<'EOF' > hello.cu
    #include <stdio.h>
    
    __global__ void helloFromGPU(void) {
        printf("Hello World from GPU!\n");
    }
    
    int main(void) {
        helloFromGPU<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }
    EOF
  • Compile and run:
    nvcc hello.cu -o hello
    ./hello
    You should see "Hello World from GPU!" if everything is set up correctly.

This installation will prepare your system to use NVIDIA GPUs for CUDA-accelerated computing tasks, including machine learning with PyTorch, TensorFlow, or other CUDA-enabled libraries. Remember, CUDA versions need to match with your GPU capabilities and other software like cuDNN or TensorFlow, so always check for compatibility.
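Because so much depends on version matching, it can be handy to read the installed toolkit version programmatically. The sketch below assumes that nvcc --version prints a line containing "release MAJOR.MINOR", as current toolkits do; the function names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' text in nvcc's output."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_cuda_release() -> Optional[tuple[int, int]]:
    """Return the toolkit version reported by nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print(installed_cuda_release() or "nvcc not found on PATH")
```

A result of None usually means the PATH change from Step 5 has not been applied in the current shell.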

@jaylong255

[image: screenshot of the CUDA installer message discussed in the next comment]

@jaylong255

If you encounter this message during the CUDA Toolkit installation, it's because the installer has detected an existing NVIDIA driver installed through your package manager (like apt on Ubuntu). Here's how to proceed:

Option 1: Remove Existing Driver and Continue with CUDA's Driver

  • Remove the Existing Driver:

    sudo apt remove 'nvidia-*'  # quoted so the shell does not expand the pattern against local files
    sudo reboot
  • After reboot, resume the CUDA installation. This will let the CUDA installer handle the driver installation, ensuring compatibility with the CUDA version you're installing.

Option 2: Keep Existing Driver and Install Only the CUDA Toolkit

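  • If you prefer to keep your current NVIDIA drivers (perhaps because they're newer or better optimized for your system), you can choose Continue during the installation.

  • Note: Ensure your existing driver is compatible with the CUDA version you're installing. Check NVIDIA's compatibility matrix or documentation.

  • Post-Installation Steps:

    • Choosing Continue only dismisses the warning. In the component menu that follows, deselect the Driver entry so that only the CUDA Toolkit is installed; if Driver stays selected, the installer will try to install its bundled driver over the existing one. Alternatively, running the installer with --silent --toolkit should install just the toolkit non-interactively.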

General Advice:

  • Compatibility: Always check if the existing driver version matches or is compatible with the CUDA Toolkit version. Mismatches can lead to issues with CUDA applications or system instability.

  • Driver Updates: If you later decide to update your NVIDIA drivers, ensure you update CUDA as well if necessary, or risk version mismatches.

  • Driver source: Letting the CUDA installer supply the driver guarantees a matching driver and toolkit pair, which is why Option 1 is often the simpler route. Whichever option you choose, avoid mixing apt-installed and .run-installed drivers on the same system.

  • Documentation: Refer to NVIDIA's official CUDA installation guide for your specific version, as steps might vary slightly with each release.

Post-Installation Verification:

Regardless of which option you choose, after installation:

  • Verify CUDA with nvcc --version.
  • Check if your GPU is recognized with nvidia-smi.
  • If everything seems correct but you face issues later, remember that driver and CUDA version mismatches are common culprits.

By choosing the appropriate option, you can ensure a smooth installation of the CUDA Toolkit tailored to your needs or circumstances.
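The compatibility check for Option 2 comes down to comparing dotted version numbers. Here is a minimal sketch; the minimum of 525.60.13 for the CUDA 12.x series is an assumption that should be confirmed against the release notes for your exact toolkit version, and the function names are hypothetical.

```python
def version_tuple(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '560.35.03' into (560, 35, 3)."""
    return tuple(int(part) for part in version.split("."))


# Assumed minimum Linux driver for the CUDA 12.x series; verify this value
# in NVIDIA's release notes before relying on it.
MIN_DRIVER_CUDA_12 = "525.60.13"


def driver_is_new_enough(installed: str, minimum: str = MIN_DRIVER_CUDA_12) -> bool:
    """Compare versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


# 560.35.03 is the driver bundled with the installer used in this thread.
print(driver_is_new_enough("560.35.03"))
```

Comparing tuples of integers avoids the pitfalls of string comparison, where "99.0" would sort after "525.60.13".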

@jaylong255

jay@blast-Breaker:~$ sudo ./cuda_12.6.1_560.35.03_linux.run
[sudo] password for jay:
Installation failed. See log at /var/log/cuda-installer.log for details.
jay@blast-Breaker:~$ cat /var/log/cuda-installer.log
[INFO]: Driver installation detected by command: apt list --installed | grep -e nvidia-driver-[0-9][0-9][0-9] -e nvidia-[0-9][0-9][0-9]
[INFO]: Cleaning up window
[INFO]: Complete
[INFO]: Checking compiler version...
[INFO]: gcc location: /usr/bin/gcc

[INFO]: gcc version: gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)

[INFO]: Initializing menu
[INFO]: nvidia-fs.setKOVersion(2.22.3)
[INFO]: Setup complete
[INFO]: Installing: Driver
[INFO]: Installing: 560.35.03
[INFO]: Executing NVIDIA-Linux-x86_64-560.35.03.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed. Consult the driver log at /var/log/nvidia-installer.log for more details.
[ERROR]: Install of 560.35.03 failed, quitting
jay@blast-Breaker:~$
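When the installer log is long, pulling out only the error lines makes the failure easier to spot. The sketch below assumes the "[LEVEL]: message" line shape seen in the log above; the function name is hypothetical, and the sample reuses three lines from that log.

```python
import re

# Matches lines such as "[ERROR]: Install of 560.35.03 failed, quitting".
LOG_LINE = re.compile(r"^\[(?P<level>[A-Z]+)\]: (?P<message>.*)$")


def errors_in(log_text: str) -> list[str]:
    """Return the messages of all [ERROR] lines, in order of appearance."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            errors.append(match.group("message"))
    return errors


sample = """\
[INFO]: Installing: 560.35.03
[INFO]: Finished with code: 256
[ERROR]: Install of 560.35.03 failed, quitting
"""
print(errors_in(sample))
# ['Install of 560.35.03 failed, quitting']
```

To run it against the real file, pass in the contents of /var/log/cuda-installer.log (reading it may require root).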

@jaylong255

From the log, it appears that the installation of the CUDA Toolkit failed specifically during the driver installation phase. Here's what you can do next:

Step 1: Check Driver Installation Logs

  • The log mentions another log file for more detailed information:
    cat /var/log/nvidia-installer.log
    • Look for specific errors or messages that might give clues about why the driver failed to install.

Step 2: Remove Existing NVIDIA Drivers

  • As previously suggested, the presence of existing NVIDIA drivers might be causing the issue. Remove them:
    sudo apt remove --purge 'nvidia-*'  # quoted so the shell does not expand the pattern against local files
    sudo apt autoremove
    sudo reboot

Step 3: Install Required Dependencies

  • Ensure all necessary packages are installed:
    sudo apt update
    sudo apt install -y build-essential dkms libssl-dev

Step 4: Rerun the CUDA Installer

  • Try running the CUDA installer again:
    sudo ./cuda_12.6.1_560.35.03_linux.run

Step 5: If Still Fails

  • Check Kernel Compatibility: Ensure your kernel version is compatible with the CUDA driver. You might need to install a specific kernel version or use a driver version that supports your kernel.

    • Check your kernel version:
      uname -r
    • NVIDIA's website has a compatibility matrix; check if your kernel version is listed.
  • Alternative Driver Installation: Sometimes downloading and manually installing the NVIDIA driver from NVIDIA's website before CUDA can help.

    • Download the appropriate .run file for your GPU from NVIDIA's driver download page.
    • Install it:
      sudo ./NVIDIA-Linux-x86_64-<version>.run
    • Follow prompts, then reboot.

Step 6: Verify Installation

  • After successfully installing (either through the CUDA installer or manually):
    • Reboot your system:
      sudo reboot
    • Then, verify:
      nvidia-smi
      • This should list your GPU and driver version if successful.

Additional Tips:

  • Backup: Before removing drivers or kernels, backup important data.

  • Blacklist Nouveau: Sometimes, the Nouveau open-source driver interferes:

    • Open or create /etc/modprobe.d/blacklist-nouveau.conf:
      sudo nano /etc/modprobe.d/blacklist-nouveau.conf
      Add:
      blacklist nouveau
      options nouveau modeset=0
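    • Then rebuild the initramfs so the blacklist takes effect at boot (sudo update-initramfs -u) and reboot.
  • Check Logs Again: If you encounter issues after following these steps, re-examine the logs, especially /var/log/syslog and the kernel log (sudo dmesg or journalctl -k), for any NVIDIA or kernel-related errors post-reboot.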

If these steps don't resolve the issue, consider checking NVIDIA's forums or support resources for any known issues with your specific hardware or kernel version. Sometimes, specific hardware configurations or kernel versions might have unique requirements or bugs.
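One quick way to see whether Nouveau or the NVIDIA module is active after a reboot is to look at the kernel's module list. This sketch reads /proc/modules, where each line starts with a module name; the helper names are hypothetical, and on a system without that file it simply reports both modules as not loaded.

```python
from pathlib import Path


def module_loaded(name: str, modules_text: str) -> bool:
    """Return True if `name` is the first field of any line in /proc/modules text."""
    return any(
        line.split()[0] == name
        for line in modules_text.splitlines()
        if line.strip()
    )


def loaded_modules_text() -> str:
    """Read the kernel's module list, or return an empty string if unavailable."""
    proc_modules = Path("/proc/modules")
    return proc_modules.read_text() if proc_modules.exists() else ""


text = loaded_modules_text()
for module in ("nouveau", "nvidia"):
    print(f"{module}: {'loaded' if module_loaded(module, text) else 'not loaded'}")
```

If nouveau still shows as loaded after blacklisting and rebooting, the blacklist file was probably not picked up at boot.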

@jaylong255

jay@blast-Breaker:~$ sudo ./cuda_12.6.1_560.35.03_linux.run

= Summary =

Driver: Installed
Toolkit: Installed in /usr/local/cuda-12.6/

Please make sure that

  • PATH includes /usr/local/cuda-12.6/bin
  • LD_LIBRARY_PATH includes /usr/local/cuda-12.6/lib64, or, add /usr/local/cuda-12.6/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.6/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log
