RHEL AI POC

This repository contains the code and documentation for creating RHEL AI Proofs of Concept (POCs). It aims to be straightforward and to follow the available documentation for RHEL AI and InstructLab. The goal is to provide a simple, easy-to-follow guide for creating successful and reproducible POCs that demonstrate the capabilities of RHEL AI.

Much of what is here covers nuances and gotchas that may not be addressed in the main documentation. Some scripts and automation are included, but many of these are expected to be removed as their features are integrated into the main RHEL AI product.

Table of Contents

- Suggested documentation
- Prerequisites
- Installation
  - Official Installation Documentation
  - Bare Metal (TBD)
  - IBM Cloud (TBD)
  - AWS
    - Creating an AWS AMI
    - Launching an AWS Instance
- Initial Configuration

Document Collection

PDFs

The most common format for custom model training data is PDF. Add your PDFs to the document_collection directory. During the data preparation step, these PDFs will be converted to markdown suitable for RHEL AI synthetic data generation. By default, the markdown is broken into chunks and written to the output folder. You will refer to these generated markdown files from your "qna.yaml" file.

Markdown

If your data is already in markdown format, you can refer to the markdown documents directly in the qna.yaml.

qna.yaml

Official Documentation - Customizing your taxonomy tree

The qna.yaml file is a simple YAML file containing questions and answers about a document. It is used in the synthetic data generation process to create the training data for the model. Each qna.yaml should be placed in the appropriate directory structure in the taxonomy folder (e.g. taxonomy/knowledge/my_subject/qna.yaml). At the end of this process, you should be able to copy this directory structure as needed for the POC, either to upload to the cloud service or to the RHEL AI instance used in the POC.
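A sketch of the expected layout (the subject name is a placeholder):

taxonomy/
└── knowledge/
    └── my_subject/
        └── qna.yaml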

TBD qna.yaml best practices and gotchas.

Evaluation Questions

It is also useful to obtain a set of human-generated questions and answers for judging the model. These questions are held separate from the model training data, and together with their "ground truth" answers they are used in our evaluation process to judge how the model is doing. You can format these questions as a CSV file with the columns "question" and "ground_truth" and place them in the eval/qna directory (a minimal CSV example follows the YAML snippet below). Alternatively, you can use the same format as a qna.yaml, which is useful if you would like to use a qna.yaml for evaluation as a trial run or stand-in:

seed_examples:
  - questions_and_answers:
      - question: >
          relevant question
        answer: >
          reference / ground truth answer
      - question: >
          relevant question
        answer: >
          reference / ground truth answer
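For the CSV option, a minimal file might look like this (the contents are illustrative, not from a real POC):

question,ground_truth
"What topics does section 3 of the manual cover?","Section 3 covers safety procedures."
"How long is the standard warranty?","The standard warranty is 12 months."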

Data Preparation

PDF to Markdown Conversion

RHEL AI Official Doc

Since RHEL AI needs the knowledge data in markdown form for synthetic data generation (SDG), the first step is to convert the PDFs to a format that is easily digestible. For this we will use Docling to convert the PDFs to chunks of markdown text. The code to do so is a work in progress, so for now we've taken the current versions of the SDG conversion scripts and added them here as a workaround.

Python script

The document_chunker.py Python script converts PDFs to chunks of markdown ready for SDG. Enable your Python environment and switch to the data_preparation folder:

source venv/bin/activate
cd data_preparation

Then install the requirements and run the script, for example:

pip install -r requirements.txt
python document_chunker.py --input-dir document_collection --output-dir output

Notebook

If you would like to see the steps of the process, or to customize it, you can convert the PDFs using a simple notebook: data_preparation/pdf_conversion.ipynb.

Run the cells in the notebook to convert all PDFs in the document_collection directory to markdown. The notebook will create an output directory for the converted markdown files, the copied taxonomy, and some intermediate files. The markdown is broken up into chunks to provide better context for the synthetic data generation process, and the resulting files are named document_1.md, document_2.md, and so on.
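For orientation, here is a minimal sketch of the underlying conversion step, assuming Docling's DocumentConverter API; the repo's script and notebook add chunking and taxonomy handling on top of this.

# Minimal sketch: convert each PDF in document_collection to markdown.
# Illustrative only; use document_chunker.py or the notebook for the real workflow.
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
for pdf in Path("document_collection").glob("*.pdf"):
    result = converter.convert(pdf)
    out = Path("output") / f"{pdf.stem}.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(result.document.export_to_markdown())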

Commit the changes

Commit the contents of the output directory and push them to the repository. Take note of the markdown file locations and the commit hash; you will need both for the document references in your qna.yaml files.

Update your qna.yaml files

Official Documentation - Creating a knowledge YAML file

For the copied taxonomy in output/taxonomy, you need to refer to your output markdown files and the commit hash in the document section.

If the repository is public, simply refer to the repo URL, commit hash, and the path to markdown files in the qna.yaml file. For example:

document:
  repo: 'https://github.com/user/poc-repo'
  commit: 50e47897b5d5bb359504618fa33a83110e87f5f8
  patterns:
    - 'data_preparation/output/knowledge/my_subject/md/*.md'

After updating your qna.yaml files in this project, you can test them out like so:

ilab taxonomy diff --taxonomy-path ./output/taxonomy

After you've validated the qna.yaml files, you can commit them to the taxonomy repository and push.

NOTE: In the case of a private repository, or a repo that only exists on the RHEL AI server, you can refer to the documents using the file:// protocol. For example:

document:
  repo: 'file:///home/example-user/poc-repo'
  commit: 50e47897b5d5bb359504618fa33a83110e87f5f8
  patterns:
    - 'data_preparation/output/knowledge/my_subject/md/*.md'

Update the taxonomy on the RHEL AI instance with the new qna.yaml file.

Now you can copy the output taxonomy folder and place it in the appropriate location on the RHEL AI instance.

rm -rf ~/.local/share/instructlab/taxonomy
cp -r <poc-repo>/data_preparation/output/taxonomy ~/.local/share/instructlab/

After the files are copied, you can do a final check on the taxonomy before proceeding.

ilab taxonomy diff

Synthetic Data Generation

Now that we've finished preparing the documents and taxonomy, we're ready to generate the synthetic data. This is done with the ilab data generate command, which takes the markdown files and the qna.yaml files and generates the synthetic data for training the model.

IBM Cloud

TBD

RHEL AI Cluster

Run the command in a background task or a tmux session to make sure it runs to completion. The invocation below is customized so that each of the previously generated markdown chunks is consumed whole during the SDG process.

ilab data generate --chunk-word-count 10000 --server-ctx-size 16384

If you want to run it in the background so the process does not get interrupted:

nohup ilab data generate --chunk-word-count 10000 --server-ctx-size 16384 > generate.log 2>&1 &
tail -f generate.log

The synthetic data will be written to the ~/.local/share/instructlab/datasets/ directory in files named skills_train_msgs...jsonl and knowledge_train_msgs...jsonl. These files are used to train the model in the next step.

Model Training

IBM Cloud

TBD

RHEL AI Cluster

ilab model train --strategy lab-multiphase \
  --phased-phase1-data ~/.local/share/instructlab/datasets/knowledge_train_msgs_2024-11-08T22_55_40.jsonl \
  --phased-phase2-data ~/.local/share/instructlab/datasets/skills_train_msgs_2024-11-08T22_55_40.jsonl   

When running ilab model train as a background process, add -y to skip the interactive prompts so the process does not stall. For example:

nohup ilab model train -y \
  --strategy lab-multiphase \
  --phased-phase1-data ~/.local/share/instructlab/datasets/knowledge_train_msgs_2024-11-08T22_55_40.jsonl \
  --phased-phase2-data ~/.local/share/instructlab/datasets/skills_train_msgs_2024-11-08T22_55_40.jsonl \
  > training.log 2>&1 &

tail -f training.log

Deployment and Testing

To test and evaluate the model for demo purposes, we first need to serve it. After training, you will see a message in the training output like the following:

Training finished! Best final checkpoint: <path-to-best-performed-checkpoint> with score: 6.968152866242038

This is the model checkpoint we'll serve.

RHEL AI Cluster

Official Documentation - Serving and chatting with your new model

Once the model training is complete, serve the best checkpoint:

ilab model serve --model-path <path-to-best-performed-checkpoint>
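With the model served, you can verify it interactively from another terminal using InstructLab's chat command:

ilab model chat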

OpenShift AI Serving

Official Documentation - Serving large models with OpenShift AI

If you choose to serve the model with OpenShift AI, you gain the flexibility to expose the model as an endpoint secured by an API key. You will also be able to build applications (such as a RAG chatbot) on top of it for demo purposes.

Saving the model to object storage

To serve the model with OpenShift AI, you need to save it to an object storage bucket that is compatible with the S3 API. You can edit and use the Python script misc/s3_uploader if you wish.
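If you prefer not to use the script, here is a minimal upload sketch using boto3; the endpoint, credentials, bucket, and checkpoint path are all placeholders you must supply.

# Hedged sketch: upload a model checkpoint directory to S3-compatible storage.
# All values below are placeholders; adapt them or use misc/s3_uploader instead.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",  # placeholder endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

checkpoint_dir = "/path/to/best-checkpoint"  # placeholder path
bucket = "models"                            # placeholder bucket name

# Walk the checkpoint directory and upload every file, preserving relative paths.
for root, _, files in os.walk(checkpoint_dir):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, checkpoint_dir)
        s3.upload_file(local_path, bucket, f"finetuned/{key}")
        print(f"uploaded {key}")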

KServe and vLLM

Once the model is in storage, you can add an OpenShift AI data connection pointing to the storage bucket. With the connection established, you can deploy the model with the vLLM serving image. The model will be served as an endpoint that you can access with an API key.
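Since vLLM exposes an OpenAI-compatible API, a quick smoke test of the deployed endpoint might look like the sketch below; the URL, key, and model name are placeholders matching your deployment, not values from this repo.

# Hedged sketch: smoke-test an OpenAI-compatible model endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://<ENDPOINT_URL>/v1",  # placeholder; your KServe/vLLM route
    api_key="<API_KEY>",                   # placeholder
)

response = client.chat.completions.create(
    model="finetuned",  # placeholder model name from your deployment
    messages=[{"role": "user", "content": "Ask a question from your eval set here."}],
)
print(response.choices[0].message.content)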

Evaluation

Once the endpoint is served, you can access it from the evaluation notebooks. The notebook eval/llm_judge_eval includes basic tests of the models and an evaluation of the models using an LLM as a judge. There is also an experimental Ragas notebook, eval/ragas_eval.

Before running the notebooks, configure a YAML file, eval/llm_config.yaml; you can use eval/llm_config_example.yaml as a guide. This file contains the endpoint and API key for each model you are testing.

name: gpt-4-eval
judge:
  model_name: gpt-4
  api_key: <API_KEY>
  template: |
    <judge_prompt_template>
testing_config:
  - name: finetuned
    endpoint_url: <ENDPOINT_URL>
    model_name: finetuned
    api_key: <API_KEY>
    qna_template: |
      <prompt_template>
    rag_template: |
      <rag_prompt_template>
  - name: comparison
    endpoint_url: <ENDPOINT_URL>
    model_name: plain-llm
    api_key: <API_KEY>
    rag_template: |
      <rag_prompt_template>

Once your llm_config.yaml is configured, you can run the evaluation notebooks. The notebooks evaluate the models against the questions and answers in the eval/qna directory and output the evaluation results and model performance in CSV and Excel format.
