Use new API in generate_data

This makes use of the new SDG API under the generate_data() method used by the CLI. It uses new simple workflows for knowledge and skills that are intended for basic use with a small model for testing and demo purposes. The full pipelines provided in the library will only work in larger environments capable of running Mixtral-8x7b.

There are still various TODOs in the code, but this is enough to start with. I'm sure we will make enhancements to these basic workflows that still work for the small environments.

Signed-off-by: Russell Bryant <[email protected]>
instructlab · Jun 28, 2024 · 9ee7f70
1 parent 0207156
Showing 11 changed files with 385 additions and 782 deletions.
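The refactor in default_flows.py follows a base-class-plus-override pattern: `_SimpleFlow` builds a one-block flow with empty `block_name` and `config_path`, and each subclass fills those two fields in. The following is a minimal, self-contained sketch of that pattern. The `Flow` stand-in, its constructor arguments, the string used for `block_type`, and the `demo-model` id are assumptions made for illustration; the real classes take more configuration and resolve paths with `resources.files(__package__)`.

```python
from abc import ABC, abstractmethod


class Flow(ABC):
    """Simplified stand-in for the library's Flow base class (hypothetical)."""

    def __init__(self, client, model_id):
        self.client = client
        self.model_id = model_id

    @abstractmethod
    def get_flow(self) -> list:
        ...


class _SimpleFlow(Flow):
    def get_flow(self) -> list:
        # A fresh list is built on every call, so subclasses can mutate it safely.
        return [
            {
                "block_type": "LLMBlock",  # placeholder string; the real code passes the class
                "block_config": {
                    "block_name": "",  # must be set by subclass
                    "config_path": "",  # must be set by subclass
                    "client": self.client,
                    "model_id": self.model_id,
                },
            }
        ]


class SimpleKnowledgeFlow(_SimpleFlow):
    def get_flow(self) -> list:
        flow = super().get_flow()
        # Relative path used here for brevity; the diff joins it onto the package directory.
        flow[0]["block_config"]["config_path"] = "configs/knowledge/simple_generate_qa.yaml"
        flow[0]["block_config"]["block_name"] = "gen_knowledge"
        return flow


knowledge_flow = SimpleKnowledgeFlow(client=None, model_id="demo-model").get_flow()
base_flow = _SimpleFlow(client=None, model_id="demo-model").get_flow()
print(knowledge_flow[0]["block_config"]["block_name"])
```

Because the base method returns a new list each time, the override in one subclass does not leak into another; only the two subclass-specific fields differ between the knowledge and skill variants.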
diff --git a/.pylintrc b/.pylintrc
@@ -444,7 +444,6 @@ disable=raw-checker-failed,
         logging-too-many-args,
         attribute-defined-outside-init,
         abstract-method,
-        pointless-statement,
         wrong-import-order,
         line-too-long,
         logging-fstring-interpolation

diff --git a/pyproject.toml b/pyproject.toml
@@ -95,7 +95,6 @@ known-local-folder = ["tuning"]
 disable_error_code = ["import-not-found", "import-untyped"]
 exclude = [
     "^src/instructlab/sdg/generate_data\\.py$",
-    "^src/instructlab/sdg/utils/openai\\.py$",
     "^src/instructlab/sdg/utils/taxonomy\\.py$",
     "^src/instructlab/sdg/default_flows\\.py$",
     "^src/instructlab/sdg/llmblock\\.py$",

diff --git a/src/instructlab/sdg/configs/knowledge/simple_generate_qa.yaml b/src/instructlab/sdg/configs/knowledge/simple_generate_qa.yaml
@@ -28,10 +28,7 @@ examples: |
   {document}
 
 generation: |
-  Provide a single question and answer pair based on the document:
-
-  Document:
-  {{document}}
+  Provide a single question and answer pair based on the document.
 
 start_tags: [""]
 end_tags: [""]
diff --git a/src/instructlab/sdg/configs/skills/simple_generate_qa_freeform.yaml b/src/instructlab/sdg/configs/skills/simple_generate_qa_freeform.yaml
@@ -0,0 +1,33 @@
+system: You are a very knowledgeable AI Assistant that will faithfully assist the user with their task.
+
+introduction: Develop a series of question and answer pairs to perform a task.
+
+principles: |
+  Here are the requirements:
+  1. Try not to repeat the verb for each instruction to maximize diversity.
+  2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instructions.
+  3. The type of instructions should be similar to provided examples. The generated instruction and the output should be grounded in the provided document.
+  4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
+  5. The instructions should be in English.
+  6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
+  7. The output should be an appropriate response to the input and the instruction. Long outputs are preferable.
+
+examples: |
+  The task is {task_description}.
+
+  Here are some examples to help you understand the type of questions that are asked for:
+
+  {question_1}
+  {response_1}
+
+  {question_2}
+  {response_2}
+
+  {question_3}
+  {response_3}
+
+generation: |
+  Provide a single question and answer pair based on the examples.
+
+start_tags: [""]
+end_tags: [""]
diff --git a/src/instructlab/sdg/configs/skills/simple_generate_qa_grounded.yaml b/src/instructlab/sdg/configs/skills/simple_generate_qa_grounded.yaml
@@ -0,0 +1,37 @@
+system: You are a very knowledgeable AI Assistant that will faithfully assist the user with their task.
+
+introduction: Develop a series of question and answer pairs to perform a task.
+
+principles: |
+  Here are the requirements:
+  1. Try not to repeat the verb for each instruction to maximize diversity.
+  2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instructions.
+  3. The type of instructions should be similar to provided examples. The generated instruction and the output should be grounded in the provided document.
+  4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
+  5. The instructions should be in English.
+  6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
+  7. The output should be an appropriate response to the input and the instruction. Long outputs are preferable.
+
+examples: |
+  The task is {task_description}.
+
+  Here is some context for the example questions:
+
+  {context}
+
+  Here are some examples to help you understand the type of questions that are asked for:
+
+  {question_1}
+  {response_1}
+
+  {question_2}
+  {response_2}
+
+  {question_3}
+  {response_3}
+
+generation: |
+  Provide a single question and answer pair based on the examples.
+
+start_tags: [""]
+end_tags: [""]
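The `examples` field in the grounded skill config above is a template with named placeholders such as `{task_description}` and `{context}`. As a rough sketch of how such a template could be filled from one seed example, the snippet below uses plain `str.format`. This is an assumption for illustration only: the diff does not show how the library performs the substitution, and the `sample` values are made up.

```python
# Template text copied from the "examples" field of simple_generate_qa_grounded.yaml.
EXAMPLES_TEMPLATE = """\
The task is {task_description}.

Here is some context for the example questions:

{context}

Here are some examples to help you understand the type of questions that are asked for:

{question_1}
{response_1}

{question_2}
{response_2}

{question_3}
{response_3}
"""

# Hypothetical seed example; every value here is invented for illustration.
sample = {
    "task_description": "answering questions about a short passage",
    "context": "The cafe opens at 8am and closes at 6pm on weekdays.",
    "question_1": "When does the cafe open?",
    "response_1": "It opens at 8am.",
    "question_2": "When does the cafe close?",
    "response_2": "It closes at 6pm.",
    "question_3": "Tell me which days these hours apply to.",
    "response_3": "They apply on weekdays.",
}

# ASSUMPTION: plain str.format substitution; the library may do this differently.
prompt = EXAMPLES_TEMPLATE.format(**sample)
print(prompt)
```

The freeform variant uses the same placeholders minus `{context}`, so the same dictionary would fill it as well, since `str.format` ignores unused keyword arguments.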
diff --git a/src/instructlab/sdg/default_flows.py b/src/instructlab/sdg/default_flows.py
@@ -40,17 +40,15 @@ def get_flow(self) -> list:
         pass
 
 
-class SimpleKnowledgeFlow(Flow):
+class _SimpleFlow(Flow):
     def get_flow(self) -> list:
         sdg_base = resources.files(__package__)
         return [
             {
                 "block_type": LLMBlock,
                 "block_config": {
-                    "block_name": "gen_knowledge",
-                    "config_path": os.path.join(
-                        sdg_base, "configs/knowledge/simple_generate_qa.yaml"
-                    ),
+                    "block_name": "",  # must be set by subclass
+                    "config_path": "",  # must be set by subclass
                     "client": self.client,
                     "model_id": self.model_id,
                     "model_prompt": _get_model_prompt(self.model_family),
@@ -68,6 +66,39 @@ def get_flow(self) -> list:
         ]
 
 
+class SimpleKnowledgeFlow(_SimpleFlow):
+    def get_flow(self) -> list:
+        flow = super().get_flow()
+        sdg_base = resources.files(__package__)
+        flow[0]["block_config"]["config_path"] = os.path.join(
+            sdg_base, "configs/knowledge/simple_generate_qa.yaml"
+        )
+        flow[0]["block_config"]["block_name"] = "gen_knowledge"
+        return flow
+
+
+class SimpleFreeformSkillFlow(_SimpleFlow):
+    def get_flow(self) -> list:
+        flow = super().get_flow()
+        sdg_base = resources.files(__package__)
+        flow[0]["block_config"]["config_path"] = os.path.join(
+            sdg_base, "configs/skills/simple_generate_qa_freeform.yaml"
+        )
+        flow[0]["block_config"]["block_name"] = "gen_skill_freeform"
+        return flow
+
+
+class SimpleGroundedSkillFlow(_SimpleFlow):
+    def get_flow(self) -> list:
+        flow = super().get_flow()
+        sdg_base = resources.files(__package__)
+        flow[0]["block_config"]["config_path"] = os.path.join(
+            sdg_base, "configs/skills/simple_generate_qa_grounded.yaml"
+        )
+        flow[0]["block_config"]["block_name"] = "gen_skill_grounded"
+        return flow
+
+
 class MMLUBenchFlow(Flow):
     def get_flow(self) -> list:
         sdg_base = resources.files(__package__)