Introduce v3 schema version to support new knowledge format #38

Closed
Tracked by #160
russellb opened this issue Jul 17, 2024 · 0 comments

This is part of: instructlab/sdg#160

This is an example of the new format: https://github.com/instructlab/taxonomy/blob/7729fcd62ca68e36225a98a954e702734cc09ae1/knowledge/science/anatomy/tonsils/qna.yaml

The key changes are:

  • Sample q&a pairs now have context associated with them.
  • A new document_outline field has been added.
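
To make the two changes concrete, here is a minimal Python sketch of the v3 shape. The field names (`context`, `questions_and_answers`, `document_outline`, `seed_examples`) come from the schema diff in the commits referencing this issue. The helper `make_seed_example`, the variable names, and all placeholder strings are hypothetical and only for illustration. The other required top-level fields are left out for brevity.

```python
import json


def make_seed_example(context, pairs):
    """Group (question, answer) tuples under the context they were drawn from.

    Hypothetical helper for illustration; not part of instructlab-schema.
    """
    return {
        "context": context,
        "questions_and_answers": [
            {"question": question, "answer": answer}
            for question, answer in pairs
        ],
    }


# Placeholder content only; see the linked qna.yaml for a real example.
seed_example = make_seed_example(
    "Placeholder: a passage copied from the knowledge document.",
    [
        ("Placeholder question 1?", "Placeholder answer 1."),
        ("Placeholder question 2?", "Placeholder answer 2."),
        ("Placeholder question 3?", "Placeholder answer 3."),
    ],
)

# Only the fields discussed in this issue are shown here.
knowledge_fragment = {
    "document_outline": "Placeholder: a short outline of the document.",
    "seed_examples": [seed_example],
}

print(json.dumps(knowledge_fragment, indent=2))
```

In v2 each seed example was a single flat question/answer pair; in this sketch the pairs hang off the context they refer to.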

russellb self-assigned this Jul 17, 2024
russellb added this to the v0.3.0 milestone Jul 17, 2024
russellb added a commit to russellb/instructlab-schema that referenced this issue Jul 17, 2024
Closes instructlab#38

v3 includes some backwards-incompatible changes to the knowledge
schema format. Here is a diff against v2. The changes are:

- Q&A pairs now have an associated context blob from the knowledge
  document.

- There is a new "document_outline" field.

```diff
--- src/instructlab/schema/v2/knowledge.json    2024-07-17 12:56:37
+++ src/instructlab/schema/v3/knowledge.json    2024-07-17 13:14:56
@@ -8,7 +8,8 @@
         "domain",
         "task_description",
         "seed_examples",
-        "document"
+        "document",
+        "document_outline"
     ],
     "unevaluatedProperties": false,
     "properties": {
@@ -44,20 +45,37 @@
             "items": {
                 "type": "object",
                 "required": [
-                    "question",
-                    "answer"
+                    "context",
+                    "questions_and_answers"
                 ],
                 "unevaluatedProperties": false,
                 "properties": {
-                    "question": {
-                        "description": "A question used for synthetic data generation.",
+                    "context": {
+                        "description": "A context used for synthetic data generation.",
                         "type": "string",
                         "minLength": 1
                     },
-                    "answer": {
-                        "description": "The desired response for the question.",
-                        "type": "string",
-                        "minLength": 1
+                    "questions_and_answers": {
+                        "type": "array",
+                        "items": {
+                            "type": "object",
+                            "required": [
+                                "question",
+                                "answer"
+                            ],
+                            "properties": {
+                                "question": {
+                                    "description": "A question used for synthetic data generation.",
+                                    "type": "string",
+                                    "minLength": 1
+                                },
+                                "answer": {
+                                    "description": "The desired response for the question.",
+                                    "type": "string",
+                                    "minLength": 1
+                                }
+                            }
+                        }
                     }
                 }
             }
@@ -104,6 +122,11 @@
                     }
                 }
             }
+        },
+        "document_outline": {
+            "description": "An outline of the document.",
+            "type": "string",
+            "minLength": 1
         }
     }
 }
```

Signed-off-by: Russell Bryant <[email protected]>
russellb added a commit to russellb/instructlab-schema that referenced this issue Jul 17, 2024
Closes instructlab#38

v3 includes some backwards-incompatible changes to the knowledge
schema format. Here is a diff against v2. The changes are:

- Q&A pairs now have an associated context blob from the knowledge
  document.

- There is a new "document_outline" field.

```diff
--- src/instructlab/schema/v2/knowledge.json    2024-07-17 12:56:37
+++ src/instructlab/schema/v3/knowledge.json    2024-07-17 13:14:56
@@ -8,7 +8,8 @@
         "domain",
         "task_description",
         "seed_examples",
-        "document"
+        "document",
+        "document_outline"
     ],
     "unevaluatedProperties": false,
     "properties": {
@@ -44,20 +45,39 @@
             "items": {
                 "type": "object",
                 "required": [
-                    "question",
-                    "answer"
+                    "context",
+                    "questions_and_answers"
                 ],
                 "unevaluatedProperties": false,
                 "properties": {
-                    "question": {
-                        "description": "A question used for synthetic data generation.",
+                    "context": {
+                        "description": "A context used for synthetic data generation.",
                         "type": "string",
                         "minLength": 1
                     },
-                    "answer": {
-                        "description": "The desired response for the question.",
-                        "type": "string",
-                        "minLength": 1
+                    "questions_and_answers": {
+                        "type": "array",
+                        "minItems": 3,
+                        "uniqueItems": true,
+                        "items": {
+                            "type": "object",
+                            "required": [
+                                "question",
+                                "answer"
+                            ],
+                            "properties": {
+                                "question": {
+                                    "description": "A question used for synthetic data generation.",
+                                    "type": "string",
+                                    "minLength": 1
+                                },
+                                "answer": {
+                                    "description": "The desired response for the question.",
+                                    "type": "string",
+                                    "minLength": 1
+                                }
+                            }
+                        }
                     }
                 }
             }
@@ -104,6 +124,11 @@
                     }
                 }
             }
+        },
+        "document_outline": {
+            "description": "An outline of the document.",
+            "type": "string",
+            "minLength": 1
         }
     }
 }
```

Signed-off-by: Russell Bryant <[email protected]>
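
The per-example constraints in the diff above can be sketched as a small hand-written check in Python. This is not how the schema is enforced in practice (a JSON Schema validator does that); the function name `check_seed_example` and the message strings are hypothetical. Comparing items by canonical JSON is an approximation of `uniqueItems` that is adequate for objects holding only strings.

```python
import json


def check_seed_example(example):
    """Return a list of problems found in one v3 seed example (empty if none)."""
    problems = []

    # "unevaluatedProperties": false on the seed example object.
    extra = sorted(set(example) - {"context", "questions_and_answers"})
    if extra:
        problems.append(f"unexpected properties: {', '.join(extra)}")

    context = example.get("context")
    if not isinstance(context, str) or len(context) < 1:
        problems.append("context must be a non-empty string")

    qna = example.get("questions_and_answers")
    if not isinstance(qna, list):
        problems.append("questions_and_answers must be an array")
        return problems

    if len(qna) < 3:  # "minItems": 3
        problems.append("questions_and_answers needs at least 3 items")

    # "uniqueItems": true, approximated by comparing canonical JSON
    # so that key order does not matter.
    canonical = [json.dumps(item, sort_keys=True) for item in qna]
    if len(set(canonical)) != len(canonical):
        problems.append("questions_and_answers items must be unique")

    for index, item in enumerate(qna):
        if not isinstance(item, dict):
            problems.append(f"item {index}: must be an object")
            continue
        for key in ("question", "answer"):
            value = item.get(key)
            if not isinstance(value, str) or len(value) < 1:
                problems.append(f"item {index}: {key} must be a non-empty string")

    return problems
```

A seed example with a context and three distinct pairs yields an empty list; one with only two pairs yields the `minItems` message.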
russellb added a commit to russellb/instructlab-schema that referenced this issue Jul 17, 2024
Closes instructlab#38

v3 includes some backwards-incompatible changes to the knowledge
schema format. Here is a diff against v2. The changes are:

- Q&A pairs now have an associated context blob from the knowledge
  document.

- There is a new `document_outline` field.

- The `task_description` field is dropped.

```diff

--- src/instructlab/schema/v2/knowledge.json	2024-07-17 12:56:37
+++ src/instructlab/schema/v3/knowledge.json	2024-07-17 15:38:30
@@ -6,9 +6,9 @@
     "required": [
         "created_by",
         "domain",
-        "task_description",
         "seed_examples",
-        "document"
+        "document",
+        "document_outline"
     ],
     "unevaluatedProperties": false,
     "properties": {
@@ -27,15 +27,6 @@
                 "Pop culture"
             ]
         },
-        "task_description": {
-            "description": "A description of the task which is used in prompts to the teacher model during synthetic data generation. The description should be detailed and prescriptive to improve the teacher model's responses.",
-            "type": "string",
-            "minLength": 1,
-            "examples": [
-                "To teach a language model about softball history",
-                "To teach a language model about tabby cats"
-            ]
-        },
         "seed_examples": {
             "description": "An array of seed examples for synthetic data generation.",
             "type": "array",
@@ -44,20 +35,39 @@
             "items": {
                 "type": "object",
                 "required": [
-                    "question",
-                    "answer"
+                    "context",
+                    "questions_and_answers"
                 ],
                 "unevaluatedProperties": false,
                 "properties": {
-                    "question": {
-                        "description": "A question used for synthetic data generation.",
+                    "context": {
+                        "description": "Context from the document associated with this set of sample q&a pairs.",
                         "type": "string",
                         "minLength": 1
                     },
-                    "answer": {
-                        "description": "The desired response for the question.",
-                        "type": "string",
-                        "minLength": 1
+                    "questions_and_answers": {
+                        "type": "array",
+                        "minItems": 3,
+                        "uniqueItems": true,
+                        "items": {
+                            "type": "object",
+                            "required": [
+                                "question",
+                                "answer"
+                            ],
+                            "properties": {
+                                "question": {
+                                    "description": "A question used for synthetic data generation.",
+                                    "type": "string",
+                                    "minLength": 1
+                                },
+                                "answer": {
+                                    "description": "The desired response for the question.",
+                                    "type": "string",
+                                    "minLength": 1
+                                }
+                            }
+                        }
                     }
                 }
             }
@@ -104,6 +114,11 @@
                     }
                 }
             }
+        },
+        "document_outline": {
+            "description": "An outline of the document.",
+            "type": "string",
+            "minLength": 1
         }
     }
 }
```

Signed-off-by: Russell Bryant <[email protected]>
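
The top-level effect of this revision can be sketched in Python as follows. The required list is copied from the first hunk of the diff above. The names `V3_REQUIRED` and `check_top_level` and the message strings are hypothetical. Flagging `task_description` assumes that no other part of the schema (not shown in the diff) still evaluates that property, so that `unevaluatedProperties: false` rejects it.

```python
# Required top-level fields in v3, as listed in the diff above.
V3_REQUIRED = (
    "created_by",
    "domain",
    "seed_examples",
    "document",
    "document_outline",
)


def check_top_level(doc):
    """Return a list of top-level problems for a v3 knowledge document."""
    problems = [
        f"missing required field: {name}"
        for name in V3_REQUIRED
        if name not in doc
    ]
    # Assumption: the removed property is rejected rather than ignored.
    if "task_description" in doc:
        problems.append("task_description was removed in v3")
    return problems


# A v2-style document (values are placeholders) checked against v3 rules.
v2_style_doc = {
    "created_by": "placeholder-user",
    "domain": "placeholder-domain",
    "task_description": "placeholder description",
    "seed_examples": [],
    "document": {},
}
print(check_top_level(v2_style_doc))
```

This only covers field presence; the values themselves still need the per-field checks that the schema defines.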