Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider updating API design for clarity #61

Closed
markmc opened this issue Jul 1, 2024 · 8 comments
Closed

Consider updating API design for clarity #61

markmc opened this issue Jul 1, 2024 · 8 comments

Comments

@markmc
Copy link
Contributor

markmc commented Jul 1, 2024

Consider the following:

ds = Dataset.from_list(samples)

skills_flow = SynthGroundedSkillsFlow(client, "mixtral", teacher_model).get_flow()
skills_pipe = Pipeline(skills_flow)

sdg = SDG([skills_pipe])
gen_data = sdg.generate(ds)

or:

ds = Dataset.from_list(samples)

mmlu_flow = MMLUBenchFlow(client, teacher_model).get_flow()
mmlu_pipe = Pipeline(mmlu_flow)
knowledge_flow = SynthKnowledgeFlow(client, teacher_model).get_flow()
knowledge_pipe = Pipeline(knowledge_flow)

sdg = SDG([mmlu_pipe, knowledge_pipe])
gen_data = sdg.generate(ds)

Consider the nouns:

  • Dataset - this is from Hugging Face's datasets library
  • Block - not shown in the code above, but required to understand a flow - a block provides a generate() method transforms an input dataset and returns an output dataset
  • Block config - a description of how to instantiate and invoke a block
  • Flow - a class which describes how to render a sequence of block configs from a template
  • Pipeline - a pipeline is created from a sequence of block configs, and provides a generate() method in which it instantiates and invokes blocks in turn, passing the input dataset and collecting the output
  • SDG - an SDG is created from a list of pipelines, and its generate() method calls pipelines in turn

Proposals:

  1. Remove SDG - we don't need both SDG and Pipeline since Pipeline can already do everything SDG can do
  2. Model Flow as a block config template - it would be more clear if we reinforced the idea that a "flow" is a template of a block config sequence - a render() method make sense to me, and an extensible params object for the common case of instantiating multiple flows
  3. Create a pipeline from a sequence of flows - add a Pipeline.from_flows() convenience class method to Pipeline that knows how to render block configs from a sequence of flows

So we could have e.g.

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = SynthGroundedSkillsFlow(flow_params).render(params)

skills_pipe = Pipeline(block_configs)

gen_data = skills_pipe.generate(ds)

or:

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())

knowledge_pipe = Pipeline(block_configs)

gen_data = knowledge_pipe.generate(ds)

or:

ds = Dataset.from_list(samples)

flow_params = FlowParams(client, "mixtral", teacher_model)

knowledge_pipe = Pipeline.from_flows([MMLUBenchFlow, SynthKnowledgeFlow], flow_params)

gen_data = knowledge_pipe.generate(ds)

Resolving this issue would require an update to the design doc and a code change

It would definitely be better to do this before users of the API proliferate

@markmc markmc changed the title Consider pruning API nouns for clarity Consider updating API design for clarity Jul 2, 2024
This was referenced Jul 2, 2024
@shivchander
Copy link
Member

shivchander commented Jul 2, 2024

Remove SDG - we don't need both SDG and Pipeline since Pipeline can already do everything SDG can do

While I understand the intent to simplify the system, I believe removing the SDG might not be the best approach for the following reasons:

  1. Separation of Concerns: The SDG serves as a higher-level abstraction that can manage and coordinate multiple pipelines. This separation is important for maintaining clarity in the system design, especially when dealing with complex workflows that involve multiple pipelines. Mixing these responsibilities into a single Pipeline class could reduce the readability and maintainability of the code.

  2. Complex Workflows: In scenarios where we need to coordinate multiple pipelines, the SDG provides a clear and concise way to manage these interactions. It can handle the orchestration of pipelines, ensuring that each one is executed in the correct order and with the correct dependencies. Removing SDG could make it harder to manage these types of workflows effectively.

Currently we only support linear pipeline chaining (and each pipeline only supports linear chaining of the block). But in future we want to move towards supporting composition of complex ways to compose (any DAGs)

@shivchander
Copy link
Member

shivchander commented Jul 2, 2024

ds = Dataset.from_list(samples)

block_params = BlockConfigParams(client, "mixtral", teacher_model)
flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())

I like this proposal to enhance the clarity and convenience

@markmc
Copy link
Contributor Author

markmc commented Jul 3, 2024

Remove SDG - we don't need both SDG and Pipeline since Pipeline can already do everything SDG can do
...
Currently we only support linear pipeline chaining (and each pipeline only supports linear chaining of the block). But in future we want to move towards supporting composition of complex ways to compose (any DAGs)

I can totally see that, but consider the YAGNI principle

"Always implement things when you actually need them, never when you just foresee that you [will] need them."

It's very easy to add this concept later when we actually add support for more complex compositions. And at that point, we can consider an API design in the context of a more concrete proposal.

Right now, this concept adds nothing but confusion - e.g. should you chain together MMLUBenchFlow and SynthKnowledgeFlow in a single pipeline, or make each a pipeline and chain them together with SDG? Eiher!

Also, I don't think SDG is a great name. It's not a noun. It doesn't convey its role in terms of composition/orchestration of pipelines. This is the sort of thing we can get right later when we need to add it.

@markmc
Copy link
Contributor Author

markmc commented Jul 3, 2024

ds = Dataset.from_list(samples)

block_params = BlockConfigParams(client, "mixtral", teacher_model)
flow_params = FlowParams(client, "mixtral", teacher_model)

block_configs = MMLUBenchFlow(flow_params).render()
block_configs.extend(SynthKnowledgeFlow(flow_params).render())

I like this proposal to enhance the clarity and convenience

Thank you.

Just a note - the BlockConfigParams line should not be there, that's just a leftover from my editing.

@markmc
Copy link
Contributor Author

markmc commented Jul 5, 2024

  1. Model Flow as a block config template - it would be more clear if we reinforced the idea that a "flow" is a template of a block config sequence - a render() method make sense to me, and an extensible params object for the common case of instantiating multiple flows

In #86, flows are now YAML files with block_configs

  1. Create a pipeline from a sequence of flows - add a Pipeline.from_flows() convenience class method to Pipeline that knows how to render block configs from a sequence of flows

In #86, this is now PipelineContext

So it now looks like:

ds = Dataset.from_list(samples)

ctx = PipelineContext(client, "mixtral", teacher_model)

knowledge_pipe = Pipeline.from_flows(ctx, [MMLU_BENCH_FLOW, SYNTH_KNOWLEDGE_FLOW])

gen_data = knowledge_pipe.generate(ds)

@markmc
Copy link
Contributor Author

markmc commented Jul 9, 2024

As #86 develops further, I think it's becoming increasingly obvious that the "Flow" noun adds little but confusion to this abstraction

Consider:

Flow - a class which describes how to render a sequence of block configs from a template

In #86, "flows" are now simply config files with a sequence of block configs that can be used to construct a pipeline - they are simply a pipeline description, and we can simply call them that

I think it would be best to bite the bullet now in the early stages of the project and pare the nouns back to Pipeline and Block

markmc added a commit to markmc/dev-docs that referenced this issue Jul 9, 2024
See instructlab/sdg#61 for some of the discussion on this, but I've
tried to summarize those discussions in the design doc.

Signed-off-by: Mark McLoughlin <[email protected]>
@markmc
Copy link
Contributor Author

markmc commented Jul 9, 2024

Drafted a design doc for this - see instructlab/dev-docs#113

markmc added a commit to markmc/dev-docs that referenced this issue Jul 9, 2024
See instructlab/sdg#61 for some of the discussion on this, but I've
tried to summarize those discussions in the design doc.

Signed-off-by: Mark McLoughlin <[email protected]>
@markmc
Copy link
Contributor Author

markmc commented Jul 30, 2024

All done now.

@markmc markmc closed this as completed Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants