Quick fixes to make large scale testing work #3

Merged
merged 6 commits into from
Jul 11, 2024

Conversation

shivchander
Collaborator

No description provided.

aakankshaduggal and others added 5 commits July 9, 2024 14:43
The max_tokens value in llmblock.py was updated from 12000 to 4096 to optimize the performance of the LLM server.

Signed-off-by: shiv <[email protected]>
…tom error in case of empty dataset in middle of a pipeline

Signed-off-by: shiv <[email protected]>
Owner

@aakankshaduggal aakankshaduggal left a comment

lgtm, thanks @shivchander 🚢

Attaching the output for your reference.

ilab data generate --endpoint-url http://localhost:3000/v1 --model mistralai/Mixtral-8x7B-Instruct-v0.1 --num-instructions 1 --api-key EMPTY --model-family mixtral --pipeline full
INFO 2024-07-10 15:44:44,893 utils.py:161: _init_num_threads NumExpr defaulting to 10 threads.
INFO 2024-07-10 15:44:45,005 config.py:58: <module> PyTorch version 2.3.1 available.
Generating synthetic data using 'mistralai/Mixtral-8x7B-Instruct-v0.1' model, taxonomy:'taxonomy' against http://localhost:3000/v1 server
INFO 2024-07-10 15:44:47,123 generate_data.py:259: generate_data Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-07-10 15:44:47,323 llmblock.py:32: server_supports_batched LLM server supports batched inputs: True
INFO 2024-07-10 15:44:47,323 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:44:47,323 pipeline.py:47: generate Running block: gen_questions
INFO 2024-07-10 15:44:47,323 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response'],
    num_rows: 6
})
INFO 2024-07-10 15:44:47,997 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 6
})
INFO 2024-07-10 15:44:47,997 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:44:47,999 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:44:47,999 pipeline.py:47: generate Running block: eval_questions
INFO 2024-07-10 15:44:47,999 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 6
})
INFO 2024-07-10 15:44:49,427 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 6
})
INFO 2024-07-10 15:44:49,428 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:44:49,428 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:44:49,428 pipeline.py:47: generate Running block: filter_questions
INFO 2024-07-10 15:44:49,428 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 6
})
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1237.14 examples/s]
Filter: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 4524.60 examples/s]
INFO 2024-07-10 15:44:49,446 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 6
})
INFO 2024-07-10 15:44:49,446 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:44:49,448 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:44:49,448 pipeline.py:47: generate Running block: gen_responses
INFO 2024-07-10 15:44:49,448 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 6
})
INFO 2024-07-10 15:44:50,243 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 6
})
INFO 2024-07-10 15:44:50,243 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:44:50,245 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:44:50,245 pipeline.py:47: generate Running block: evaluate_qa_pair
INFO 2024-07-10 15:44:50,245 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 6
})
INFO 2024-07-10 15:44:52,704 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 6
})
INFO 2024-07-10 15:44:52,704 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:44:52,704 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:44:52,704 pipeline.py:47: generate Running block: filter_qa_pair
INFO 2024-07-10 15:44:52,704 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 6
})
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1936.73 examples/s]
Filter: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 3881.22 examples/s]
INFO 2024-07-10 15:44:52,713 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 6
})
INFO 2024-07-10 15:44:52,713 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:44:52,713 generate_data.py:286: generate_data Generated 1 samples
INFO 2024-07-10 15:44:52,718 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:44:52,718 pipeline.py:47: generate Running block: gen_contexts
INFO 2024-07-10 15:44:52,718 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response'],
    num_rows: 5
})
INFO 2024-07-10 15:44:56,598 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context'],
    num_rows: 5
})
INFO 2024-07-10 15:44:56,598 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:44:56,600 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:44:56,600 pipeline.py:47: generate Running block: gen_grounded_questions
INFO 2024-07-10 15:44:56,600 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context'],
    num_rows: 5
})
INFO 2024-07-10 15:45:01,204 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'num_samples', 'question'],
    num_rows: 12
})
INFO 2024-07-10 15:45:01,204 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:01,207 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:01,207 pipeline.py:47: generate Running block: eval_grounded_questions
INFO 2024-07-10 15:45:01,207 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'num_samples', 'question'],
    num_rows: 12
})
INFO 2024-07-10 15:45:04,379 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 12
})
INFO 2024-07-10 15:45:04,380 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:04,380 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:04,380 pipeline.py:47: generate Running block: filter_grounded_questions
INFO 2024-07-10 15:45:04,380 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 12
})
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 3076.13 examples/s]
Filter: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 8264.64 examples/s]
INFO 2024-07-10 15:45:04,389 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question'],
    num_rows: 10
})
INFO 2024-07-10 15:45:04,389 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:04,390 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:04,390 pipeline.py:47: generate Running block: gen_grounded_responses
INFO 2024-07-10 15:45:04,390 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question'],
    num_rows: 10
})
INFO 2024-07-10 15:45:05,908 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question', 'response'],
    num_rows: 10
})
INFO 2024-07-10 15:45:05,908 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:05,911 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:05,911 pipeline.py:47: generate Running block: evaluate_grounded_qa_pair
INFO 2024-07-10 15:45:05,911 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question', 'response'],
    num_rows: 10
})
INFO 2024-07-10 15:45:07,548 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question', 'response', 'evaluation', 'score'],
    num_rows: 10
})
INFO 2024-07-10 15:45:07,548 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:07,548 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:07,548 pipeline.py:47: generate Running block: filter_grounded_qa_pair
INFO 2024-07-10 15:45:07,548 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question', 'response', 'evaluation', 'score'],
    num_rows: 10
})
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 2562.82 examples/s]
Filter: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 6235.03 examples/s]
INFO 2024-07-10 15:45:07,557 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question', 'response', 'evaluation', 'score'],
    num_rows: 10
})
INFO 2024-07-10 15:45:07,557 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:07,557 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:07,557 pipeline.py:47: generate Running block: combine_question_and_context
INFO 2024-07-10 15:45:07,557 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question', 'response', 'evaluation', 'score'],
    num_rows: 10
})
Map (num_proc=8): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 88.16 examples/s]
INFO 2024-07-10 15:45:07,703 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context', 'question', 'response', 'evaluation', 'score'],
    num_rows: 10
})
INFO 2024-07-10 15:45:07,703 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:07,703 generate_data.py:286: generate_data Generated 2 samples
INFO 2024-07-10 15:45:07,706 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:07,706 pipeline.py:47: generate Running block: gen_questions
INFO 2024-07-10 15:45:07,706 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response'],
    num_rows: 5
})
INFO 2024-07-10 15:45:10,233 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 5
})
INFO 2024-07-10 15:45:10,233 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:10,235 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:10,235 pipeline.py:47: generate Running block: eval_questions
INFO 2024-07-10 15:45:10,235 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 5
})
INFO 2024-07-10 15:45:11,642 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 5
})
INFO 2024-07-10 15:45:11,642 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:11,642 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:11,642 pipeline.py:47: generate Running block: filter_questions
INFO 2024-07-10 15:45:11,642 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 5
})
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1514.63 examples/s]
Filter: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 3644.05 examples/s]
INFO 2024-07-10 15:45:11,651 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 3
})
INFO 2024-07-10 15:45:11,651 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:11,652 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:11,652 pipeline.py:47: generate Running block: gen_responses
INFO 2024-07-10 15:45:11,652 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 3
})
INFO 2024-07-10 15:45:14,099 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 2
})
INFO 2024-07-10 15:45:14,099 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:14,102 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:14,102 pipeline.py:47: generate Running block: evaluate_qa_pair
INFO 2024-07-10 15:45:14,102 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 2
})
INFO 2024-07-10 15:45:15,430 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 2
})
INFO 2024-07-10 15:45:15,430 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:15,430 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:15,430 pipeline.py:47: generate Running block: filter_qa_pair
INFO 2024-07-10 15:45:15,430 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 2
})
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 710.24 examples/s]
Filter: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1515.01 examples/s]
INFO 2024-07-10 15:45:15,437 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 2
})
INFO 2024-07-10 15:45:15,438 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:15,438 generate_data.py:286: generate_data Generated 3 samples
INFO 2024-07-10 15:45:15,442 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:15,442 pipeline.py:47: generate Running block: gen_questions
INFO 2024-07-10 15:45:15,442 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response'],
    num_rows: 12
})
INFO 2024-07-10 15:45:16,458 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 12
})
INFO 2024-07-10 15:45:16,458 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:16,460 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:16,460 pipeline.py:47: generate Running block: eval_questions
INFO 2024-07-10 15:45:16,460 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 12
})
INFO 2024-07-10 15:45:17,872 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 12
})
INFO 2024-07-10 15:45:17,872 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:17,872 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:17,872 pipeline.py:47: generate Running block: filter_questions
INFO 2024-07-10 15:45:17,872 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 12
})
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 3766.78 examples/s]
Filter: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 9126.32 examples/s]
INFO 2024-07-10 15:45:17,880 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 12
})
INFO 2024-07-10 15:45:17,880 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:17,881 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:17,881 pipeline.py:47: generate Running block: gen_responses
INFO 2024-07-10 15:45:17,881 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 12
})
INFO 2024-07-10 15:45:19,117 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 11
})
INFO 2024-07-10 15:45:19,118 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:19,120 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:19,120 pipeline.py:47: generate Running block: evaluate_qa_pair
INFO 2024-07-10 15:45:19,120 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 11
})
INFO 2024-07-10 15:45:21,885 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 11
})
INFO 2024-07-10 15:45:21,885 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:21,885 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:21,885 pipeline.py:47: generate Running block: filter_qa_pair
INFO 2024-07-10 15:45:21,885 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 11
})
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 3498.70 examples/s]
Filter: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 8639.95 examples/s]
INFO 2024-07-10 15:45:21,893 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 11
})
INFO 2024-07-10 15:45:21,893 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:21,893 generate_data.py:286: generate_data Generated 4 samples
INFO 2024-07-10 15:45:21,897 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:21,897 pipeline.py:47: generate Running block: gen_questions
INFO 2024-07-10 15:45:21,897 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response'],
    num_rows: 5
})
INFO 2024-07-10 15:45:22,707 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 5
})
INFO 2024-07-10 15:45:22,707 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:22,709 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:22,709 pipeline.py:47: generate Running block: eval_questions
INFO 2024-07-10 15:45:22,709 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 5
})
INFO 2024-07-10 15:45:23,933 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 5
})
INFO 2024-07-10 15:45:23,933 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:23,933 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:23,933 pipeline.py:47: generate Running block: filter_questions
INFO 2024-07-10 15:45:23,933 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 5
})
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1738.07 examples/s]
Filter: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 3550.88 examples/s]
INFO 2024-07-10 15:45:23,941 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 5
})
INFO 2024-07-10 15:45:23,941 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:23,943 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:23,943 pipeline.py:47: generate Running block: gen_responses
INFO 2024-07-10 15:45:23,943 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 5
})
INFO 2024-07-10 15:45:25,208 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 5
})
INFO 2024-07-10 15:45:25,208 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:25,211 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:25,211 pipeline.py:47: generate Running block: evaluate_qa_pair
INFO 2024-07-10 15:45:25,211 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 5
})
INFO 2024-07-10 15:45:27,003 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 5
})
INFO 2024-07-10 15:45:27,003 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:27,003 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:27,003 pipeline.py:47: generate Running block: filter_qa_pair
INFO 2024-07-10 15:45:27,003 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 5
})
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1560.50 examples/s]
Filter: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 3725.62 examples/s]
INFO 2024-07-10 15:45:27,011 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 5
})
INFO 2024-07-10 15:45:27,012 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:27,012 generate_data.py:286: generate_data Generated 5 samples
INFO 2024-07-10 15:45:27,015 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:27,015 pipeline.py:47: generate Running block: gen_questions
INFO 2024-07-10 15:45:27,015 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response'],
    num_rows: 5
})
INFO 2024-07-10 15:45:28,646 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 5
})
INFO 2024-07-10 15:45:28,646 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:28,647 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:28,647 pipeline.py:47: generate Running block: eval_questions
INFO 2024-07-10 15:45:28,647 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question'],
    num_rows: 5
})
INFO 2024-07-10 15:45:32,433 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 6
})
INFO 2024-07-10 15:45:32,433 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:32,433 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:32,434 pipeline.py:47: generate Running block: filter_questions
INFO 2024-07-10 15:45:32,434 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'num_samples', 'question', 'evaluation', 'score'],
    num_rows: 6
})
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1835.05 examples/s]
Filter: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 4146.62 examples/s]
INFO 2024-07-10 15:45:32,442 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 4
})
INFO 2024-07-10 15:45:32,442 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:32,443 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:32,444 pipeline.py:47: generate Running block: gen_responses
INFO 2024-07-10 15:45:32,444 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question'],
    num_rows: 4
})
INFO 2024-07-10 15:45:36,834 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 4
})
INFO 2024-07-10 15:45:36,835 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:36,837 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:36,837 pipeline.py:47: generate Running block: evaluate_qa_pair
INFO 2024-07-10 15:45:36,837 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 4
})
INFO 2024-07-10 15:45:38,571 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 4
})
INFO 2024-07-10 15:45:38,571 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:38,571 pipeline.py:45: generate ------------------------------------

INFO 2024-07-10 15:45:38,571 pipeline.py:47: generate Running block: filter_qa_pair
INFO 2024-07-10 15:45:38,571 pipeline.py:48: generate Input dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response', 'evaluation', 'score'],
    num_rows: 4
})
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1243.86 examples/s]
Filter: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 2898.62 examples/s]
INFO 2024-07-10 15:45:38,579 pipeline.py:62: generate Output dataset: Dataset({
    features: ['task_description', 'seed_question', 'seed_response', 'question', 'response'],
    num_rows: 4
})
INFO 2024-07-10 15:45:38,579 pipeline.py:63: generate ------------------------------------


INFO 2024-07-10 15:45:38,579 generate_data.py:286: generate_data Generated 6 samples
INFO 2024-07-10 15:45:38,585 generate_data.py:304: generate_data Generation took 53.42s

@shivchander shivchander merged commit 1adf044 into main Jul 11, 2024
7 of 9 checks passed
@@ -64,7 +64,7 @@ def __init__(
         self.defaults = {
             "model": self.model,
             "temperature": 0,
-            "max_tokens": 12000,
+            "max_tokens": 4096,
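
A minimal sketch of how a defaults dict like the one in this hunk could be combined with per-call overrides, so the lower `max_tokens` applies unless a caller asks for something else. The helper name `build_gen_kwargs` and the merge behavior are assumptions for illustration; they are not taken from `llmblock.py`.

```python
def build_gen_kwargs(model, **overrides):
    """Merge hypothetical per-call overrides on top of block-level defaults."""
    defaults = {
        "model": model,
        "temperature": 0,
        "max_tokens": 4096,  # value after this PR; previously 12000
    }
    # Later keys win, so explicit overrides replace the defaults.
    return {**defaults, **overrides}


default_kwargs = build_gen_kwargs("mistralai/Mixtral-8x7B-Instruct-v0.1")
short_kwargs = build_gen_kwargs("mistralai/Mixtral-8x7B-Instruct-v0.1", max_tokens=512)
print(default_kwargs["max_tokens"], short_kwargs["max_tokens"])
```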

dataset = block.generate(dataset, **gen_kwargs)

if len(dataset) == 0:
    raise EmptyDatasetError(f"Pipeline stopped: Empty dataset after running block: {block_config['block_name']}")
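
The check above can be sketched in a self-contained form. This is an illustration under assumptions, not the project's pipeline code: the `EmptyDatasetError` definition, the `run_pipeline` helper, and the toy blocks working on plain lists of dicts are all hypothetical stand-ins.

```python
class EmptyDatasetError(Exception):
    """Raised when a block leaves no rows (hypothetical definition)."""


def run_pipeline(blocks, dataset):
    """Run (name, function) blocks in order; stop as soon as the dataset is empty."""
    for block_name, block_fn in blocks:
        dataset = block_fn(dataset)
        if len(dataset) == 0:
            raise EmptyDatasetError(
                f"Pipeline stopped: Empty dataset after running block: {block_name}"
            )
    return dataset


# Toy blocks; the names mirror the log output, the bodies are made up.
blocks = [
    ("gen_questions", lambda rows: [dict(r, question="q?") for r in rows]),
    ("filter_questions", lambda rows: [r for r in rows if r["score"] >= 1]),
    ("gen_responses", lambda rows: [dict(r, response="a") for r in rows]),
]

try:
    run_pipeline(blocks, [{"score": 0}, {"score": 0}])
    message = ""
except EmptyDatasetError as err:
    message = str(err)
print(message)
```

Failing at the first empty block names the stage that dropped every row, instead of letting a later block fail on an empty input with a less direct error.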

Comment on lines +55 to +59
if self.operation == operator.contains:
    samples = samples.filter(
        lambda x: self.operation(self.value, x[self.column_name]),
        num_proc=self.num_procs,
    )

Do you have an example you can send me? I'm trying to write a test case for this and contains works how I expected.

I'm also not clear what behavior you're going for by still running the operation with the parameters reversed right after this. Maybe that wasn't on purpose?

In any case, I think if we're looking at the same flows where it's used, we can nail it down quickly. Feel free to ping me privately on slack with details if needed.
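
For reference while pinning this down, a small sketch of how the argument order of `operator.contains` changes the result. `operator.contains(a, b)` evaluates `b in a`, so the container goes first. The sample values below are made up for illustration and are not taken from the flows in this PR.

```python
import operator

allowed_scores = [1.0, 2.0]  # hypothetical filter value (a container)
row = {"score": 2.0}         # hypothetical sample

# Filter value first: "is the column value one of the allowed values?"
value_first = operator.contains(allowed_scores, row["score"])

# Column value first: a float is not a container, so this raises TypeError.
try:
    operator.contains(row["score"], allowed_scores)
    reversed_result = "no error"
except TypeError:
    reversed_result = "TypeError"

# With strings both orders run, but they answer different questions.
substring_in_value = operator.contains("mixtral-8x7b", "mixtral")  # "mixtral" in "mixtral-8x7b"
value_in_substring = operator.contains("mixtral", "mixtral-8x7b")  # "mixtral-8x7b" in "mixtral"

print(value_first, reversed_result, substring_in_value, value_in_substring)
```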

@shivchander
