Refactor Caching and Introduce Data Checkpointing #19

shivchander · 2024-07-24T07:22:33Z

This PR introduces data checkpointing and a simple driver file for data generation and testing in the SDG framework.

This PR builds on #9 and improves on it.

e8522e2: Refactor Caching to Data Checkpointing: Implemented a data checkpointing system that replaces end-of-process caching, improving progress tracking and recovery.
61eff48: Add Simple Driver File for Data Generation: Introduced a basic driver file to facilitate data generation and testing within the SDG framework.

commit e8522e2
Author: shiv <[email protected]>
Date: Wed Jul 24 03:16:00 2024 -0400

Refactor Caching to Data Checkpointing

The previous caching implementation was a basic first version that saved 
a cache only at the end of the generation process. This commit 
introduces a more robust data checkpointing system:

- Data Checkpointing: Replaces the old caching method with a checkpointing 
  system that saves intermediate states during data generation. This allows 
  for more granular progress tracking and recovery.

- Intermediate Saves: Checkpoints are saved periodically based on the 
  save_freq setting, preventing data loss and enabling resumption from 
  the last saved state in case of interruptions.

- Resuming from Checkpoint: The system can resume from a saved checkpoint 
  by comparing the generated data with the seed data to identify and process 
  missing data.

- UUID-based Identifiers: Each checkpoint is uniquely identified using UUIDs, 
  ensuring distinct and traceable save points.

- Removed Old Caching Logic: Eliminated the previous end-of-process caching 
  approach, which has been superseded by the new checkpointing mechanism.

- Improved Error Handling and Logging: Added comprehensive error logging and 
  handling to track issues and progress during dataset generation.

Signed-off-by: shiv <[email protected]>
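
The checkpointing behavior described in the commit message above can be illustrated with a minimal sketch. This is not the code from this PR: the `Checkpointer` class, the `generate_with_checkpoints` function, the JSONL file layout, and the `id` key used to match generated records against seed records are all assumptions made for illustration. Only the `save_freq` setting, the UUID-based checkpoint names, and the idea of resuming by comparing generated data with seed data come from the commit message.

```python
import json
import uuid
from pathlib import Path


class Checkpointer:
    """Hypothetical helper; a sketch, not the SDG implementation."""

    def __init__(self, checkpoint_dir, save_freq=2):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        # Number of newly generated records to collect before saving.
        self.save_freq = save_freq

    def save(self, records):
        # A UUID in the file name keeps every checkpoint distinct.
        name = f"data_checkpoint_{uuid.uuid4().hex}.jsonl"
        path = self.checkpoint_dir / name
        with path.open("w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
        return path

    def load(self):
        # Collect every record saved by earlier runs.
        records = []
        for path in sorted(self.checkpoint_dir.glob("data_checkpoint_*.jsonl")):
            with path.open(encoding="utf-8") as f:
                records.extend(json.loads(line) for line in f if line.strip())
        return records


def generate_with_checkpoints(seed_data, generate_fn, checkpointer, key="id"):
    """Generate one record per seed, skipping seeds already checkpointed."""
    done = checkpointer.load()
    done_keys = {record[key] for record in done}
    # Resume: compare generated data with seed data to find what is missing.
    missing = [seed for seed in seed_data if seed[key] not in done_keys]

    results = list(done)
    pending = []
    for seed in missing:
        pending.append(generate_fn(seed))
        if len(pending) >= checkpointer.save_freq:
            checkpointer.save(pending)
            results.extend(pending)
            pending = []
    if pending:
        # Save the final partial batch.
        checkpointer.save(pending)
        results.extend(pending)
    return results
```

In this sketch, an interruption loses at most the records generated since the last save, so a smaller `save_freq` trades more files for less repeated work. Calling `generate_with_checkpoints` again with the same checkpoint directory processes only the seeds that have no saved output.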

commit 61eff48
Author: shiv <[email protected]>
Date: Wed Jul 24 03:16:00 2024 -0400

Add Simple Driver File for Data Generation

Currently, we do not have any driver file apart from the ilab data 
generator to test SDG. This is not ideal for developers, and especially 
not for the research community. This is an initial attempt at a simple driver file.

- Purpose: Provides a basic script for generating data, addressing the 
  lack of a dedicated driver file apart from the ilab data generator. 
  Aims to assist developers and the research community by providing a 
  straightforward entry point for testing and experimentation.

- Notes: Assumes familiarity with the codebase, particularly in terms of 
  dataset formatting and configuration. This is an initial version and may 
  require further iteration.

This simple driver file serves as a starting point for data generation 
tasks, providing a practical tool for developers and researchers to engage 
with the SDG framework.

Signed-off-by: shiv <[email protected]>



shivchander added 2 commits July 24, 2024 06:57

shivchander requested a review from aakankshaduggal July 24, 2024 07:22

markmc mentioned this pull request Jul 24, 2024

[Epic] Add data checkpointing and recovery instructlab/sdg#195

Closed

4 tasks

aakankshaduggal approved these changes Jul 24, 2024


aakankshaduggal merged commit a302954 into main Jul 24, 2024
7 of 9 checks passed
