Refactor Caching and Introduce Data Checkpointing #19

Merged 2 commits on Jul 24, 2024

@shivchander shivchander commented Jul 24, 2024

This PR introduces data checkpointing and a simple driver file for data generation and testing in the SDG framework.

This PR is based on #9 and improves on it.

e8522e2: Refactor Caching to Data Checkpointing: Implemented data checkpointing system, replacing end-of-process caching for improved progress tracking and recovery.
61eff48: Add Simple Driver File for Data Generation: Introduced a basic driver file to facilitate data generation and testing within the SDG framework.

commit e8522e2
Author: shiv [email protected]
Date: Wed Jul 24 03:16:00 2024 -0400

Refactor Caching to Data Checkpointing

The previous caching implementation was basic: it saved a cache only at 
the end of the generation process. This commit introduces a more robust 
data checkpointing system:

- Data Checkpointing: Replaces the old caching method with a checkpointing 
  system that saves intermediate states during data generation. This allows 
  for more granular progress tracking and recovery.

- Intermediate Saves: Checkpoints are saved periodically based on the 
  save_freq setting, preventing data loss and enabling resumption from 
  the last saved state in case of interruptions.

- Resuming from Checkpoint: The system can resume from a saved checkpoint 
  by comparing the generated data with the seed data to identify and process 
  missing data.

- UUID-based Identifiers: Each checkpoint is uniquely identified using UUIDs, 
  ensuring distinct and traceable save points.

- Removed Old Caching Logic: Eliminated the previous end-of-process caching 
  approach, which has been superseded by the new checkpointing mechanism.

- Improved Error Handling and Logging: Added comprehensive error logging and 
  handling to track issues and progress during dataset generation.

Signed-off-by: shiv <[email protected]>
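
The intermediate saves described in this commit message could look roughly like the sketch below. It is not the code from this PR: the `Checkpointer` class, its method names, the `checkpoint_<uuid>.jsonl` file naming, and the JSON Lines format are all assumptions chosen for illustration. Only the `save_freq` setting and the use of UUIDs to identify checkpoints come from the description above.

```python
import json
import uuid
from pathlib import Path


class Checkpointer:
    """Hypothetical helper: buffers rows and writes a checkpoint every `save_freq` rows."""

    def __init__(self, checkpoint_dir, save_freq=2):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        self.save_freq = save_freq
        self._buffer = []

    def add(self, row):
        """Record one generated row; flush once the buffer holds `save_freq` rows."""
        self._buffer.append(row)
        if len(self._buffer) >= self.save_freq:
            self.flush()

    def flush(self):
        """Write buffered rows to a new UUID-named file and clear the buffer."""
        if not self._buffer:
            return None
        # A fresh UUID per file keeps every save point distinct and traceable.
        path = self.checkpoint_dir / f"checkpoint_{uuid.uuid4().hex}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for row in self._buffer:
                fh.write(json.dumps(row) + "\n")
        self._buffer = []
        return path

    def load_all(self):
        """Return every row stored in any checkpoint file in the directory."""
        rows = []
        for path in sorted(self.checkpoint_dir.glob("checkpoint_*.jsonl")):
            with path.open(encoding="utf-8") as fh:
                rows.extend(json.loads(line) for line in fh if line.strip())
        return rows
```

A generation loop would call `add()` for each generated row and `flush()` once at the end, so that an interruption loses at most `save_freq - 1` rows.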

commit 61eff48
Author: shiv [email protected]
Date: Wed Jul 24 03:16:00 2024 -0400

Add Simple Driver File for Data Generation

Currently, we do not have any driver file apart from the ilab data 
generator to test SDG. This is not ideal for developers, and especially 
not for the research community. This is an initial attempt at a simple 
driver file.

- Purpose: Provides a basic script for generating data, addressing the 
  lack of a dedicated driver file apart from the ilab data generator. 
  Aims to assist developers and the research community by providing a 
  straightforward entry point for testing and experimentation.

- Notes: Assumes familiarity with the codebase, particularly in terms of 
  dataset formatting and configuration. This is an initial version and may 
  require further iteration.

This simple driver file serves as a starting point for data generation 
tasks, providing a practical tool for developers and researchers to engage 
with the SDG framework.

Signed-off-by: shiv <[email protected]>

Commit e8522e2 makes the data generation process more reliable and flexible by replacing the simple end-of-process cache with a data checkpointing mechanism.
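
Resuming from a checkpoint, as described for commit e8522e2, amounts to comparing the seed data with the rows already generated and processing only what is missing. The sketch below illustrates that comparison with plain Python dictionaries. The function name `find_missing_rows` and the `key_fields` parameter are hypothetical, and the PR's actual comparison logic may differ.

```python
def find_missing_rows(seed_rows, generated_rows, key_fields):
    """Return the seed rows that have no counterpart in the generated rows.

    Rows are matched on the values of `key_fields`, which are assumed to be
    the columns that the seed data and the generated data have in common.
    """
    def key(row):
        return tuple(row[field] for field in key_fields)

    done = {key(row) for row in generated_rows}
    return [row for row in seed_rows if key(row) not in done]


# Example: one of three seed rows was generated before the interruption.
seed = [{"question": "q1"}, {"question": "q2"}, {"question": "q3"}]
generated = [{"question": "q1", "answer": "a1"}]
print(find_missing_rows(seed, generated, key_fields=["question"]))
# [{'question': 'q2'}, {'question': 'q3'}]
```

Matching on the shared columns means a resumed run does not need to know which checkpoint file a row came from, only whether the row exists.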

@aakankshaduggal aakankshaduggal merged commit a302954 into main Jul 24, 2024
7 of 9 checks passed