Refactor Data Curation into its own stage #181

Merged 3 commits on Dec 27, 2023 (diff below shows changes from 2 commits)
9 changes: 3 additions & 6 deletions launcher_scripts/conf/config.yaml
@@ -1,10 +1,8 @@
 defaults:
   - _self_
   - cluster: bcm # Set to bcm for BCM and BCP clusters. Set to k8s for a k8s cluster.
+  - data_curation: common_crawl/curate_common_crawl
   - data_preparation: gpt3/download_gpt3_pile
-  - quality_filtering: heuristic/english
-  - lang_separation_and_cleaning: lang_separation_and_cleaning
-  - task_deduplication: task_deduplication
   - training: gpt3/5b
   - conversion: gpt3/convert_gpt3
   - conversion_hf2nemo: hf_llama2/convert_llama2_nemo
@@ -32,6 +30,7 @@ stages:
   #- training
   - conversion
   #- conversion_hf2nemo
+  # - conversion
   #- prompt_learning
   #- adapter_learning
   #- ia3_learning
@@ -75,9 +74,7 @@ numa_mapping:
 
 # Do not modify below, use the values above instead.
 data_preparation_config: ${hydra:runtime.choices.data_preparation}
-quality_filtering_config: ${hydra:runtime.choices.quality_filtering}
-lang_separation_and_cleaning_config: ${hydra:runtime.choices.lang_separation_and_cleaning}
-task_deduplication_config: ${hydra:runtime.choices.task_deduplication}
+data_curation_config: ${hydra:runtime.choices.data_curation}
 training_config: ${hydra:runtime.choices.training}
 fine_tuning_config: ${hydra:runtime.choices.fine_tuning}
 peft_config: ${hydra:runtime.choices.peft}
@@ -0,0 +1,46 @@
run:
  name: 'data-curation'
  results_dir: ${base_results_dir}/${.name}

# Many steps in the data curator do not use GPUs.
# Adjust the configs here if you would like to use different cluster configurations for jobs that do/don't require GPUs.
cpu_config:
  partition:

gpu_config:
  partition:

stages:
  - lang_separation_and_cleaning
  - task_deduplication

lang_separation_and_cleaning:
  - fasttext_download
  - language_identification
  - separate_by_language
  - choose_language
  - text_cleaning

task_deduplication:
  - prepare_task_data
  - find_matching_ngrams
  - remove_matching_ngrams

filter_quality:
  - quality_filtering

dataset_name: common_crawl

defaults:
  - common_crawl/fasttext_download/fasttext_download
  - common_crawl/language_identification/language_identification
  - common_crawl/separate_by_language/separate_by_language
  - common_crawl/text_cleaning/text_cleaning
  - common_crawl/prepare_task_data/prepare_task_data
  - common_crawl/find_matching_ngrams/find_matching_ngrams
  - common_crawl/remove_matching_ngrams/remove_matching_ngrams
  - common_crawl/quality_filtering/heuristic/english

special:
  choose_language:
    language: PL # Change to the language of choice; see fastText's supported languages: https://fasttext.cc/docs/en/language-identification.html
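
This new top-level config drives the whole curation pipeline: each entry in stages names a group, each group expands into the ordered sub-stage list defined below it, and every sub-stage carries its own run block (nodes, time limit, node_type). A minimal sketch of how a single stage class could walk this layout; submit_job and the traversal details are hypothetical illustrations, not the launcher's actual implementation:

from omegaconf import DictConfig

def submit_job(name: str, sub_cfg: DictConfig, cluster_cfg: DictConfig) -> None:
    # Stand-in for however the launcher renders and submits a job script.
    print(f"would submit {name} on partition={cluster_cfg.get('partition')}")

def run_data_curation(curation: DictConfig) -> None:
    for group in curation.stages:             # e.g. lang_separation_and_cleaning
        for sub_stage in curation[group]:     # e.g. fasttext_download
            # Sub-stages like choose_language keep their settings under
            # `special`; the rest are merged in by the defaults list.
            sub_cfg = curation.get(sub_stage) or curation.special.get(sub_stage)
            # Jobs run with the CPU cluster config unless marked node_type: gpu.
            node_type = (sub_cfg.get("run") or {}).get("node_type", "cpu")
            cluster = curation.gpu_config if node_type == "gpu" else curation.cpu_config
            submit_job(sub_stage, sub_cfg, cluster)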
@@ -1,9 +1,10 @@
 run:
   name: 'fasttext-download'
-  results_dir: ${lang_separation_and_cleaning.run.results_dir}/${.name}
-  time_limit: "00:20:00"
+  results_dir: ${data_curation.run.results_dir}/${.name}
   dependency: "singleton"
+  time_limit: "00:20:00"
   nodes: 1
+  node_type: cpu
 
 filter_config:
   filter_module: ndc.filter.classifier.filter.FastTextLangId
@@ -0,0 +1,9 @@
run:
  name: 'find-matching-ngrams'
  results_dir: ${data_curation.run.results_dir}/${.name}
  dependency: "singleton"
  time_limit: "08:00:00"
  nodes: 2
  node_type: cpu

output_matched_ngram_data: ${.run.results_dir}/matched_ngrams.pkl
@@ -0,0 +1,9 @@
run:
  name: 'language-identification'
  results_dir: ${data_curation.run.results_dir}/${.name}
  dependency: "singleton"
  time_limit: "04:00:00"
  nodes: 1
  node_type: cpu

log_scores: True
@@ -1,11 +1,12 @@
 run:
   name: 'prepare-task-data'
-  results_dir: ${task_deduplication.run.results_dir}/${.name}
-  time_limit: "04:00:00"
+  results_dir: ${data_curation.run.results_dir}/${.name}
   dependency: "singleton"
-  nodes: 1
+  time_limit: "04:00:00"
+  nodes: 2
+  node_type: cpu
 
-output_task_ngrams: ${task_deduplication.prepare_task_data.run.results_dir}/task_ngrams.pkl
+output_task_ngrams: ${.run.results_dir}/task_ngrams.pkl
 # The below flag skips computation of task n-grams if the file above is already present
 # Set to False if you want to recompute task n-grams with different tasks
 use_ngram_cache: True
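
Together, the three task_deduplication sub-stages implement downstream-task decontamination: prepare_task_data collects n-grams from evaluation-task text (cached in task_ngrams.pkl when use_ngram_cache is True), find_matching_ngrams scans the corpus for those n-grams, and remove_matching_ngrams drops the hits. A rough sketch of the idea follows; the function names are hypothetical and unrelated to the actual ndc modules, and the window size of 13 tokens is just a value typical of decontamination setups:

from typing import Iterable, Iterator, Set

NGram = tuple[str, ...]

def ngrams(tokens: list[str], n: int) -> Iterator[NGram]:
    # Slide a window of n tokens across the document.
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_task_ngrams(task_docs: Iterable[str], n: int = 13) -> Set[NGram]:
    # prepare_task_data: gather every task n-gram once (then cache to disk).
    out: Set[NGram] = set()
    for doc in task_docs:
        out.update(ngrams(doc.split(), n))
    return out

def find_contaminated(corpus: list[str], task_ngrams: Set[NGram], n: int = 13) -> list[int]:
    # find_matching_ngrams: index the documents that contain any task n-gram.
    return [i for i, doc in enumerate(corpus)
            if any(g in task_ngrams for g in ngrams(doc.split(), n))]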
@@ -1,11 +1,10 @@
 run:
   name: 'heuristic-filter-en'
   results_dir: ${base_results_dir}/${.name}
-  time_limit: "08:00:00"
   dependency: "singleton"
+  time_limit: "08:00:00"
   nodes: 1
-  partition:
-  cpus_per_node: 48
+  node_type: cpu
 
 # Provide the downloader, data loader and extraction modules that
 # define how the dataset will be built from the URLs
@@ -121,6 +120,6 @@ filter:
   # will stop at first filter that is triggered during the above defined pipeline
   stop_at_true: True
 
-input_dir: ${data_dir}/json/original
+# input_dir: ${data_dir}/json/original
 # Output directory to where filtered documents will be written
 output_retained_document_dir: ${data_dir}/json/filtered/high_quality
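
The stop_at_true flag above encodes a short-circuiting filter cascade: a document flows through the heuristic filters in order and is rejected at the first one that fires. A minimal sketch of that control flow, with made-up filters standing in for the ndc heuristics:

from typing import Callable

# Each filter returns True when a document should be rejected.
Filter = Callable[[str], bool]

def too_short(doc: str) -> bool:
    return len(doc.split()) < 50

def mostly_symbols(doc: str) -> bool:
    alnum = sum(ch.isalnum() for ch in doc)
    return alnum / max(len(doc), 1) < 0.6

def keep_document(doc: str, filters: list[Filter], stop_at_true: bool = True) -> bool:
    rejected = False
    for f in filters:
        if f(doc):
            rejected = True
            if stop_at_true:
                break  # stop at the first filter that is triggered
    return not rejected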
@@ -0,0 +1,9 @@
run:
  name: 'remove-matching-ngrams'
  results_dir: ${data_curation.run.results_dir}/${.name}
  dependency: "singleton"
  time_limit: "08:00:00"
  nodes: 2
  node_type: cpu

output_task_deduped_dir: ${data_dir}/task_deduped
@@ -0,0 +1,10 @@
run:
  name: 'separate-by-language'
  results_dir: ${data_curation.run.results_dir}/${.name}
  dependency: "singleton"
  time_limit: "01:00:00"
  nodes: 1
  node_type: cpu

output_data_dir: ${data_dir}/lang_separated
output_language_distribution: ${.run.results_dir}/lang_distro.json
@@ -0,0 +1,8 @@
run:
  name: 'text-cleaning'
  results_dir: ${data_curation.run.results_dir}/${.name}
  time_limit: "04:00:00"
  nodes: 1
  node_type: cpu

output_clean_dir: ${data_dir}/clean

7 files were deleted (contents not shown).
10 changes: 2 additions & 8 deletions launcher_scripts/main.py
@@ -17,11 +17,7 @@
 
 import hydra
 import omegaconf
-from nemo_launcher.core.data_curation_stages import (
-    QualityFiltering,
-    LangSeparationAndCleaning,
-    TaskDeduplication,
-)
+from nemo_launcher.core.data_curation_stages import DataCurationStage
 from nemo_launcher.core.data_stages import (
     CustomDataPreparation,
     MC4DataPreparation,
@@ -82,9 +78,7 @@
     },
     "rlhf_rm": RLHFRewardModel,
     "rlhf_ppo": RLHFPPO,
-    "quality_filtering": QualityFiltering,
-    "lang_separation_and_cleaning": LangSeparationAndCleaning,
-    "task_deduplication": TaskDeduplication,
+    "data_curation": DataCurationStage,
     "steerlm_reg": SteerLMRegSFT,
 }
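
With the three per-step classes collapsed into a single DataCurationStage, stage selection stays table-driven: each name listed under stages in config.yaml is looked up in this dict and the matching class is instantiated and run. A hedged sketch of that dispatch loop; the launcher's real code may differ in details such as job-dependency chaining:

def run_pipeline(cfg, stages_map):
    job_ids = []
    for stage_name in cfg.stages:
        stage_class = stages_map[stage_name]  # e.g. "data_curation" -> DataCurationStage
        stage = stage_class(cfg)
        job_ids.append(stage.run())           # assumed to return a scheduler job id
    return job_ids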
