update parsing of dataset_config.file to prevent custom-function-name from clobbering data-collator name. #829

Open
wants to merge 1 commit into base: main
Conversation

yaoshiang

What does this PR do?

Makes a minor fix to the parsing of the --custom_dataset.file flag. The documentation says you can add a colon in this value to specify a custom name to replace the get_custom_dataset function.

Unfortunately, the string after the colon is ALSO used to set a custom data collator name. This update forces the data collator name to always be "get_data_collator" and updates the documentation to reflect that.

Fixes #828
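
For context, here is a minimal sketch of the intended parsing behavior after this change (the function and variable names below are illustrative assumptions, not the exact llama-recipes implementation): the name after the colon only overrides the dataset-loading function, while the data collator is always looked up as `get_data_collator`.

```python
# Hypothetical sketch of the post-fix parsing; not the verbatim llama-recipes code.
def parse_custom_dataset_file(file_arg: str):
    """Split "path/to/module.py[:func_name]" into (module_path, dataset_fn, collator_fn)."""
    if ":" in file_arg:
        # The colon only renames the dataset-loading function...
        module_path, dataset_fn = file_arg.split(":", 1)
    else:
        module_path, dataset_fn = file_arg, "get_custom_dataset"
    # ...while the collator name is now fixed, so it can no longer be clobbered.
    collator_fn = "get_data_collator"
    return module_path, dataset_fn, collator_fn
```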

Feature/Issue validation/testing

Use the "custom_dataset" provided from the recipes. Copy it to a local directory as

cp ../llama-recipes/recipes/quickstart/finetuning/datasets/custom_dataset.py .
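
For reference, the interface the finetuning script expects from this module looks roughly like the sketch below. The signatures are inferred from the calls visible in the logs further down and may not match the recipe file exactly.

```python
# custom_dataset.py -- hedged sketch of the expected module interface (inferred, not verbatim).
def get_custom_dataset(dataset_config, tokenizer, split):
    """Return the (pre)tokenized dataset for the requested split."""
    ...

def get_data_collator(dataset_processer):
    """Optional: return a custom collate_fn; if missing, the default data collator is used."""
    ...
```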

  • Verify that custom_dataset.py works correctly when the colon is not present.
(llama-dev) azureuser@yh-a100:~/cloudfiles/code/test_llama$ ./working.sh 
++ python -m llama_recipes.finetuning --dataset custom_dataset --custom_dataset.file custom_dataset.py --model_name meta-llama/Llama-3.2-1B-Instruct --use_peft --peft_method lora --num_epochs 1 --max_train_step 2 --max_eval_step 3
/mnt/batch/tasks/shared/LS_root/mounts/clusters/yh-a100/code/llama-recipes/src/llama_recipes/model_checkpointing/checkpoint_handler.py:17: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
  from torch.distributed._shard.checkpoint import (
--> Model meta-llama/Llama-3.2-1B-Instruct

--> meta-llama/Llama-3.2-1B-Instruct has 1235.8144 Million params

trainable params: 851,968 || all params: 1,236,666,368 || trainable%: 0.0689
Parameter 'function'=<function get_custom_dataset.<locals>.<lambda> at 0x7fe42c924790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 9846/9846 [00:00<00:00, 10038.77 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 9846/9846 [00:00<00:00, 23799.51 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:01<00:00, 27242.59 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:52<00:00, 837.00 examples/s]
--> Training Set Length = 44042
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 518/518 [00:00<00:00, 13515.03 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 518/518 [00:00<00:00, 22014.00 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 2241/2241 [00:00<00:00, 31612.27 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2241/2241 [00:02<00:00, 848.35 examples/s]
--> Validation Set Length = 2241
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:15<00:00, 2856.91it/s]
length of dataset_train 3974
Can not find the custom data_collator in the dataset.py file (custom_dataset.py).
Using the default data_collator instead.
--> Num of Training Set Batches loaded = 993
Preprocessing dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2241/2241 [00:00<00:00, 2811.06it/s]
--> Num of Validation Set Batches loaded = 206
--> Num of Validation Set Batches loaded = 206
Starting epoch 0/1
train_config.max_train_step: 2
/anaconda/envs/llama-dev/lib/python3.10/site-packages/torch/cuda/memory.py:365: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                           | 0/993 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Training Epoch: 1/1, step 1/993 completed (loss: 1.5411758422851562):   0%|                                                | 2/993 [00:02<16:33,  1.00s/it]max training steps reached, stopping training, total train steps finished:  2
Training Epoch: 1/1, step 1/993 completed (loss: 1.5411758422851562):   0%|                                                | 2/993 [00:02<20:15,  1.23s/it]
Max CUDA memory allocated was 54 GB
Max CUDA memory reserved was 55 GB
Peak active CUDA memory was 54 GB
CUDA Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 4 GB
evaluating Epoch:   0%|                                                                                                            | 0/206 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
evaluating Epoch:   1%|█▍                                                                                                  | 3/206 [00:00<00:22,  9.05it/s]max eval steps reached, stopping evaluation, total_eval_steps:  3
evaluating Epoch:   1%|█▍                                                                                                  | 3/206 [00:00<00:26,  7.54it/s]
 eval_ppl=tensor(1.0240, device='cuda:0') eval_epoch_loss=tensor(0.0237, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in PATH/to/save/PEFT/model directory
best eval loss on epoch 1 is 0.02374742180109024
Epoch 1: train_perplexity=1.0031, train_epoch_loss=0.0031, epoch time 3.264616988000853s
Key: avg_train_prep, Value: 1.0031403303146362
Key: avg_train_loss, Value: 0.003135421546176076
Key: avg_eval_prep, Value: 1.024031639099121
Key: avg_eval_loss, Value: 0.02374742180109024
Key: avg_epoch_time, Value: 3.264616988000853
Key: avg_checkpoint_time, Value: 0.5262662229997659
  • From main, run this script to verify that an error occurs when using the colon to set a custom name for get_custom_dataset. Notice that get_custom_dataset is called by the code path that is actually trying to build the data collator:
recipes/src/llama_recipes/datasets/custom_dataset.py", line 53, in get_data_collator
    return getattr(module, func_name)(dataset_processer)
TypeError: get_custom_dataset() missing 2 required positional arguments: 'tokenizer' and 'split'
(llama-dev) azureuser@yh-a100:~/cloudfiles/code/test_llama$ ./not-working.sh 
++ python -m llama_recipes.finetuning --dataset custom_dataset --custom_dataset.file custom_dataset.py:get_custom_dataset --model_name meta-llama/Llama-3.2-1B-Instruct --use_peft --peft_method lora --num_epochs 1 --max_train_step 2 --max_eval_step 3
/mnt/batch/tasks/shared/LS_root/mounts/clusters/yh-a100/code/llama-recipes/src/llama_recipes/model_checkpointing/checkpoint_handler.py:17: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
  from torch.distributed._shard.checkpoint import (
--> Model meta-llama/Llama-3.2-1B-Instruct

--> meta-llama/Llama-3.2-1B-Instruct has 1235.8144 Million params

trainable params: 851,968 || all params: 1,236,666,368 || trainable%: 0.0689
Parameter 'function'=<function get_custom_dataset.<locals>.<lambda> at 0x7f94d384c700> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 9846/9846 [00:00<00:00, 10114.99 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 9846/9846 [00:00<00:00, 23359.21 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:01<00:00, 28572.73 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:51<00:00, 858.74 examples/s]
--> Training Set Length = 44042
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 518/518 [00:00<00:00, 13564.99 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 518/518 [00:00<00:00, 21441.96 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 2241/2241 [00:00<00:00, 30807.72 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2241/2241 [00:02<00:00, 858.85 examples/s]
--> Validation Set Length = 2241
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:15<00:00, 2798.59it/s]
length of dataset_train 3974
Traceback (most recent call last):
  File "/anaconda/envs/llama-dev/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda/envs/llama-dev/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/yh-a100/code/llama-recipes/src/llama_recipes/finetuning.py", line 428, in <module>
    fire.Fire(main)
  File "/anaconda/envs/llama-dev/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/anaconda/envs/llama-dev/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/anaconda/envs/llama-dev/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/yh-a100/code/llama-recipes/src/llama_recipes/finetuning.py", line 346, in main
    custom_data_collator = get_custom_data_collator(dataset_processer, dataset_config)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/yh-a100/code/llama-recipes/src/llama_recipes/utils/dataset_utils.py", line 36, in get_custom_data_collator
    return DATALOADER_COLLATE_FUNC[dataset_config.dataset](
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/yh-a100/code/llama-recipes/src/llama_recipes/datasets/custom_dataset.py", line 53, in get_data_collator
    return getattr(module, func_name)(dataset_processer)
TypeError: get_custom_dataset() missing 2 required positional arguments: 'tokenizer' and 'split'
  • Finally, from this PR's branch, run the same command-line flags as above. Notice the key lines, indicating that the code searches for a get_data_collator function instead of get_custom_dataset, does not find it, and falls back to the default. It then successfully fine-tunes the model using the custom dataset (a minimal sketch of this lookup appears after the log below).
Can not find the custom data_collator in the dataset.py file (custom_dataset.py).
Using the default data_collator instead.
(llama-dev) azureuser@yh-a100:~/cloudfiles/code/test_llama$ ./not-working.sh 
++ python -m llama_recipes.finetuning --dataset custom_dataset --custom_dataset.file custom_dataset.py:get_custom_dataset --model_name meta-llama/Llama-3.2-1B-Instruct --use_peft --peft_method lora --num_epochs 1 --max_train_step 2 --max_eval_step 3
/mnt/batch/tasks/shared/LS_root/mounts/clusters/yh-a100/code/llama-recipes/src/llama_recipes/model_checkpointing/checkpoint_handler.py:17: DeprecationWarning: `torch.distributed._shard.checkpoint` will be deprecated, use `torch.distributed.checkpoint` instead
  from torch.distributed._shard.checkpoint import (
--> Model meta-llama/Llama-3.2-1B-Instruct

--> meta-llama/Llama-3.2-1B-Instruct has 1235.8144 Million params

trainable params: 851,968 || all params: 1,236,666,368 || trainable%: 0.0689
Parameter 'function'=<function get_custom_dataset.<locals>.<lambda> at 0x7feb0de7c790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 9846/9846 [00:00<00:00, 10401.56 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 9846/9846 [00:00<00:00, 26194.96 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:01<00:00, 29545.34 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:51<00:00, 847.30 examples/s]
--> Training Set Length = 44042
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 518/518 [00:00<00:00, 12966.32 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 518/518 [00:00<00:00, 20466.57 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 2241/2241 [00:00<00:00, 27766.67 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2241/2241 [00:02<00:00, 801.33 examples/s]
--> Validation Set Length = 2241
Preprocessing dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████| 44042/44042 [00:15<00:00, 2846.23it/s]
length of dataset_train 3974
Can not find the custom data_collator in the dataset.py file (custom_dataset.py).
Using the default data_collator instead.
--> Num of Training Set Batches loaded = 993
Preprocessing dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2241/2241 [00:00<00:00, 2736.16it/s]
--> Num of Validation Set Batches loaded = 206
--> Num of Validation Set Batches loaded = 206
Starting epoch 0/1
train_config.max_train_step: 2
/anaconda/envs/llama-dev/lib/python3.10/site-packages/torch/cuda/memory.py:365: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%|                                                                                                           | 0/993 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Training Epoch: 1/1, step 1/993 completed (loss: 1.5411139726638794):   0%|                                                | 2/993 [00:02<16:21,  1.01it/s]max training steps reached, stopping training, total train steps finished:  2
Training Epoch: 1/1, step 1/993 completed (loss: 1.5411139726638794):   0%|                                                | 2/993 [00:02<19:56,  1.21s/it]
Max CUDA memory allocated was 54 GB
Max CUDA memory reserved was 55 GB
Peak active CUDA memory was 54 GB
CUDA Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 4 GB
evaluating Epoch:   0%|                                                                                                            | 0/206 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
evaluating Epoch:   1%|█▍                                                                                                  | 3/206 [00:00<00:22,  9.08it/s]max eval steps reached, stopping evaluation, total_eval_steps:  3
evaluating Epoch:   1%|█▍                                                                                                  | 3/206 [00:00<00:26,  7.54it/s]
 eval_ppl=tensor(1.0240, device='cuda:0') eval_epoch_loss=tensor(0.0238, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in PATH/to/save/PEFT/model directory
best eval loss on epoch 1 is 0.0237506702542305
Epoch 1: train_perplexity=1.0031, train_epoch_loss=0.0031, epoch time 3.2322842220000894s
Key: avg_train_prep, Value: 1.0031403303146362
Key: avg_train_loss, Value: 0.0031353593803942204
Key: avg_eval_prep, Value: 1.0240349769592285
Key: avg_eval_loss, Value: 0.0237506702542305
Key: avg_epoch_time, Value: 3.2322842220000894
Key: avg_checkpoint_time, Value: 0.5249914010000793
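
To make that fallback concrete, here is a hedged sketch of how the collator lookup might behave on this branch (the module-loading details and names are assumptions, not the verbatim implementation):

```python
# Hypothetical sketch of the post-fix collator lookup; the real code lives in
# src/llama_recipes and may differ in detail.
import importlib.util

def load_custom_data_collator(module_path, dataset_processer):
    """Return module.get_data_collator(dataset_processer) if defined, else None (use default)."""
    spec = importlib.util.spec_from_file_location("custom_dataset", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # The lookup is now pinned to "get_data_collator", regardless of any :func_name suffix.
    if not hasattr(module, "get_data_collator"):
        print(f"Can not find the custom data_collator in the dataset.py file ({module_path}).")
        print("Using the default data_collator instead.")
        return None
    return module.get_data_collator(dataset_processer)
```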

Before submitting

  • [Y] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [Y] Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue? Please add a link
    to it if that's the case.
  • [Y] Did you make sure to update the documentation with your changes?
  • [N/A] Did you write any new necessary tests?

Thanks for contributing 🎉!
