Commit

Added unmask samples to pretrain logging
Signed-off-by: Mustafa Eyceoz <[email protected]>
Maxusmusti committed Nov 25, 2024
1 parent 23725b1 commit 35ee0c5
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion src/instructlab/training/data_process.py
@@ -355,7 +355,10 @@ def main(args: DataProcessArgs):
     print("\033[92mCategorizing training data type...\033[0m")
     data_with_input_ids = data_with_input_ids.map(
         lambda x: {
-            "is_pretrain": get_sp_token(tokenizer, "<|pretrain|>")[0] in x["input_ids"]
+            "is_pretrain": (
+                get_sp_token(tokenizer, "<|pretrain|>")[0] in x["input_ids"]
+            )
+            or x["unmask"]
         },
         num_proc=NUM_PROC,
     )
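
For reviewers, the effect of this change can be sketched in isolation. The snippet below is a minimal illustration, not the project's code: it uses plain dicts instead of a dataset, and `PRETRAIN_TOKEN_ID` and `categorize` are hypothetical names standing in for the result of `get_sp_token(tokenizer, "<|pretrain|>")[0]` and the lambda passed to `map`. It assumes `unmask` is a boolean field on every sample.

```python
# Hypothetical token id standing in for get_sp_token(tokenizer, "<|pretrain|>")[0].
PRETRAIN_TOKEN_ID = 32000


def categorize(sample, pretrain_token_id=PRETRAIN_TOKEN_ID):
    """Mirror the updated lambda: a sample counts as pretrain if it
    contains the pretrain token OR if it is flagged as unmask."""
    return {
        "is_pretrain": (pretrain_token_id in sample["input_ids"])
        or sample["unmask"]
    }


samples = [
    {"input_ids": [1, PRETRAIN_TOKEN_ID, 5], "unmask": False},  # has pretrain token
    {"input_ids": [1, 2, 3], "unmask": True},   # unmask only (newly counted)
    {"input_ids": [1, 2, 3], "unmask": False},  # neither
]

results = [categorize(s)["is_pretrain"] for s in samples]
print(results)
```

Before this commit, only the first sample would have been categorized as pretrain. The parentheses around the `in` test do not change the evaluation order, since `in` already binds tighter than `or`; they only make the grouping explicit.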
