Question about training your own tokenizer #109
You are probably referring to Unicode characters.

```python
import json

def is_chinese(char):
    # Check whether a single character falls in the common CJK Unicode blocks.
    # Code points above U+FFFF require the 8-digit \U escape in Python.
    return any([
        '\u4e00' <= char <= '\u9fff',          # CJK Unified Ideographs
        '\u3400' <= char <= '\u4dbf',          # CJK Extension A
        '\U00020000' <= char <= '\U0002a6df',  # CJK Extension B
        '\U0002a700' <= char <= '\U0002b73f',  # CJK Extension C
        '\U0002b740' <= char <= '\U0002b81f',  # CJK Extension D
        '\U0002b820' <= char <= '\U0002ceaf',  # CJK Extension E
        '\uf900' <= char <= '\ufaff',          # CJK Compatibility Ideographs
        '\U0002f800' <= char <= '\U0002fa1f',  # CJK Compatibility Ideographs Supplement
    ])

def calculate_chinese_ratio(file_path):
    total_chars = 0
    chinese_chars = 0
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            try:
                data = json.loads(line)
                text = data.get("text", "")
                for char in text:
                    total_chars += 1
                    if is_chinese(char):
                        chinese_chars += 1
            except json.JSONDecodeError as e:
                print(f"Parse error on line {line_num}: {e}")
                continue
    if total_chars == 0:
        print("The text contains no characters.")
        return
    ratio = chinese_chars / total_chars
    percentage = ratio * 100
    print(f"Total characters: {total_chars}")
    print(f"Chinese characters: {chinese_chars}")
    print(f"Chinese character ratio: {percentage:.2f}%")

if __name__ == "__main__":
    file_path = 'dataset/tokenizer_train.jsonl'
    calculate_chinese_ratio(file_path)
```

As you can see
Thank you, author! I've learned a lot.
The answer in issue #111 may help resolve your question.
@jingyaogong Hello, could you explain the purpose of processing the tokenizer training data into Unicode characters here? Also, as a general practice, is the tokenizer training dataset the same one used for pretraining?
Hi, the first question is fairly general, so I'll just paste GPT-4's reply: Background: Unicode is a universal character encoding standard that covers almost all of the world's languages and symbols. Every character has a unique number in Unicode, called a code point; for example, the Chinese character '中' has the code point U+4E2D.
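As a quick self-contained check (plain Python, nothing project-specific), you can print the code points of a few sample characters:

```python
# Print the Unicode code point of a few sample characters.
for ch in ["A", "中", "あ"]:
    print(f"{ch!r} -> U+{ord(ch):04X}")
# Output: 'A' -> U+0041, '中' -> U+4E2D, 'あ' -> U+3042
```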
Purpose: there are several main reasons for processing the text into Unicode characters.
Practical application: for example, the SentencePiece tokenizer first converts text into a sequence of Unicode characters and then performs subword segmentation on those characters. This approach is especially well suited to languages without explicit word boundaries (such as Chinese and Japanese). 2. Is the tokenizer training dataset the same as the pretraining dataset? The SFT data is used here; which dataset you use doesn't actually matter much.
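For concreteness, here is a minimal sketch of training a byte-level BPE tokenizer on dataset/tokenizer_train.jsonl using the Hugging Face tokenizers library; the vocab size and special-token list below are illustrative placeholders, not necessarily the settings used in this repo:

```python
# Minimal sketch: train a byte-level BPE tokenizer on the jsonl corpus.
# Assumes the Hugging Face `tokenizers` library; vocab_size and the
# special-token list are illustrative placeholders.
import json
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def read_texts(file_path):
    # Yield the "text" field of each jsonl line.
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)["text"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=6400,                          # placeholder size
    special_tokens=["<unk>", "<s>", "</s>"],  # placeholder special tokens
    show_progress=True,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

tokenizer.train_from_iterator(read_texts('dataset/tokenizer_train.jsonl'), trainer=trainer)
tokenizer.save('tokenizer.json')
```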
I'd like to ask: why doesn't the tokenizer_train jsonl file use Chinese? Wouldn't that lead to low encoding efficiency for Chinese?
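For reference, one rough way to quantify "encoding efficiency" is the average number of characters packed into a single token. A minimal sketch, assuming the Hugging Face transformers library; the tokenizer path and sample sentence are placeholders:

```python
# Rough measure of how efficiently a tokenizer encodes Chinese text:
# characters per token (higher means better compression).
# The tokenizer path and sample sentence are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')  # placeholder path
text = "今天天气很好,我们一起去公园散步吧。"  # placeholder Chinese sample

ids = tokenizer.encode(text, add_special_tokens=False)
print(f"characters: {len(text)}, tokens: {len(ids)}")
print(f"chars per token: {len(text) / len(ids):.2f}")
```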