minor changes on docs (modelscope#2)

* replace the image in README with a smaller size version
* remove all .DS_Store files
* fix invalid links and contents
* add .gitignore and .pre-commit-config.yaml
* remove .DS_Store files
HYLcool authored Aug 2, 2023
1 parent b93c065 commit ffbb5bd
Showing 23 changed files with 64 additions and 12 deletions.
Binary file removed .DS_Store
Binary file not shown.
15 changes: 15 additions & 0 deletions .gitignore
@@ -0,0 +1,15 @@

# data & resources
models/
outputs/
assets/

# setup
data_juicer.egg-info/
build/
dist

# others
.DS_Store
.idea/
__pycache__
37 changes: 37 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,37 @@
repos:
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
  - repo: https://github.com/PyCQA/isort.git
    rev: 4.3.21
    hooks:
      - id: isort
  - repo: https://github.com/pre-commit/mirrors-yapf
    rev: v0.32.0
    hooks:
      - id: yapf
        exclude: data_juicer/ops/common/special_characters.py
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: trailing-whitespace
        exclude: thirdparty/
      - id: check-yaml
        exclude: thirdparty/
      - id: end-of-file-fixer
        exclude: thirdparty/
      - id: requirements-txt-fixer
        exclude: thirdparty/
      - id: double-quote-string-fixer
        exclude: ^(thirdparty/|data_juicer/ops/common/special_characters.py)
      - id: check-merge-conflict
        exclude: thirdparty/
      - id: fix-encoding-pragma
        exclude: thirdparty/
        args: [ "--remove" ]
      - id: mixed-line-ending
        exclude: thirdparty/
        args: [ "--fix=lf" ]

exclude: 'docs/.*'
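
The `exclude` values in this config are Python regular expressions, and pre-commit matches them against each file path with `re.search`. The sketch below shows what that means for the two styles used above, the unanchored `thirdparty/` and the `^`-anchored alternation. The helper name `is_excluded` and the sample paths marked as hypothetical are assumptions made for illustration; they are not part of the commit.

```python
import re

# Patterns copied from the config above.
UNANCHORED = r"thirdparty/"
ANCHORED = r"^(thirdparty/|data_juicer/ops/common/special_characters.py)"


def is_excluded(pattern: str, path: str) -> bool:
    """Return True when the pattern is found anywhere in the path (re.search)."""
    return re.search(pattern, path) is not None


paths = [
    "thirdparty/lib/a.py",                           # hypothetical path
    "tools/thirdparty/b.py",                         # hypothetical path
    "data_juicer/ops/common/special_characters.py",  # path named in the config
    "data_juicer/ops/filter.py",                     # hypothetical path
]

for path in paths:
    # Columns: path, excluded by the unanchored pattern, excluded by the anchored one.
    print(path, is_excluded(UNANCHORED, path), is_excluded(ANCHORED, path))
```

Because the unanchored pattern matches anywhere in the path, it would also skip a nested directory such as `tools/thirdparty/`, while the anchored pattern only skips a top-level `thirdparty/`. The unescaped `.` in `special_characters.py` matches any character, which is harmless here but slightly broader than a literal dot.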
10 changes: 5 additions & 5 deletions README.md
@@ -1,12 +1,12 @@
# Data-Juicer: A Data-Centric Text Processing System for Large Language Models

-![Data-Juicer](docs/imgs/data-juicer.png "Data-Juicer")
+![Data-Juicer](docs/imgs/data-juicer.jpg "Data-Juicer")

![](https://img.shields.io/badge/language-Python-214870.svg)
![](https://img.shields.io/badge/license-Apache--2.0-000000.svg)
[![Contributing](https://img.shields.io/badge/Contribution-welcome-brightgreen.svg)](docs/DeveloperGuide.md)

-[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documentation-|-文档)
+[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documentation--文档)
[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md)
[![API Reference](https://img.shields.io/badge/Docs-API_Reference-blue?logo=Markdown)](https://alibaba.github.io/data-juicer/)
[![ModelScope-10+ Demos](https://img.shields.io/badge/ModelScope-10+_Demos-4e29ff.svg?logo=data:image/svg+xml;base64,PHN2ZyB2aWV3Qm94PSIwIDAgMjI0IDEyMS4zMyIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCTxwYXRoIGQ9Im0wIDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtOTkuMTQgNzMuNDloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xNzYuMDkgOTkuMTRoLTI1LjY1djIyLjE5aDQ3Ljg0di00Ny44NGgtMjIuMTl6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTEyNC43OSA0Ny44NGgyNS42NXYyNS42NWgtMjUuNjV6IiBmaWxsPSIjMzZjZmQxIiAvPgoJPHBhdGggZD0ibTAgMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xOTguMjggNDcuODRoMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xOTguMjggMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xNTAuNDQgMHYyMi4xOWgyNS42NXYyNS42NWgyMi4xOXYtNDcuODR6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTczLjQ5IDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiMzNmNmZDEiIC8+Cgk8cGF0aCBkPSJtNDcuODQgMjIuMTloMjUuNjV2LTIyLjE5aC00Ny44NHY0Ny44NGgyMi4xOXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtNDcuODQgNzMuNDloLTIyLjE5djQ3Ljg0aDQ3Ljg0di0yMi4xOWgtMjUuNjV6IiBmaWxsPSIjNjI0YWZmIiAvPgo8L3N2Zz4K)](#demos)
@@ -34,7 +34,7 @@ Table of Contents
* [Data Visualization](#data-visualization)
* [Build Up Config Files](#build-up-config-files)
* [Preprocess raw data (Optional)](#preprocess-raw-data-optional)
-* [Documentation | 文档](#documentation-|-文档)
+* [Documentation | 文档](#documentation--文档)
* [Data Recipes](#data-recipes)
* [Demos](#demos)
* [License](#license)
@@ -53,7 +53,7 @@ Table of Contents

- **Comprehensive Processing Recipes**: Offering tens of [pre-built data processing recipes](configs/refine_recipe/README.md) for pre-training, SFT, en, zh, and more scenarios.

-- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation-|-文档), [easy start guides](#quick-start) and [demo configs](configs/), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
+- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation--文档), [easy start guides](#quick-start) and [demo configs](configs/), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).

- **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing.

@@ -169,7 +169,7 @@ streamlit run app.py

## Documentation | 文档

-- [Overall](README.md) | [概览](README_ZH.md)
+- [Overview](README.md) | [概览](README_ZH.md)
- [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
- [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
- [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
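
The link fix in this file replaces `#documentation-|-文档` with `#documentation--文档`. That matches how GitHub appears to derive heading anchors: lowercase the heading, drop punctuation such as `|`, and turn each space into a hyphen, so the two spaces left around the removed `|` become `--`. The function below is a rough approximation of that rule for illustration only; `github_anchor` is a hypothetical name, and the real renderer handles more cases (for example, duplicate headings).

```python
import re


def github_anchor(heading: str) -> str:
    """Approximate a GitHub-style heading anchor (illustrative, not exhaustive)."""
    slug = heading.strip().lower()
    # Drop every character that is not a word character, whitespace, or hyphen.
    # In Python 3, \w is Unicode-aware, so CJK characters are kept.
    slug = re.sub(r"[^\w\s-]", "", slug)
    # Each remaining space becomes one hyphen; consecutive spaces are not collapsed.
    return slug.replace(" ", "-")


print(github_anchor("Documentation | 文档"))
print(github_anchor("Preprocess raw data (Optional)"))
```

Under these assumptions the first call yields the corrected anchor used in the diff, and the second yields the `#preprocess-raw-data-optional` anchor that already appears in the table of contents.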
10 changes: 5 additions & 5 deletions README_ZH.md
@@ -1,12 +1,12 @@
# Data-Juicer: 为大语言模型提供更高质量、更丰富、更易“消化”的数据

-![Data-Juicer](docs/imgs/data-juicer.png "Data-Juicer")
+![Data-Juicer](docs/imgs/data-juicer.jpg "Data-Juicer")

![](https://img.shields.io/badge/language-Python-214870.svg)
![](https://img.shields.io/badge/license-Apache--2.0-000000.svg)
[![Contributing](https://img.shields.io/badge/Contribution-welcome-brightgreen.svg)](docs/DeveloperGuide_ZH.md)

-[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documentation-|-文档)
+[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documentation--文档)
[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md)
[![API Reference](https://img.shields.io/badge/Docs-API_Reference-blue?logo=Markdown)](https://alibaba.github.io/data-juicer/)
[![ModelScope-10+ Demos](https://img.shields.io/badge/ModelScope-10+_Demos-4e29ff.svg?logo=data:image/svg+xml;base64,PHN2ZyB2aWV3Qm94PSIwIDAgMjI0IDEyMS4zMyIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCTxwYXRoIGQ9Im0wIDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtOTkuMTQgNzMuNDloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xNzYuMDkgOTkuMTRoLTI1LjY1djIyLjE5aDQ3Ljg0di00Ny44NGgtMjIuMTl6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTEyNC43OSA0Ny44NGgyNS42NXYyNS42NWgtMjUuNjV6IiBmaWxsPSIjMzZjZmQxIiAvPgoJPHBhdGggZD0ibTAgMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xOTguMjggNDcuODRoMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xOTguMjggMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xNTAuNDQgMHYyMi4xOWgyNS42NXYyNS42NWgyMi4xOXYtNDcuODR6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTczLjQ5IDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiMzNmNmZDEiIC8+Cgk8cGF0aCBkPSJtNDcuODQgMjIuMTloMjUuNjV2LTIyLjE5aC00Ny44NHY0Ny44NGgyMi4xOXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtNDcuODQgNzMuNDloLTIyLjE5djQ3Ljg0aDQ3Ljg0di0yMi4xOWgtMjUuNjV6IiBmaWxsPSIjNjI0YWZmIiAvPgo8L3N2Zz4K)](#demos)
@@ -34,7 +34,7 @@ Data-Juicer 是一个以数据为中心的文本处理系统,旨在为大语
* [数据可视化](#数据可视化)
* [构建配置文件](#构建配置文件)
* [预处理原始数据(可选)](#预处理原始数据(可选))
-* [Documentation | 文档](#documentation-|-文档)
+* [Documentation | 文档](#documentation--文档)
* [数据处理菜谱](#数据处理菜谱)
* [演示样例](#演示样例)
* [开源协议](#开源协议)
@@ -53,7 +53,7 @@ Data-Juicer 是一个以数据为中心的文本处理系统,旨在为大语

* **全面的处理菜谱**: 为预训练、SFT、中英文等场景提供数十种[预构建的数据处理菜谱](configs/refine_recipe/README_ZH.md)

-* **用户友好**: 设计简单易用,提供全面的[文档](#documentation-|-文档)、简易[入门指南](#快速上手)[演示配置](configs/),并且可以轻松地添加/删除[现有配置](configs/config_all.yaml)中的算子。
+* **用户友好**: 设计简单易用,提供全面的[文档](#documentation--文档)、简易[入门指南](#快速上手)[演示配置](configs/),并且可以轻松地添加/删除[现有配置](configs/config_all.yaml)中的算子。

* **灵活 & 易扩展**: 支持大多数数据格式(如jsonl、parquet、csv等),并允许灵活组合算子。支持[自定义算子](docs/DeveloperGuide_ZH.md#构建自己的算子),以执行定制化的数据处理。

@@ -164,7 +164,7 @@ streamlit run app.py

## Documentation | 文档

-* [Overall](README.md) | [概览](README_ZH.md)
+* [Overview](README.md) | [概览](README_ZH.md)
* [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
* [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
* [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
Binary file removed configs/.DS_Store
Binary file not shown.
Binary file removed configs/refine_recipe/.DS_Store
Binary file not shown.
Binary file removed data_juicer/.DS_Store
Binary file not shown.
Binary file removed data_juicer/ops/.DS_Store
Binary file not shown.
Binary file removed demos/.DS_Store
Binary file not shown.
Binary file removed demos/data_visualization_diversity/.DS_Store
Binary file not shown.
Binary file removed demos/data_visualization_op_effect/.DS_Store
Binary file not shown.
Binary file removed demos/data_visualization_statistics/.DS_Store
Binary file not shown.
Binary file removed demos/tool_quality_classifier/.DS_Store
Binary file not shown.
Binary file removed docs/.DS_Store
Binary file not shown.
Binary file added docs/imgs/data-juicer.jpg
Binary file removed docs/imgs/data-juicer.png
Binary file not shown.
Binary file removed tests/.DS_Store
Binary file not shown.
Binary file removed tests/ops/.DS_Store
Binary file not shown.
Binary file removed tools/.DS_Store
Binary file not shown.
Binary file removed tools/evaluator/.DS_Store
Binary file not shown.
2 changes: 1 addition & 1 deletion tools/quality_classifier/README.md
@@ -132,7 +132,7 @@ The quality classifiers here mainly refer to the GPT-3 quality classifier mentio
### Tokenizers

- Standard Tokenizer in Spark: split texts by whitespaces.
-- zh/code.sp.model: trained using sentencepiece with BPE.
+- zh/code.sp.model: trained using sentencepiece.

### Keep Methods
- label: `doc_score > 0.5`
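
As a small illustration of the two ideas visible in this hunk, the sketch below splits text on whitespace and applies the `label` keep rule `doc_score > 0.5`. It is plain Python standing in for the Spark tokenizer, not the tool's actual code; the function names `whitespace_tokenize` and `keep_by_label` and the sample scores are assumptions.

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """Split text on runs of whitespace (plain-Python stand-in, not Spark)."""
    return text.split()


def keep_by_label(doc_score: float, threshold: float = 0.5) -> bool:
    """The `label` keep method: keep a document whose score is above 0.5."""
    return doc_score > threshold


print(whitespace_tokenize("Data-Juicer  cleans\ttext"))

# Hypothetical scores, used only to show the strict inequality.
for score in (0.2, 0.5, 0.9):
    print(score, keep_by_label(score))
```

Note that the comparison is strict, so a document scoring exactly 0.5 is not kept under this rule.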
2 changes: 1 addition & 1 deletion tools/quality_classifier/README_ZH.md
@@ -133,7 +133,7 @@ python eval.py --help
### Tokenizers

- Spark 中的标准 Tokenizer: 根据空白字符分割文本.
-- zh/code.sp.model: 使用 sentencepiece BPE 训练得到。
+- zh/code.sp.model: 使用 sentencepiece 训练得到。

### Keep Methods

