『Fix』Broken links and section-numbering issues #34

Open · wants to merge 1 commit into base: main
20 changes: 10 additions & 10 deletions docs/content/ch04.md
@@ -386,7 +386,7 @@
Input $x$: Thank you $<X>$ me to your party $<Y>$ week.
Output $y$: $<X>$ for inviting $<Y>$ last

- ### 10.2.2 Retrieval methods
+ ### 4.2.2 Retrieval methods
Suppose we have a store $S$, which is a collection of sequences (usually documents or passages).

@@ -408,7 +408,7 @@
- Retrieve $(x', y') \in S$ such that $x'$ is most similar to $x$.
- Generate $y = y'$ (a minimal sketch of this retrieve-then-copy step follows below).
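
A minimal sketch of this retrieve-then-copy baseline (an illustration added here, not the original notes' code): the toy bag-of-words `embed` stands in for a trained dense encoder, and the store contents are made up.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words encoder; a real system would use a trained dense encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve_and_copy(x: str, store: list[tuple[str, str]]) -> str:
    """Retrieve (x', y') in S whose x' is most similar to x, then output y = y'."""
    q = embed(x)
    best = max(store, key=lambda pair: float(q @ embed(pair[0])))
    return best[1]

S = [("Why is the sky blue?", "Because of Rayleigh scattering."),
     ("What is the capital of France?", "Paris.")]
print(retrieve_and_copy("Why does the sky look blue?", S))
```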

- ### 10.2.3 Retrieval-augmented generation (RAG) ([Lewis et al., 2020](https://arxiv.org/pdf/2005.11401.pdf))
+ ### 4.2.3 Retrieval-augmented generation (RAG) ([Lewis et al., 2020](https://arxiv.org/pdf/2005.11401.pdf))

![rag-architecture](images/rag-architecture.png)

@@ -420,7 +420,7 @@

In practice, $\sum_{z \in S}$ is replaced by the top-k retrieved passages (analogous to choosing the top 1 or 2 experts in mixture-of-experts).
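
The sketch below (an illustration added here, not the authors' code) shows this top-k approximation of the RAG objective $p(y \mid x) = \sum_{z \in S} p(z \mid x)\, p(y \mid z, x)$; `retriever_score` and `generator_log_prob` are hypothetical stand-ins for the retriever and generator described next.

```python
import math

def rag_log_prob(x, y, store, retriever_score, generator_log_prob, k=2):
    """Approximate log p(y|x) = log sum_z p(z|x) p(y|z,x) using only the top-k passages."""
    # 1. Score every passage z against the query x and keep the top-k.
    top = sorted(store, key=lambda z: retriever_score(x, z), reverse=True)[:k]
    # 2. Softmax the retriever scores over the top-k -> p(z|x).
    scores = [retriever_score(x, z) for z in top]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    # 3. Marginalize over z: sum_z p(z|x) * p(y|z,x).
    p_y = sum((w / total) * math.exp(generator_log_prob(y, z, x))
              for w, z in zip(weights, top))
    return math.log(p_y)

# Example call with dummy scorers, just to show the interface.
print(rag_log_prob("where is paris", "in france",
                   store=["Paris is the capital of France.", "FAISS is a library."],
                   retriever_score=lambda x, z: len(set(x.split()) & set(z.lower().split())),
                   generator_log_prob=lambda y, z, x: -2.0))
```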

- #### 10.2.3.1 Retriever
+ #### 4.2.3.1 Retriever

**Dense Passage Retrieval (DPR)** ([Karpukhin et al., 2020](https://arxiv.org/pdf/2004.04906.pdf))

@@ -433,7 +433,7 @@
- Negatives: random passages, or passages retrieved by BM25 that do not contain the answer
- Inference: use [FAISS](https://github.com/facebookresearch/faiss) (Facebook AI Similarity Search); a minimal sketch follows below
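
A rough illustration of the dual-encoder + FAISS inference pattern (not DPR's actual code): the toy `encode` below stands in for the trained BERT passage/query encoders, while the FAISS calls (`IndexFlatIP`, `add`, `search`) are the library's real API.

```python
import numpy as np
import faiss  # pip install faiss-cpu

DIM = 128  # toy dimension; DPR's BERT encoders produce 768-dim vectors

def encode(texts):
    """Toy encoder standing in for BERT_p / BERT_q; returns L2-normalized float32 vectors."""
    out = np.zeros((len(texts), DIM), dtype="float32")
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            out[i, hash(tok) % DIM] += 1.0
    faiss.normalize_L2(out)
    return out

passages = ["The Eiffel Tower is in Paris.",
            "FAISS performs efficient similarity search over dense vectors."]
index = faiss.IndexFlatIP(DIM)        # exact maximum-inner-product index
index.add(encode(passages))           # index the passage embeddings offline
scores, ids = index.search(encode(["Where is the Eiffel Tower?"]), 1)
print(passages[ids[0][0]], float(scores[0][0]))
```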

- #### 10.2.3.2 Generator
+ #### 4.2.3.2 Generator

$$
p(y \mid z, x) = p(y \mid \text{concat}(z, x)).
$$

@@ -442,12 +442,12 @@
- Use BART-large (400M parameters), where the input is the retrieved passage $z$ concatenated with the input $x$ (a minimal sketch follows this list)
- Recall that BART was trained on web, news, book, and story data with a denoising objective (e.g., masking)
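
A minimal concat-then-generate sketch using Hugging Face BART (an added illustration; RAG's actual generator is trained jointly with the retriever, and the separator string here is an arbitrary choice):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

z = "The Eiffel Tower is a wrought-iron lattice tower in Paris, France."  # retrieved passage
x = "Where is the Eiffel Tower located?"                                  # user input
inputs = tokenizer(z + " // " + x, return_tensors="pt")  # model p(y | concat(z, x))
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```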

- #### 10.2.3.3 Training
+ #### 4.2.3.3 Training

- Initialize with BART and DPR (which is itself initialized from BERT)
- Train $\text{BART}$ and $\text{BERT}_\text{q}$ (the query encoder); a sketch of this split follows below
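
A rough sketch of which components receive gradients, using generic `bert-base-uncased` checkpoints as stand-ins for DPR's trained encoders (an assumption made here for brevity): only the query encoder and BART are optimized, while the passage encoder is kept frozen so the precomputed passage index does not have to be rebuilt.

```python
import torch
from transformers import BartForConditionalGeneration, BertModel

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
bert_q = BertModel.from_pretrained("bert-base-uncased")  # query encoder (trained)
bert_d = BertModel.from_pretrained("bert-base-uncased")  # passage encoder (kept frozen)

for p in bert_d.parameters():
    p.requires_grad = False  # frozen passage encoder -> the offline index stays valid

optimizer = torch.optim.AdamW(
    list(bert_q.parameters()) + list(bart.parameters()), lr=3e-5)
```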

- #### 10.2.3.4 Experiments
+ #### 4.2.3.4 Experiments

- On the Jeopardy question generation task, the retrieval results for the input "Hemingway":

@@ -458,28 +458,28 @@

For comparison, GPT-3's few-shot results are: NaturalQuestions (29.9%), WebQuestions (41.5%), TriviaQA (71.2%).

- ### 10.2.4 RETRO ([Borgeaud et al., 2021](https://arxiv.org/pdf/2112.04426.pdf))
+ ### 4.2.4 RETRO ([Borgeaud et al., 2021](https://arxiv.org/pdf/2112.04426.pdf))

- Retrieval is performed over chunks of 32 tokens (a toy sketch follows this list)
- Store: 2 trillion tokens
- 7 billion parameters (25× fewer than GPT-3)
- Retrieval uses a frozen BERT (it is not updated)
- Trained on MassiveText (the same dataset used to train Gopher)
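
A toy sketch of chunk-level retrieval (added here for illustration; RETRO's real pipeline uses a frozen BERT encoder and an approximate nearest-neighbor index over the 2T-token store):

```python
import numpy as np

CHUNK = 32  # tokens per retrieval chunk, as in RETRO

def chunks(tokens, size=CHUNK):
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(chunk, dim=64):
    """Toy frozen embedding; RETRO uses a frozen BERT for this step."""
    v = np.zeros(dim)
    for t in chunk:
        v[t % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

# Offline: embed every 32-token chunk of the store (a tiny stand-in for 2T tokens).
corpus_tokens = list(range(2048))
store = [(c, embed(c)) for c in chunks(corpus_tokens)]

def neighbors(input_chunk, k=2):
    """Return the k nearest store chunks for one input chunk; RETRO feeds these
    to the decoder through chunked cross-attention."""
    q = embed(input_chunk)
    order = np.argsort([-(q @ e) for _, e in store])
    return [store[i][0] for i in order[:k]]

print(neighbors(list(range(64, 96)))[0][:5])
```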

- #### 10.2.4.1 Experimental results
+ #### 4.2.4.1 Experimental results

- Performs very well on language modeling
- NaturalQuestions accuracy: 45.5% (the SOTA is 54.7%)

![retro-lm-results](images/retro-lm-results.png)

- ### 10.2.5 Discussion
+ ### 4.2.5 Discussion

- Retrieval-based models are highly suited to knowledge-intensive question-answering tasks.
- Beyond scalability, retrieval-based models also offer interpretability and the ability to update the store.
- It is still unclear whether these models have the same general-purpose capabilities as dense Transformers.

- ## 10.3 Overall summary
+ ## 4.3 Overall summary

- To scale models further, the dense Transformer needs to be improved.
- Combining mixture-of-experts and retrieval-based methods is more effective.
3 changes: 2 additions & 1 deletion docs/content/ch06.md
@@ -339,7 +339,8 @@ Adam increases storage from 2× the model parameters ($\theta_t, g_t$) to 4× ($\theta_t, g_t, m_t, v_t$)

## Further reading

- - [Mixed precision training](https://lilianweng.github.io/lil-log/2021/09/24/train-large-neural-networks.html#mixed-precision-training)
+ - [Mixed precision training](https://lilianweng.github.io/posts/2021-09-25-train-large/#mixed-precision-training)
+ - [Mixed Precision Training](https://arxiv.org/pdf/1710.03740.pdf). Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu. ICLR 2018.
- [Fixing Weight Decay Regularization in Adam](https://arxiv.org/pdf/1711.05101.pdf). I. Loshchilov, F. Hutter. 2017. Introduces AdamW.
- [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/pdf/2003.10555.pdf). Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. ICLR 2020.
- [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/pdf/2006.03654.pdf). Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. ICLR 2020.
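
The hunk context above notes that Adam raises optimizer-related storage from 2× to 4× the parameter count. As a rough back-of-the-envelope sketch of that arithmetic (an added illustration, assuming fp32 values and a hypothetical 1.5B-parameter model):

```python
def optimizer_state_gb(num_params: int, use_adam: bool = True, bytes_per_value: int = 4) -> float:
    """Storage for (theta, g) vs. Adam's (theta, g, m, v), assuming fp32 values."""
    copies = 4 if use_adam else 2
    return num_params * copies * bytes_per_value / 1e9

print(optimizer_state_gb(1_500_000_000, use_adam=False))  # ~12 GB for theta and g
print(optimizer_state_gb(1_500_000_000, use_adam=True))   # ~24 GB once m and v are added
```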