300.txt
{"input": "Despite their wide adoption, the underlying training and memorization dynamicsof very large language models is not well understood. We empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process. We measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings. Surprisingly, we show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process. We also analyze the memorization dynamics of different parts of speech and find that models memorize nouns and numbers first; we hypothesize and provide empirical evidence that nouns and numbers act as a unique identifier for memorizing individual training examples. Together, these findings present another piece of the broader puzzle of trying to understand what actually improves as models get bigger.", "question": "这篇论文主要内容是什么", "<ans>": "我们对因果和掩蔽语言建模中的精确记忆进行了实证研究,涵盖了模型大小和整个训练过程。我们测量了数据集大小、学习速率和模型大小对记忆的影响,发现较大的语言模型在所有设置中更快地记忆训练数据。令人惊讶的是,我们发现较大的模型可以在过度拟合之前记住更大部分的数据,并且在整个训练过程中忘记的较少。我们还分析了不同词类的记忆动态,发现模型首先记忆名词和数字;我们假设并提供了经验证据,证明名词和数字作为记忆单个训练样本的唯一标识符。总的来说,这些发现提出了另一个更广泛的难题,即试图理解随着模型变大,什么实际上得到了改善。"}
{"input": "Memorization in Language Models: Unintended memorization is a known challenge for language models [12, 13], which makes them open to extraction attacks [14, 15] and membership inference attacks [16, 17], although there has been work on mitigating these vulnerabilities [11, 18]. Recent work has argued that memorization is not exclusively harmful, and can be crucial for certain types of generalization (e.g., on QA tasks) [19, 20, 21], while also allowing the models to encode significant amounts of world or factual knowledge [22, 23, 24]. There is also a growing body of work analyzing fundamental properties of memorization in language models [9, 8, 10]. Most related to our work [8] analyzes memorization of fully trained language models and observes a dependence on model scale, training data duplication, and prompting context length. While we also study scaling behavior, our focus instead is on the memorization dynamics throughout training. Forgetting in Language Models: There has also been work studying memory degradation (forgetting) in language models. Catastrophic forgetting or catastrophic interference, first reported in [25, 26], studies how neural networks tend to forget the information from previous trained tasks or training batches, when trained on new data. This provides a key challenge for continual learning (or life-long learning) [27], where the goal is to gradually learn from a single pass over a, typically very large, stream of data. A number of mechanisms have been proposed for increasing robustness against catastrophic forgetting [28, 29, 30, 31, 32, 33]. There is also a growing body of work demonstrating that model scale and dataset scale can make models more resistant to forgetting [34, 35], as well as work characterizing how forgetting naturally occurs in image classifiers [36]. Machine unlearning is a technique that forces a trained model to forget a previously learned sample [37, 38], which is primarily motivated by data protection and privacy regulations [39, 40, 41, 42]. Our work is unique in its focus on measuring forgetting during training, and quantifying how it varies with scale. Scaling Laws: We have consistently seen performance gains by scaling model size [5, 6, 7, 43, 44], and scale itself has been known to push internal model behavior away from classical bias-variance regimes [45]. Recent efforts have focused on trying to model the scaling laws for language models, including data and model size [46, 47], applications to transfer learning [48], routing networks [49], and various autoregressive generative tasks [50]. While the bulk of work in scaling laws has been empirical, an interesting line of work focuses on theoretically explaining neural scaling laws [51]. Most scaling laws focus exclusively on cross-entropy loss, while we instead study memorization as models scale, which we define formally in § 3.", "question": "这篇论文的理论背景是什么", "<ans>": "(1)语言模型中的记忆:无意识的记忆是语言模型的一个已知挑战,这使得它们容易受到提取攻击和成员推理攻击。最近的研究表明,记忆不仅仅是有害的,而且对于某些类型的概括(例如在问答任务中)也是至关重要的,同时还允许模型对大量的世界或事实知识进行编码。本文重点是整个训练过程中的记忆动态。(2)语言模型中的遗忘:也有研究语言模型中记忆退化(遗忘)的工作。灾难性遗忘或灾难性干扰。研究当对新数据进行训练时,神经网络如何倾向于忘记来自先前训练任务或训练批次的信息。这为持续学习(或终身学习)提供了一个关键的挑战。已有研究提出了许多机制来增加对灾难性遗忘的鲁棒性。也有越来越多的工作表明,模型规模和数据集规模可以使模型更能抵抗遗忘,以及表征遗忘如何在图像分类器中自然发生的工。(3)缩放定律:通过缩放模型大小来提高性能;一个有趣的工作方向是从理论上解释神经标度律,大多数标度定律只关注交叉熵损失,而本文将记忆作为模型标度来研究。"}
{"input": "3 Experimental Setup In order to perform a large-scale study of the dynamics of memorization over training, our memorization metric must be reasonably easy to compute but also precise enough to tell us how much the model will actually remember from the training data. Label memorization is an ideal candidate, because it has consistently provided theoretical insight into underlying properties of neural networks, remains applicable in empirical settings, and is relatively cheap to compute. We formulate our metric as an analog of label memorization for self-supervised settings. Definition 1 Let V denote the vocabulary size. Let C denote a set of contexts, which can be thought of as a list of tuples (s, y) where s is an input context (incomplete block of text) and y is the index of the ground truth token in the vocabulary that completes the block of text. Let S denote the set of input contexts, and let f : S → RV denote a language model. A context c = (s, y) ∈ C is memorized if argmax(f(s)) = y. Note that a single word can appear as the ground-truth token for multiple contexts. For a given set of contexts C (i.e a given training dataset), we can then analyze the proportion of memorized contexts P(s,y)∈C 1{argmax(f(s)) = y} M(f) = |C| We refer to this as exact memorization, although it can also be seen as accuracy since we measure how often the argmax of the language model matches the ground truth token. Throughout this work, when we refer to memorization, we will be referring to Definition 1 unless we specify otherwise. We define τ to be a threshold value for M(f), and denote T(N, τ ) as the minimal number of times a language model f with N parameter needs to see each training datapoint in order to satisfy M(f) ≥ τ . When leveraging bigger datasets, we introduce Tupdate(N, τ ) as the minimal number of gradient descent updates U a language model f with N parameters needs to perform, to satisfy Mupdate(f, U) ≥ τ , where Mupdate(f, U) is defined as the memorization on the batch of data on which the model performs the U’th gradient descent update. Previous work analyzing language modeling memorization defines memorization differently. Motivated by privacy concerns, both [8] and [14] define memorization from a training data extraction standpoint, in which a string s is extractable if it can be produced by interacting with the language model. More specifically, [14] defines a string s as being k-eidetic memorized if it is extractable and appears in at most k training examples. [8] defines a string s as k-memorized if the language model can produce it via prompting with k tokens of context from training data. This definition only works for causal language modeling because of the dependence on prompting with training data; for masked language modeling [8] uses Definition 1 above. Note that if an example is exactly memorized, it is extractable by definition. In other words, both the set of k-eidetic memorized tokens and the set of k-memorized tokens contain the set of exactly memorized tokens (formally, different exactly memorized tokens may be contained in different sets, depending on k). Therefore, analyzing exact memorization gives a type of lower bound on the k-eidetic memorization and k-memorization. 
In a different line of work motivated by estimating the influence of individual training examples, [9] defines a training example x as memorized if the difference in expected model performance (where model performance is defined as M(f) above) over subsets of data including x and subsets of data not including x, is sufficiently large. This definition pulls from previous work in theoretically analyzing label memorization in classification settings [52]. Model Architectures: We replicate publicly available references for Transformer language model architectures [53, 54]. We use the 125 million, 355 million, 1.3 billion, 2.7 billion, 6.7 billion, and 13 billion model configurations (see § A.4 for more explicit architecture and hyperparameter configurations). We fix these architectures across all experiments. We train using the FairSeq framework [55] with PyTorch [56] as the underlying framework. For our larger models, we use the fully sharded data-parallel implementation available in FairScale [57] and use Aim experiment tracking [58].Datasets: We use two existing datasets across all our experiments: the WIKITEXT-103 benchmark containing around 103 million tokens [59], and the RoBERTa corpus [60] used to train the original RoBERTa model, containing around 39 billion tokens (we refer to this as the ROBERTA dataset). ", "question": "这篇论文实验步骤是什么", "<ans>": "实验步骤:(1)各项参数定义,包括定义 1 让 V 表示词汇量。(2)准备模型架构:复制Transformer语言模型的架构,包括1.25亿等不同配置。(3)准备数据集:使用两个现有的数据集 WIKITEXT-103 benchmark和 RoBERTa corpus。"}
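The exact-memorization metric in Definition 1 reduces to an exact-match check over next-token predictions. Below is a minimal sketch (not the authors' code) in PyTorch, assuming a language model that returns next-token logits of shape (1, vocab_size) for a single input context; the names model and contexts are hypothetical.

import torch

@torch.no_grad()
def exact_memorization(model, contexts, device="cpu"):
    # contexts: iterable of (s, y) pairs, where s is a LongTensor of input
    # token ids and y is the index of the ground-truth completion token.
    memorized = 0
    for s, y in contexts:
        logits = model(s.unsqueeze(0).to(device))  # assumed output shape: (1, vocab_size)
        if logits.argmax(dim=-1).item() == y:      # argmax(f(s)) == y
            memorized += 1
    return memorized / len(contexts)               # M(f) in Definition 1

The same loop, restricted to the batch used for the U'th gradient update, would give M_update(f, U) as defined above.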
{"input": "We study the properties of memorization dynamics over language model training and demonstrate that larger models memorize faster. We also measure the properties of forgetting curves and surprisingly find that forgetting reaches a baseline, which again increases with the model scale. Combined with memorization analyses that expose the unintuitive behavior of language models, we hope to motivate considering memorization as a critical metric when increasing language model scale.Most work studying memorization in language modeling is primarily motivated by privacy (see § 2). While theoretically, there are well-established frameworks to quantify privacy such as differential privacy [70], empirical privacy in language modeling is not well-defined — does memorizing common knowledge count as information leakage? Does outputting a synonym count as harmful memorization? As per our Definition 1, we implicitly focus on information that is sensitive if outputted verbatim (phone numbers, SSNs, addresses, medical diagnoses, etc.), rather than capturing all aspects of privacy. It is also known that text data used for training language models contain certain biases and stereotypes (e.g., [71]); therefore, our work has similar implications for how long language models can train before they definitively memorize these biases from training data. We also hope our work highlights the importance of analyzing memorization dynamics as we scale up language models, instead of only reporting cross entropy. Cross-entropy loss and memorization capture different behavior — for example, in many of our memory degradation experiments, even though memorization approaches a baseline, we observe that perplexity is still increasing. This implies that the model is becoming unconfident about the exact predictions, which we can only conclude because we inspect loss and memorization. More importantly, the forgetting baseline behavior would be entirely obscured if we did not inspect memorization dynamics. Similarly, there are multiple instances where we uncover interesting behavior because we focus on memorization dynamics (§ 4.4, § 4.3, § A.3), rather than focusing only on cross-entropy loss.", "question": "这篇论文的结论和研究价值是什么", "<ans>": "本文证明了较大的模型记忆得更快。本文还测量了遗忘曲线的性质,并惊奇地发现遗忘达到了一个基线,该基线又随着模型规模的增加而增加。结合揭示语言模型非直观行为的记忆分析,我们希望在增加语言模型规模时,鼓励将记忆作为一个关键指标。语言建模中的经验隐私没有很好的定义——背常识算不算信息泄露?输出一个同义词算不算有害记忆?根据本文的定义 1, 本文含蓄地关注逐字输出的敏感信息(电话号码、社会保障号、地址、医疗诊断等。),而不是捕捉隐私的方方面面。还已知用于训练语言模型的文本数据包含某些偏见和定型(例如[71]);因此,本文对于语言模型在从训练数据中明确记住这些偏见之前可以训练多长时间具有类似的意义。本文强调了当扩大语言模型时分析记忆动态的重要性,而不仅仅是报告交叉熵。交叉熵损失和记忆捕获不同的行为。"}