# Aligning Language Models with Human Preferences - MSc Thesis

This repository contains my thesis for the Master's Degree in Computer Science and Engineering @ IST.

| Authors | Links |
| --- | --- |
| Martim Santos | https://martimfasantos.github.io/ |

| Supervisors | Links |
| --- | --- |
| André F. T. Martins | https://andre-martins.github.io/ |
| Sweta Agrawal | https://sweta20.github.io/ |

Final Grade: 18 / 20


## Abstract

Large language models (LLMs) are characterized by their remarkable ability to learn extensive world knowledge and generate human-like text across diverse applications. However, the generated text often contains misleading and toxic content, emphasizing the need to align LLMs with human values and preferences to ensure more useful and secure AI systems. A widely employed strategy in numerous prominent models, including OpenAI’s GPT-3.5 and GPT-4, involves Reinforcement Learning from Human Feedback (RLHF). While this method has demonstrated impressive outcomes, RLHF’s complexity, instability, and sensitivity to hyperparameters challenge its empirical success and usability across various real-life scenarios. Recent reinforcement learning-free (RL-free) approaches — such as DPO, CPO, SimPO, and SLiC — address these issues. In this study, we investigate whether the promising results of RL-free methods observed in larger models extend to small language models (SLMs). Focusing on machine translation and summarization, we assess the ability of these models to efficiently learn human preferences by evaluating the quality and human alignment of their outputs, as well as their capacity to avoid common biases. Specifically, we train three compact baseline models — TinyLlama 1.1B, Gemma-2 2B, and EuroLLM 1.7B — with several RL-free methods and compare their performance against baselines. By evaluating the effectiveness of RL-free methods on smaller LLMs, this work is the first to provide a comprehensive comparison of several feedback methods applied to state-of-the-art small language models (SLMs), contributing to the development of secure and accessible AI systems suitable for resource-constrained environments.
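The full training setup is described in the dissertation itself. Purely as an illustration of one of the RL-free objectives mentioned above (DPO), here is a minimal PyTorch sketch of the preference loss; the tensor names, batch size, and `beta` value are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective (Rafailov et al., 2023): push the policy to prefer the
    chosen completion over the rejected one, relative to a frozen reference
    model. Inputs are per-example summed log-probabilities of each completion."""
    # Log-ratio of chosen vs. rejected under the policy and the reference model
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Implicit reward margin, scaled by beta (illustrative default, not the thesis setting)
    margin = beta * (policy_logratios - ref_logratios)
    # Negative log-sigmoid of the margin, averaged over the batch
    return -F.logsigmoid(margin).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs
if __name__ == "__main__":
    lp = lambda: torch.randn(4)
    print(dpo_loss(lp(), lp(), lp(), lp()).item())
```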

Additional Key Words and Phrases: Language Models · Fine-tuning · Human Preferences · Alignment · Machine Translation · Summarization

## Models

🤖 All trained models are available here.


## Dissertation

Final MSc Degree Grade: 18 / 20