Project & Service
Fundamental Theories and Algorithms of Large Models
Project Introduction
Since the release of ChatGPT, large models have made remarkable advances in AI and beyond. However, both academia and industry still lack a foundational understanding of how these models work, and current efforts to improve their efficiency and reduce their costs rely mainly on heuristic engineering. Because large-scale experiments are expensive, this approach incurs significant trial-and-error costs and limits the depth and scope of progress in large-model technology. To address this, the project aims to develop mathematical theories of large models, uncover the mechanisms underlying their performance, and create efficient foundational algorithms, thereby advancing both scientific research on and industrial applications of large models.
Key Research
The project focuses on two main areas:
1. Using mathematical tools such as optimization and statistics to model large models based on their structural characteristics and training methods. This involves analyzing the mechanisms behind phenomena such as general capability acquisition, domain-specific skill transfer, knowledge retention, and hallucination, in order to build a foundational mathematical framework for large models that is verified through systematic experiments.
2. Developing theoretically grounded, efficient algorithms that improve model performance in training and inference, including memory-efficient pretraining, fine-tuning, and alignment algorithms, anti-forgetting training algorithms, and hallucination-mitigation algorithms.
Main Outcomes
1. Theory of Large Model Training Based on Hessian Matrix Block Heterogeneity
The Transformer is the dominant network architecture for large models. Unlike convolutional neural networks (CNNs), which are trained mainly with stochastic gradient descent (SGD), Transformers typically rely on Adam(W) optimizers: SGD clearly underperforms Adam on Transformers, and the reasons have remained unclear, hampering our understanding of training mechanisms and the choice of optimizer. By analyzing the Hessian matrix of Transformers, we found that different parameter blocks exhibit markedly different Hessian spectra, a property we call "block heterogeneity," whereas CNNs exhibit "block homogeneity." Block heterogeneity degrades SGD's performance because SGD applies a single learning rate to all parameters and cannot adapt to each block's characteristics, while Adam assigns an adaptive learning rate to every parameter and thus mitigates the issue.
We validated, across multiple architectures and training tasks, that block heterogeneity is a key cause of SGD's underperformance on Transformers. On the theory side, we constructed a convex quadratic model with block heterogeneity and proved that Adam outperforms gradient descent (the prototype of SGD) in its presence. We also developed an inter-block Jensen-Shannon (JS) distance metric that quantifies a network's block heterogeneity, making it possible to predict SGD's performance relative to Adam before committing to a full training run and thereby aiding optimizer selection; a sketch of the metric follows. This work was published at ICML 2024 [1].
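To make the metric concrete, here is a minimal sketch in Python (NumPy/SciPy), assuming the per-block Hessian eigenvalue spectra have already been estimated (for example, via a spectral approximation method); the helper names and log-scale binning are illustrative choices, not the authors' released code.

    # Minimal sketch (not the authors' released code) of the inter-block JS
    # distance: bin each block's approximate Hessian eigenvalues into a shared
    # log-scale histogram, then average the pairwise Jensen-Shannon distances.
    import numpy as np
    from scipy.spatial.distance import jensenshannon  # JS distance = sqrt(JS divergence)

    def spectrum_histogram(eigvals, bins):
        # Turn one block's eigenvalues into a probability histogram over
        # log-magnitudes, so spectra spanning many orders of magnitude
        # remain comparable.
        logs = np.log10(np.abs(eigvals) + 1e-12)
        hist, _ = np.histogram(logs, bins=bins)
        return hist / max(hist.sum(), 1)

    def inter_block_js(block_spectra, n_bins=100):
        # block_spectra: list of 1-D arrays of (approximate) Hessian eigenvalues,
        # one per parameter block. A larger output indicates stronger block
        # heterogeneity, suggesting Adam over SGD for this network.
        all_logs = np.concatenate([np.log10(np.abs(s) + 1e-12) for s in block_spectra])
        bins = np.linspace(all_logs.min(), all_logs.max() + 1e-9, n_bins + 1)
        hists = [spectrum_histogram(s, bins) for s in block_spectra]
        pairs = [(i, j) for i in range(len(hists)) for j in range(i + 1, len(hists))]
        return float(np.mean([jensenshannon(hists[i], hists[j]) for (i, j) in pairs]))

In this sketch, a large average pairwise JS distance signals strong block heterogeneity, which, per the theory above, predicts that SGD will lag Adam on that network.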
2. Memory-Efficient Large Model Optimizer: Adam-mini
We proposed a new neural network optimizer, "Adam-mini," which reduces memory usage during training while matching or improving model performance. Guided by the Hessian block structure, Adam-mini partitions the model parameters into blocks and assigns a single learning rate to each block. This removes over 99.9% of the per-parameter adaptive learning rates in standard AdamW and cuts the optimizer's memory cost by roughly 50%. Adam-mini performs on par with or better than AdamW on large-model training tasks, including the GPT-2 and Llama series, across pretraining, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF); for Llama 2-7B pretraining on 2x A800-80GB GPUs, it delivered 49.6% higher throughput and a 33% reduction in training time [2]. A simplified sketch of the core update follows.
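The sketch below illustrates the core rule under simplified assumptions (plain PyTorch tensors and an ad hoc state dictionary rather than the released optimizer class); the blockwise partitioning is taken as given.

    # Minimal sketch of Adam-mini's core rule (not the released optimizer):
    # keep Adam's per-parameter momentum, but replace AdamW's per-parameter
    # second moment with a SINGLE scalar per parameter block.
    import torch

    @torch.no_grad()
    def adam_mini_step(blocks, state, lr=1e-3, betas=(0.9, 0.999),
                       eps=1e-8, weight_decay=0.01):
        # blocks: list of parameter tensors, one per block (e.g. per attention
        # head or per layer, following the Hessian block structure).
        b1, b2 = betas
        state["t"] = state.get("t", 0) + 1
        t = state["t"]
        for i, p in enumerate(blocks):
            st = state.setdefault(i, {"m": torch.zeros_like(p), "v": 0.0})
            st["m"].mul_(b1).add_(p.grad, alpha=1 - b1)                       # per-parameter momentum
            st["v"] = b2 * st["v"] + (1 - b2) * p.grad.pow(2).mean().item()  # ONE scalar per block
            m_hat = st["m"] / (1 - b1 ** t)
            v_hat = st["v"] / (1 - b2 ** t)
            p.mul_(1 - lr * weight_decay)                                     # decoupled weight decay
            p.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))                   # block-shared step size

Because "v" is one scalar per block rather than one per parameter, the optimizer state shrinks by nearly the full size of AdamW's second-moment buffer, which is where the roughly 50% memory saving comes from.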
3. Efficient RLHF Algorithm for Large Models: ReMax
Reinforcement Learning from Human Feedback (RLHF) is critical for aligning large models with human preferences: by incorporating human feedback, it enables models to produce content that better matches human needs and values, greatly enhancing controllability and practicality. However, RLHF is resource-intensive, demanding substantial GPU memory and computation, which is a major obstacle in large-scale training. This project proposes a new reinforcement learning method, "ReMax," designed to reduce the memory and computational demands of RLHF training. ReMax exploits three properties of RLHF (fast simulation, deterministic state transitions, and trajectory-level rewards) to improve the classic REINFORCE algorithm, eliminating the value model required by current RLHF approaches and thereby sharply reducing memory usage, training time, and tuning difficulty. A minimal sketch of the update follows.
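The following sketch conveys the update rule; the policy and reward-model interfaces (sample, greedy, log_prob, score) are hypothetical placeholders for illustration, not an actual API.

    # Minimal sketch of ReMax's update (hypothetical interfaces, illustration
    # only): REINFORCE with the reward of the model's own greedy response as a
    # baseline, so no value network needs to be trained at all.
    import torch

    def remax_loss(policy, reward_model, prompts):
        # policy.sample / policy.greedy / policy.log_prob and reward_model.score
        # are assumed interfaces, not a real library API.
        sampled = policy.sample(prompts)                  # stochastic responses
        with torch.no_grad():
            greedy = policy.greedy(prompts)               # deterministic baseline responses
            advantage = (reward_model.score(prompts, sampled)
                         - reward_model.score(prompts, greedy))  # trajectory-level reward gap
        log_prob = policy.log_prob(prompts, sampled)      # sum of token log-probabilities
        # REINFORCE: maximize advantage-weighted log-likelihood of sampled responses.
        return -(advantage * log_prob).mean()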
Compared with PPO, the standard RLHF algorithm, ReMax reduces GPU memory usage by 46% when training a 7-billion-parameter model and requires no value-model optimization, significantly lowering the cost of alignment training. In experiments, RLHF training of Mistral-7B with ReMax achieved a 94.78% win rate on AlpacaEval and a score of 7.739 on MT-Bench, setting a new benchmark among open-source 7B models. This work was published at ICML 2024 [3].
4. Anti-Forgetting Training Algorithm for Large Models: MoFO
This project introduces "MoFO" (Momentum-Filtered Optimizer), an algorithm designed to address knowledge forgetting during large model fine-tuning. Fine-tuning, the process of re-training a pre-trained model on a small amount of data, is a key method for enhancing a model's capabilities and adapting it to new domains. During fine-tuning, however, a model may lose knowledge acquired in pre-training, degrading its general capabilities, which is a significant obstacle to the practical deployment of large models.
MoFO mitigates forgetting by updating, at each iteration, only the parameters with the largest momentum, keeping the model closer to its pre-trained state. This effectively reduces forgetting while maintaining fine-tuning performance. In fine-tuning experiments with models such as Llama-2-7B and TinyLlama-1.1B, MoFO matched standard full-parameter fine-tuning on downstream tasks while preserving or even improving general capabilities, outperforming full-parameter fine-tuning on benchmarks such as GSM8K (math reasoning), ARC-Easy (commonsense reasoning), and HumanEval (code generation). The technical report on MoFO is publicly available [4]. A simplified sketch of the filtering rule follows.
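Below is a simplified, illustrative sketch of the momentum filter for a single parameter block; the function name and the blockwise granularity are assumptions for exposition, not the released implementation.

    # Minimal sketch of MoFO's momentum filtering (simplified): within each
    # parameter block, apply the Adam-style update only to the fraction `alpha`
    # of entries with the largest momentum magnitude, leaving all other entries
    # at their current (near pre-trained) values.
    import torch

    @torch.no_grad()
    def mofo_filtered_update(param, m_hat, v_hat, lr, alpha=0.1, eps=1e-8):
        k = max(1, int(alpha * param.numel()))
        # Threshold at the k-th largest |momentum| entry in this block.
        thresh = m_hat.abs().flatten().topk(k).values.min()
        mask = (m_hat.abs() >= thresh).to(param.dtype)
        param.add_(mask * m_hat / (v_hat.sqrt() + eps), alpha=-lr)  # unmasked entries stay put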
[1] Zhang, Yushun, et al. "Why Transformers Need Adam: A Hessian Perspective." ICML 2024.
[2] Zhang, Yushun, et al. "Adam-mini: Use Fewer Learning Rates to Gain More." arXiv preprint arXiv:2406.16793 (2024).
[3] Li, Ziniu, et al. "ReMax: A Simple, Effective, and Efficient Method for Aligning Large Language Models." ICML 2024.
[4] Chen, Yupeng, et al. "MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning." arXiv preprint arXiv:2407.20999 (2024).
Application Scenarios and Impact
1. Hessian Matrix Block Heterogeneity Analysis Theory: This theory enables qualitative predictions about the performance of SGD versus Adam-type algorithms for large AI models, helping to select the appropriate training optimizer. It has potential applications across all large-model training scenarios.
2. Adam-mini: Suitable for training tasks across a range of deep learning architectures, including large language models, vision models, graph models, and diffusion models. Adam-mini has been well received in the large-model community, with over 10,000 downloads per month; users include the large-model training teams of major technology companies, such as the Meta Llama team, as well as university AI for Science research groups.
3. ReMax Algorithm: Widely applicable to model alignment scenarios, ReMax is particularly valuable in large model construction projects with limited computing resources or tight deadlines.
4. MoFO Algorithm: Useful in fine-tuning scenarios for large models, especially in adapting general models to specific domains. MoFO has already been implemented in Huawei GTS team’s large model products.
Collaboration
The project team comprises research scientists, professors, PhD students, and research assistants from the Shenzhen Institute of Data Economy, the Shenzhen International Center for Industrial and Applied Mathematics, and The Chinese University of Hong Kong, Shenzhen. Their work spans theoretical analysis, algorithm development, experimental evaluation, and technology transfer, ensuring that the outcomes have both theoretical depth and practical value. The team has collaborated with Huawei's GTS division and 2012 Labs on several projects, identifying key challenges in large-model training and application and successfully integrating project outcomes into real-world products.