Research Papers Presented by Awarded Students at the Third Doctoral and Postdoctoral Daoyuan Forum
The Third Doctoral and Postdoctoral Daoyuan Academic Forum was successfully held on January 13th. Jointly organized by the Shenzhen Research Institute of Big Data and the Chinese University of Hong Kong, Shenzhen, the forum aims to provide doctoral students and postdoctoral researchers with a broad platform for exchange, encouraging the clash of ideas, the sharing of experience, and deeper collaboration, and thereby strengthening the academic atmosphere, innovative capacity, research quality, and research output of both groups. The call for papers opened in November 2023 and drew active participation and support from departments across the university.
During this forum, Yingru Li from the School of Science and Engineering (SSE) and Junwen Qiu from the School of Data Science (SDS) were awarded the first and second prizes, respectively, in the oral presentation category. Shaokui Wei and Ziniu Li, from SDS, received the first and second prizes, respectively, in the poster presentation category.
The following are the papers shared by the award-winning students.
In this poster presentation, we delve into the issue of backdoor attacks in machine learning models and introduce a novel method for purifying a backdoored model using a small clean dataset. Backdoor attacks involve an adversary's manipulation of the training set with poisoned samples to create a model that behaves normally on benign data but redirects specific trigger-embedded inputs to target classes. Our research connects the backdoor risk with adversarial risk and develops a new upper bound that focuses on shared adversarial examples (SAEs) between the contaminated and purified models. Leveraging this insight, we formulate a bi-level optimization problem for combating backdoors using adversarial training techniques. The proposed approach, Shared Adversarial Unlearning (SAU), operates in two stages: first, it generates SAEs; then, it strategically unlearns these SAEs to ensure they are correctly classified by the purified model or lead to dissimilar predictions between the two models. This process effectively mitigates the backdoor effect in the original model. Empirical evaluations across multiple benchmark datasets and network architectures demonstrate that our SAU method achieves state-of-the-art performance in defending against backdoor attacks.
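As a rough illustration of the two-stage procedure described above, the sketch below crafts shared adversarial examples with a PGD-style inner loop and then unlearns them. It assumes PyTorch classifiers `backdoored` and `purified` (the latter initialized from the contaminated model); the loss weights and the exact form of the disagreement term are illustrative choices, not the paper's precise objective.

```python
import torch
import torch.nn.functional as F

def craft_shared_adv(backdoored, purified, x, y, eps=8/255, alpha=2/255, steps=10):
    """PGD-style search for shared adversarial examples (SAEs): perturbations
    that push both the backdoored and the purified model away from the true label."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = (F.cross_entropy(backdoored(x + delta), y)
                + F.cross_entropy(purified(x + delta), y))
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach()

def sau_step(backdoored, purified, optimizer, x, y, lam=0.1):
    """One illustrative outer step: unlearn the SAEs so that the purified model
    either classifies them correctly or disagrees with the backdoored model."""
    x_adv = craft_shared_adv(backdoored, purified, x, y)
    with torch.no_grad():
        pred_b = backdoored(x_adv).argmax(dim=1)       # backdoored model's (possibly targeted) labels
    logits_p = purified(x_adv)
    loss_clean = F.cross_entropy(purified(x), y)       # preserve accuracy on clean data
    loss_adv = F.cross_entropy(logits_p, y)            # classify SAEs correctly
    loss_disagree = -F.cross_entropy(logits_p, pred_b) # or break agreement with the backdoored model
    loss = loss_clean + loss_adv + lam * loss_disagree
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```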
Under resource constraints, reinforcement learning (RL) agents deployed in complex environments need to be simple, efficient, and scalable with respect to (1) large state spaces and (2) the ever-growing volume of interaction data. We propose HyperAgent, an RL framework built on a hypermodel, index sampling schemes, and an incremental update mechanism, enabling computation-efficient sequential posterior approximation and data-efficient action selection under general value function approximation beyond conjugacy. HyperAgent is simple to implement: it adds only one module and one line of code on top of DDQN.
Practically, HyperAgent demonstrates robust performance on large-scale deep RL benchmarks, with significant efficiency gains in both data and computation. Theoretically, among practically scalable algorithms, HyperAgent is the first to achieve provably scalable per-step computational complexity together with sublinear regret in tabular RL. The core of our theoretical analysis is a sequential posterior approximation argument, made possible by the first analytical tool for sequential random projection, a non-trivial martingale extension of the Johnson-Lindenstrauss lemma that is of independent interest.
This work bridges the theoretical and practical realms of RL, establishing a new benchmark for RL algorithm design.
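To make the "one extra module" idea concrete, the following sketch shows a hypermodel head that maps a random index z to the weights of a linear Q-head on top of a standard DDQN feature torso, so that sampling z yields an approximate posterior sample of the Q-function. The layer sizes, the Gaussian index distribution, and the helper names are assumptions for illustration and do not reproduce the released HyperAgent implementation.

```python
import torch
import torch.nn as nn

class HypermodelHead(nn.Module):
    """Illustrative hypermodel head: maps a random index z to the weights of a
    linear Q-head, so each sampled z induces a sampled Q-function."""
    def __init__(self, feat_dim, num_actions, index_dim=32):
        super().__init__()
        self.feat_dim = feat_dim
        self.num_actions = num_actions
        self.index_dim = index_dim
        # The single extra module added on top of a standard DDQN torso:
        self.hyper = nn.Linear(index_dim, (feat_dim + 1) * num_actions)

    def forward(self, features, z):
        # features: (batch, feat_dim); z: (batch, index_dim) random index, e.g. z ~ N(0, I)
        params = self.hyper(z).view(-1, self.num_actions, self.feat_dim + 1)
        w, b = params[..., :-1], params[..., -1]
        return torch.einsum('bf,baf->ba', features, w) + b  # Q(s, .; z)

def select_action(torso, head, obs, z):
    """Greedy action under the Q-function indexed by the sampled z."""
    with torch.no_grad():
        q = head(torso(obs.unsqueeze(0)), z.unsqueeze(0))
    return int(q.argmax(dim=1))
```

Resampling z (for example, once per episode) and acting greedily with respect to Q(s, ·; z) would replace epsilon-greedy exploration in DDQN with index sampling.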
In this work, we present an unbiased stochastic proximal gradient method, the normal map-based algorithm (nor-SGD), for nonsmooth nonconvex composite optimization problems, and we study its convergence properties. Using a time window-based strategy and a suitable merit function, we first analyze the global convergence behavior of nor-SGD and show that every accumulation point of the generated sequence of iterates is a stationary point almost surely and in expectation. These results hold under standard assumptions and extend the more limited convergence guarantees of the basic proximal stochastic gradient method. In addition, based on the well-known Kurdyka-Lojasiewicz (KL) analysis framework, we establish novel pointwise convergence results for the iterates generated by nor-SGD and derive convergence rates that depend on the KL exponent and the step-size dynamics. The obtained rates are faster than related existing rates for SGD in the nonconvex setting. The techniques studied in this work can potentially be applied to other families of stochastic and simulation-based algorithms.
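The sketch below illustrates a normal map-based stochastic update for an L1-regularized instance, min_x f(x) + mu*||x||_1: the iterate z is moved along a stochastic normal map, and its proximal point is the reported solution. The constant step size, the parameter names, and the choice of regularizer are illustrative assumptions; the paper treats general nonsmooth nonconvex composite problems and dynamic step sizes.

```python
import numpy as np

def prox_l1(z, t):
    """Proximal operator of t * ||x||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def nor_sgd(stoch_grad, z0, mu=0.1, lam=1.0, step=1e-2, iters=1000):
    """Sketch of a normal map-based stochastic proximal gradient iteration for
    min_x f(x) + mu * ||x||_1, where stoch_grad(x) returns an unbiased
    stochastic gradient of the smooth part f at x."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(iters):
        x = prox_l1(z, lam * mu)            # x_k = prox_{lam * mu * ||.||_1}(z_k)
        g = stoch_grad(x)                   # unbiased stochastic gradient of f at x_k
        z = z - step * (g + (z - x) / lam)  # step along the stochastic normal map
    return prox_l1(z, lam * mu)             # report the proximal (sparse) point
```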
Alignment is crucial for training large language models. The predominant strategy is Reinforcement Learning from Human Feedback (RLHF), with Proximal Policy Optimization (PPO) as the de facto algorithm. Yet PPO is known to suffer from computational inefficiency, a challenge this paper aims to address. We identify three important properties of RLHF tasks that PPO does not exploit: fast simulation, deterministic transitions, and trajectory-level rewards. Based on these properties, we develop ReMax, a new algorithm tailored for RLHF. The design of ReMax builds on the celebrated REINFORCE algorithm but is enhanced with a new variance-reduction technique.
ReMax offers three advantages over PPO. First, it is simple to implement, requiring just 6 lines of code, and it eliminates more than 4 hyper-parameters of PPO that are laborious to tune. Second, ReMax reduces memory usage by removing the need for the value model used in PPO. To illustrate, PPO runs out of memory when directly fine-tuning a Llama2-7B model on A100-80GB GPUs, whereas ReMax supports the training. Even when memory-efficient techniques (e.g., ZeRO and offloading) are employed to make PPO training feasible, ReMax can use a larger batch size to increase throughput. Third, in terms of wall-clock time, PPO is about twice as slow as ReMax per iteration. Importantly, these improvements do not sacrifice task performance.
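A minimal sketch of the ReMax gradient estimator described above: a REINFORCE-style loss in which the reward of the greedy response serves as the variance-reduction baseline. Here `policy.sample`, `policy.greedy`, `policy.log_prob`, and `reward_fn` are hypothetical helpers standing in for rollout, greedy decoding, log-likelihood computation, and reward-model scoring; this is not the authors' released implementation.

```python
import torch

def remax_loss(policy, reward_fn, prompts):
    """Sketch of the ReMax estimator: REINFORCE with the reward of the greedy
    response used as a baseline to reduce variance."""
    responses = policy.sample(prompts)                 # stochastic rollouts from the current policy
    with torch.no_grad():
        baseline = reward_fn(prompts, policy.greedy(prompts))  # reward of the greedy response
        reward = reward_fn(prompts, responses)                 # trajectory-level reward
        advantage = reward - baseline
    log_probs = policy.log_prob(prompts, responses)    # sum of token log-probs per response
    # REINFORCE-style objective: maximize E[(r - baseline) * log pi(y | x)]
    return -(advantage * log_probs).mean()
```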