Rainbow-DemoRL: Combining Improvements in Demonstration-Augmented Reinforcement Learning

UC San Diego

Abstract

Several approaches have been proposed to improve the sample efficiency of online reinforcement learning (RL) by leveraging demonstrations collected offline. The offline data can be used directly as transitions to optimize the RL objectives, or an offline policy and value function can first be inferred from the data for online finetuning or to provide reference actions. While each of these strategies has shown compelling results, it is not clear which method has the most impact on final performance, whether these approaches can be combined, and whether their benefits are cumulative. We classify existing demonstration-augmented RL approaches into three categories and perform an extensive empirical study of their strengths, weaknesses, and combinations to quantify the contribution of each strategy and identify effective hybrid combinations for sample-efficient online RL.

Method

Inspired by Rainbow-DQN, which clarified the importance of various extensions to the DQN algorithm, Rainbow-DemoRL performs a large-scale empirical study to identify how different strategies for leveraging offline demonstrations can be combined to maximize online reinforcement learning efficiency and performance.

A Taxonomy of Demonstration-Augmented RL

Taxonomy of strategies

We classify existing approaches into three distinct, non-mutually exclusive strategies.

Strategy A

Direct Data Sampling

Uses offline data directly within the online training loop. This includes prefilling the replay buffer with demonstration transitions (sampling each training batch 50/50 from offline and online data) or adding an auxiliary behavior cloning (BC) loss to the RL actor objective.
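As a rough illustration of Strategy A (not the paper's implementation), the sketch below assembles a 50/50 offline/online batch and adds an auxiliary BC term to a SAC-style actor loss. The policy, critic, buffer dictionaries, and the bc_weight coefficient are all hypothetical names and knobs.

# Illustrative sketch of Strategy A (hypothetical interfaces, not the paper's code):
# the actor update draws half its batch from the demonstration buffer and adds a
# behavior-cloning term on the demonstrated actions.
import torch

def actor_loss_with_bc(policy, critic, offline_batch, online_batch,
                       alpha=0.2, bc_weight=0.1):
    """SAC-style actor loss with an auxiliary BC term (bc_weight is an illustrative knob)."""
    # 50/50 batch: concatenate equal-sized offline and online samples.
    obs = torch.cat([offline_batch["obs"], online_batch["obs"]], dim=0)

    # Standard SAC actor objective on the mixed batch.
    actions, log_prob = policy.sample(obs)          # reparameterized action sample
    q_value = critic(obs, actions)
    sac_loss = (alpha * log_prob - q_value).mean()

    # Auxiliary BC loss: pull the policy toward the demonstrated actions.
    pred_demo_actions, _ = policy.sample(offline_batch["obs"])
    bc_loss = ((pred_demo_actions - offline_batch["actions"]) ** 2).mean()

    return sac_loss + bc_weight * bc_loss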

Strategy B

Pretraining

Extracts an initial policy or value function, providing a strong starting point for online fine-tuning. This can be done via offline RL (e.g., CQL or CalQL), or via a simple recipe we propose, Monte-Carlo-Q (MCQ).
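The description above gives only the name of the proposed MCQ recipe, so the sketch below rests on an assumption: that Monte-Carlo-Q means regressing a critic toward discounted Monte Carlo returns computed along the demonstration trajectories before online fine-tuning. All function and field names are illustrative.

# Hedged sketch of critic pretraining from demonstrations, ASSUMING "Monte-Carlo-Q"
# means fitting Q(s, a) to discounted return-to-go targets along each demonstration
# trajectory; the paper defines the actual recipe.
import torch

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return-to-go for one demonstration trajectory (list of floats)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def pretrain_critic(critic, optimizer, demo_trajectories, gamma=0.99, epochs=10):
    """Fit the critic to Monte Carlo targets; each trajectory holds tensors "obs", "actions" and a list "rewards"."""
    for _ in range(epochs):
        for traj in demo_trajectories:
            targets = torch.tensor(monte_carlo_returns(traj["rewards"], gamma))
            q_pred = critic(traj["obs"], traj["actions"]).squeeze(-1)
            loss = torch.nn.functional.mse_loss(q_pred, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()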

Strategy C

Control Priors

Uses a pretrained offline policy as a reference for online actions. Actions are mixed via residual RL, interpolated based on uncertainty (CHEQ), or selected based on Q-values (IBRL).
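Two of these mixing schemes are sketched below with hypothetical interfaces: an IBRL-style selection that lets the critic choose between the RL action and the prior's action, and a residual-RL combination where the online policy only learns a correction on top of the prior. This is an illustration of the general idea, not the cited methods' exact rules.

# Illustrative Strategy C sketches (hypothetical interfaces; obs is a single,
# unbatched observation).
import torch

def select_action_ibrl_style(obs, rl_policy, prior_policy, critic):
    """Q-value-based selection between the RL action and the offline prior's action."""
    a_rl = rl_policy(obs)
    a_prior = prior_policy(obs)
    # Keep whichever action the critic scores higher.
    return a_rl if float(critic(obs, a_rl)) >= float(critic(obs, a_prior)) else a_prior

def residual_action(obs, prior_policy, residual_policy, scale=1.0):
    """Residual RL: apply a learned correction on top of the prior's action."""
    return prior_policy(obs) + scale * residual_policy(obs)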

Results

We evaluate hybrid algorithms combining strategies A+B, A+C, B+C, and A+B+C with several variants for each, using Soft Actor-Critic as the base online RL algorithm. For each task-robot setting, we show i) a comparison of the area under the success rate curve (AUC) across all methods and ii) the per-component impact.
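For reference, a minimal way to compute the AUC metric mentioned above, assuming success rates are recorded at fixed evaluation steps and normalizing by the step range (the normalization choice is an assumption, not necessarily the paper's):

# Area under the success-rate-vs-environment-steps curve, normalized so a method
# at 100% success throughout scores 1.0.
import numpy as np

def success_rate_auc(steps, success_rates):
    """Trapezoidal AUC of the success rate curve, normalized by total steps."""
    steps = np.asarray(steps, dtype=float)
    success_rates = np.asarray(success_rates, dtype=float)
    return np.trapz(success_rates, steps) / (steps[-1] - steps[0])

# Example: evaluations every 10k environment steps.
print(success_rate_auc([0, 10_000, 20_000, 30_000], [0.0, 0.4, 0.7, 0.9]))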

From the results, the following conclusions are most useful for practitioners:

Key Finding 1: Prefill hybrids dominate the top 10 algorithms across all environments and robots considered. Hence the replay-buffer prefill strategy, which entails 50/50 sampling from the offline data and the online replay buffer together with critic ensembles and a high update-to-data (UTD) ratio, is the most impactful ingredient for sample-efficient online RL (a sketch of this recipe follows below).
Key Finding 2: Prefill+MCQ variants are consistently the best performers across all tasks considered. Hence, pretraining with the proposed simple recipe, Monte-Carlo-Q (MCQ), outperforms pretraining with offline RL when combined with the prefill strategy.
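To make Key Finding 1 concrete, here is a hedged sketch of the update loop implied by the prefill recipe: batches drawn 50/50 from the demonstration and online buffers, a critic ensemble, and several critic updates per environment step (high UTD). All interfaces and hyperparameter values are illustrative, not the paper's configuration.

# Hedged sketch of the prefill recipe from Key Finding 1 (illustrative names and
# hyperparameters): 50/50 offline/online batches, a critic ensemble, high UTD.
def train_step(env, agent, offline_buffer, online_buffer, obs,
               batch_size=256, utd_ratio=8):
    """One environment step followed by `utd_ratio` critic updates and one actor update."""
    action = agent.act(obs)
    next_obs, reward, done, _ = env.step(action)
    online_buffer.add(obs, action, reward, next_obs, done)

    for _ in range(utd_ratio):
        # 50/50 sampling: half the batch from demonstrations, half from online data
        # (assuming sample() returns a list of transitions).
        batch = offline_buffer.sample(batch_size // 2) + online_buffer.sample(batch_size // 2)
        agent.update_critics(batch)   # each update trains the whole critic ensemble
    agent.update_actor(batch)

    return next_obs if not done else env.reset()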

BibTeX

@misc{bhatt2025rainbow,
  title={Rainbow-DemoRL: Combining Improvements in Demonstration-Augmented Reinforcement Learning},
  author={Dwait Bhatt and Shih-Chieh Chou and Nikolay Atanasov},
  year={2025},
  url={https://dwaitbhatt.com/Rainbow-DemoRL/}
}