Rainbow-DemoRL: Combining Improvements in Demonstration-Augmented Reinforcement Learning

UC San Diego

Abstract

Several approaches have been proposed to improve the sample efficiency of online reinforcement learning (RL) by leveraging demonstrations collected offline. The offline data can be used directly as transitions to optimize the RL objectives, or an offline policy and value function can first be inferred from the data for online finetuning or to provide reference actions. While each of these strategies has shown compelling results, it is not clear which method has the most impact on final performance, whether these approaches can be combined, and if there are cumulative benefits. We classify existing demonstration-augmented RL approaches into three categories and perform an extensive empirical study of their strengths, weaknesses, and combinations to isolate the contribution of each strategy and identify effective hybrid combinations for sample-efficient online RL.

Method

Inspired by Rainbow-DQN, which clarified the importance of various extensions to the DQN algorithm, Rainbow-DemoRL performs a large-scale empirical study to identify how different strategies for leveraging offline demonstrations can be combined to maximize online reinforcement learning efficiency and performance.

A Taxonomy of Demonstration-Augmented RL


We classify existing approaches into three distinct, non-mutually exclusive strategies.

Strategy A

TLDR: Using the offline dataset directly

Direct Data Sampling

Uses the offline data directly within the online training loop, e.g., by prefilling the replay buffer (as in RLPD) or adding an auxiliary behavior-cloning loss to the RL actor objective.
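As a concrete illustration of Strategy A, the minimal sketch below draws each training batch partly from the demonstration buffer and partly from the online replay buffer, so expert transitions remain oversampled throughout training. The 50/50 ratio, function name, and list-based buffer representation are illustrative assumptions, not the exact recipe used in the paper.

```python
# Minimal sketch of Strategy A (direct data sampling): mix demonstration and
# online transitions in every batch so expert data stays oversampled.
import numpy as np

def sample_mixed_batch(demo_buffer, online_buffer, batch_size=256, demo_fraction=0.5):
    """Draw a batch with a fixed fraction of demonstration transitions.

    Both buffers are assumed to be lists of (obs, action, reward, next_obs, done)."""
    n_demo = int(batch_size * demo_fraction)
    n_online = batch_size - n_demo
    demo_idx = np.random.randint(len(demo_buffer), size=n_demo)
    online_idx = np.random.randint(len(online_buffer), size=n_online)
    batch = [demo_buffer[i] for i in demo_idx] + [online_buffer[i] for i in online_idx]
    np.random.shuffle(batch)  # avoid any ordering effects in the update
    return batch
```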

Strategy B

TLDR: Using pretrained model weights

Pretraining

Extracts an initial policy or value function from the offline data, providing a strong starting point for online fine-tuning. This can be done via behavior cloning (BC), offline RL methods like CQL and CalQL, or via a simple recipe we propose, Monte-Carlo-Q (MCQ).
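The sketch below illustrates one way such critic pretraining could look, regressing Q(s, a) toward discounted Monte-Carlo returns computed from demonstration trajectories, in the spirit of the MCQ description above. The network interface, optimizer, and hyperparameters are assumptions, not the paper's exact configuration.

```python
# Hedged sketch: pretrain a critic on Monte-Carlo returns from demonstrations.
import torch
import torch.nn as nn

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return-to-go for a single demonstration trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def pretrain_q(q_net, demo_trajectories, epochs=50, lr=3e-4, gamma=0.99):
    """Regress Q(s, a) toward Monte-Carlo returns.

    Each trajectory is assumed to be a list of (obs, action, reward) tuples, and
    q_net an nn.Module mapping the concatenated [obs, action] vector to a scalar."""
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    data = []
    for traj in demo_trajectories:
        rets = monte_carlo_returns([r for (_, _, r) in traj], gamma)
        data += [(o, a, g) for (o, a, _), g in zip(traj, rets)]
    obs = torch.tensor([o for o, _, _ in data], dtype=torch.float32)
    act = torch.tensor([a for _, a, _ in data], dtype=torch.float32)
    tgt = torch.tensor([g for _, _, g in data], dtype=torch.float32).unsqueeze(-1)
    for _ in range(epochs):
        pred = q_net(torch.cat([obs, act], dim=-1))
        loss = nn.functional.mse_loss(pred, tgt)
        opt.zero_grad(); loss.backward(); opt.step()
    return q_net
```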

Strategy C

TLDR: Using outputs from pretrained models

Control Priors

Uses a pre-trained offline policy as a reference for online actions. This involves "mixing" actions via residual RL, uncertainty-based interpolation (CHEQ), or selection based on Q-values (IBRL).
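The sketch below illustrates the Q-value-based selection flavor of a control prior, in the spirit of IBRL: both the online actor and the pretrained prior propose an action, and the critic decides which one to execute. The function signature and single-observation handling are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a control prior via Q-based action selection.
import torch

@torch.no_grad()
def select_action(obs, rl_policy, prior_policy, q_net):
    """Pick between the RL action and the prior action using the critic.

    obs is assumed to be a single (unbatched) observation tensor; the policies
    map obs to an action tensor, and q_net scores a concatenated [obs, action]."""
    a_rl = rl_policy(obs)        # action proposed by the online RL actor
    a_prior = prior_policy(obs)  # action proposed by the pretrained offline policy
    q_rl = q_net(torch.cat([obs, a_rl], dim=-1))
    q_prior = q_net(torch.cat([obs, a_prior], dim=-1))
    return a_rl if q_rl >= q_prior else a_prior
```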

Results

We evaluate hybrid algorithms combining strategies, using Soft Actor-Critic (SAC) as the base online RL algorithm. For each task-robot setting, we present the Sample Efficiency Improvement (SEI) scores of hybrid algorithms relative to SAC. The SEI score is the normalized percentage improvement in the area under the success-rate learning curve, averaged across all experimental settings, so the metric captures both learning speed and final success rate.
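For reference, the sketch below shows one plausible way to compute an SEI-style score from success-rate learning curves: the percentage improvement in area under the curve relative to the SAC baseline, averaged over settings. The exact normalization used in the paper may differ; this is an illustrative reading of the description above.

```python
# Hedged sketch of an SEI-style score from success-rate curves.
import numpy as np

def auc(steps, success_rates):
    """Area under a success-rate learning curve (trapezoidal rule)."""
    return np.trapz(success_rates, steps)

def sei_score(settings):
    """Average % AUC improvement of a hybrid method over SAC across settings.

    `settings` is assumed to be a list of dicts with keys 'steps', 'hybrid',
    and 'sac' holding aligned arrays of environment steps and success rates."""
    improvements = []
    for s in settings:
        auc_hybrid = auc(s["steps"], s["hybrid"])
        auc_sac = auc(s["steps"], s["sac"])
        improvements.append(100.0 * (auc_hybrid - auc_sac) / max(auc_sac, 1e-8))
    return float(np.mean(improvements))
```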


From the results, the following conclusions are most useful for practitioners looking to improve sample efficiency of online RL with demonstrations:

Skip conservative critic pretraining: Offline RL methods that learn conservative critics require a recalibration phase for online adaptation. This introduces a sample overhead that often outweighs the initialization benefits.
Use buffer prefill: Prefilling the online agent's replay buffer with expert transitions and oversampling them (Strategy A) is the single most consistent driver of sample efficiency across all tasks.

Check out the full paper for analysis of the isolated impact of each strategy variant and how it varies with task difficulty.

BibTeX

@misc{bhatt2025rainbow,
  title={Rainbow-DemoRL: Combining Improvements in Demonstration-Augmented Reinforcement Learning},
  author={Dwait Bhatt and Shih-Chieh Chou and Nikolay Atanasov},
  year={2025},
  url={https://dwaitbhatt.com/Rainbow-DemoRL/}
}