Adaptive Q-Aid for Conditional Supervised Learning
in Offline Reinforcement Learning

NeurIPS 2024

Jeonghye Kim1,  Suyoung Lee1,  Woojun Kim2,  Youngchul Sung1

1 KAIST  2 Carnegie Mellon University   

Abstract

Offline reinforcement learning (RL) has progressed with return-conditioned supervised learning (RCSL), but its lack of stitching ability remains a limitation. We introduce Q-Aided Conditional Supervised Learning (QCS), which effectively combines the stability of RCSL with the stitching capability of Q-functions. By analyzing Q-function over-generalization, which impairs stable stitching, QCS adaptively integrates Q-aid into RCSL's loss function based on trajectory return. Empirical results show that QCS significantly outperforms RCSL and value-based methods, consistently achieving or exceeding the highest trajectory returns across diverse offline RL benchmarks.


Summary

Conceptual idea of QCS: follow RCSL when learning from optimal trajectories, where RCSL predicts actions confidently but the Q-function may stitch incorrectly; conversely, lean on the Q-function when learning from sub-optimal trajectories, where RCSL is less certain but the Q-function is likely accurate.

Despite its simplicity, the effectiveness of QCS is empirically substantiated across offline RL benchmarks, where it achieves significant gains over existing state-of-the-art methods, both RCSL and value-based. Notably, QCS surpasses the maximum trajectory return of the dataset across diverse MuJoCo datasets with varying degrees of sub-optimality.


When Is Q-Aid Beneficial for RCSL?

The dataset quality that favors RCSL is the opposite of what benefits the Q-greedy policy. RCSL tends to perform well by mimicking actions in high-return datasets, whereas the Q-greedy policy excels on sub-optimal datasets but performs notably poorly on optimal ones.

Why Does Q-Greedy Policy Struggle with Optimal Datasets?

The learned Q-function is prone to over-generalization: trained on near-identical action values from optimal trajectories, it assigns similar Q-values to out-of-distribution (OOD) actions. As a result, greedy action selection becomes noise-sensitive, potentially producing incorrect action values and shifts in the state distribution at test time.
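This effect can be probed with a toy sketch like the one below (an illustrative assumption, not the paper's analysis): fit a small Q-network on a dataset whose actions are narrow and whose return targets are near-identical, then compare its predictions on in-dataset versus random OOD actions. The network, toy data, and training schedule are all hypothetical.

import torch
import torch.nn as nn

# Toy probe of Q over-generalization on an "optimal" dataset:
# one expert action per state, near-identical return targets.
torch.manual_seed(0)
states = torch.rand(256, 4)
expert_actions = torch.tanh(states @ torch.randn(4, 2))    # narrow action coverage
targets = 10.0 + 0.01 * torch.randn(256, 1)                # near-identical values

q_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
for _ in range(2000):
    pred = q_net(torch.cat([states, expert_actions], dim=-1))
    loss = nn.functional.mse_loss(pred, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# If the fitted Q is nearly flat in the action input, random OOD actions
# receive Q-values close to the expert value, so a greedy argmax over
# actions is decided by small, noise-like differences.
ood_actions = torch.rand(256, 2) * 2 - 1
q_expert = q_net(torch.cat([states, expert_actions], dim=-1))
q_ood = q_net(torch.cat([states, ood_actions], dim=-1))
print("mean Q on expert actions:", q_expert.mean().item())
print("mean Q on OOD actions:   ", q_ood.mean().item())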


Q-Aided Conditional Supervised Learning

Given this complementary relationship, where RCSL excels at mimicking narrow optimal datasets and the Q-function becomes a more effective critic when trained on diverse datasets with varied actions and Q-values, QCS applies a varying degree of Q-aid to the RCSL loss based on the trajectory return of each sub-trajectory, as sketched below.
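A minimal PyTorch sketch of this objective, assuming a return-conditioned policy policy(states, returns_to_go) and a separately trained, frozen Q-function q_fn(states, actions); the linear weighting schedule alpha * (1 - R / R_max) is purely illustrative, not the paper's exact formulation.

import torch.nn.functional as F

def qcs_loss(policy, q_fn, states, actions, returns_to_go,
             traj_returns, max_return, alpha=1.0):
    """Sketch of a QCS-style objective: RCSL loss plus return-weighted Q-aid."""
    # RCSL term: return-conditioned behavior cloning on dataset actions.
    pred_actions = policy(states, returns_to_go)
    rcsl_loss = F.mse_loss(pred_actions, actions)

    # Adaptive weight: little or no Q-aid for near-optimal trajectories,
    # more Q-aid for sub-optimal ones (hypothetical linear schedule).
    w = alpha * (1.0 - (traj_returns / max_return).clamp(0.0, 1.0))

    # Q-aid term: encourage actions that the frozen critic rates highly.
    q_values = q_fn(states, pred_actions).squeeze(-1)
    return rcsl_loss - (w * q_values).mean()

In practice the critic would typically be pre-trained on the same offline dataset and held fixed, so gradients from the Q-aid term flow only through the policy's predicted actions.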


Results: Overall Performance

MuJoCo

AntMaze


Bibtex

@inproceedings{
  kim2024adaptive,
  title={Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning},
  author={Jeonghye Kim and Suyoung Lee and Woojun Kim and Youngchul Sung},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
}