Adaptive Q-Aid for Conditional Supervised Learning
in Offline Reinforcement Learning
NeurIPS 2024
Jeonghye Kim1, Suyoung Lee1, Woojun Kim2, Youngchul Sung1
1 KAIST 2 Carnegie Mellon University
Abstract
Offline reinforcement learning (RL) has progressed with return-conditioned supervised learning (RCSL), but its lack of stitching ability remains a limitation. We introduce Q-Aided Conditional Supervised Learning (QCS), which effectively combines the stability of RCSL with the stitching capability of Q-functions. By analyzing Q-function over-generalization, which impairs stable stitching, QCS adaptively integrates Q-aid into RCSL's loss function based on trajectory return. Empirical results show that QCS significantly outperforms RCSL and value-based methods, consistently achieving or exceeding the highest trajectory returns across diverse offline RL benchmarks.
Summary
Conceptual idea of QCS: Follow RCSL when learning from optimal trajectories where it predicts actions confidently but the Q-function may stitch incorrectly. Conversely, refer to the Q-function when learning from sub-optimal trajectories where RCSL is less certain but the Q-function is likely accurate.
Despite its simplicity, the effectiveness of QCS is empirically substantiated across offline RL benchmarks, where it demonstrates significant gains over existing state-of-the-art methods, including both RCSL and value-based methods. In particular, QCS surpasses the maximum dataset trajectory return across diverse MuJoCo datasets under varying degrees of sub-optimality.
When Is Q-Aid Beneficial for RCSL?
We observe that the dataset quality that favors RCSL is the opposite of that which benefits the Q-greedy policy. RCSL tends to perform well by mimicking actions in high-return trajectory datasets, whereas the Q-greedy policy excels on sub-optimal datasets but performs notably poorly on optimal ones.
Why Does Q-Greedy Policy Struggle with Optimal Datasets?
The learned Q-function is prone to over-generalization: because it is trained only on the near-identical actions of optimal trajectories, it assigns similar Q-values even to out-of-distribution (OOD) actions. As a result, the Q-function becomes noise-sensitive, potentially producing incorrect action values and shifting the state distribution during the test phase.
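As a rough, self-contained illustration of this effect (not code from the paper), the toy probe below fits a small Q-network on data whose actions are nearly identical and then checks how much the fitted values change when out-of-distribution actions are substituted for the same states. All names, network sizes, and data are illustrative assumptions, and the printed number will vary from run to run; the point is only that narrow action coverage gives the critic little basis to distinguish OOD actions.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "optimal" dataset: states vary, but actions are nearly identical (narrow action coverage).
N, state_dim, action_dim = 2048, 4, 2
states = torch.randn(N, state_dim)
actions = 0.8 + 0.01 * torch.randn(N, action_dim)    # near-identical expert actions
targets = states.sum(dim=1, keepdim=True)            # stand-in for the return targets the critic regresses to

q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

for _ in range(2000):                                # plain regression in place of TD learning, for simplicity
    pred = q_net(torch.cat([states, actions], dim=1))
    loss = ((pred - targets) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Probe: how much does Q change when OOD actions replace the dataset actions for the same states?
with torch.no_grad():
    ood_actions = torch.rand(N, action_dim) * 2 - 1  # actions never seen during training
    q_data = q_net(torch.cat([states, actions], dim=1))
    q_ood = q_net(torch.cat([states, ood_actions], dim=1))
    print("mean |Q(s, a_data) - Q(s, a_ood)|:", (q_data - q_ood).abs().mean().item())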
Q-Aided Conditional Supervised Learning
RCSL excels at mimicking narrow, optimal datasets, while the Q-function becomes a more effective critic when trained on diverse datasets containing varied actions and Q-values. Exploiting this complementary relationship, QCS applies a varying degree of Q-aid to the RCSL loss based on the return of the sub-trajectory being learned, as sketched below.
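To make this concrete, here is a minimal PyTorch-style sketch of such an adaptively Q-aided loss. The schedule q_aid_weight, the callables policy and q_net, and the tensor shapes are illustrative assumptions rather than the paper's exact implementation; the intent is only that the weight on the Q-maximization term shrinks as the normalized trajectory return grows, so high-return sub-trajectories rely on pure RCSL while low-return ones lean on the Q-function.

import torch
import torch.nn.functional as F

def q_aid_weight(traj_return_norm: torch.Tensor) -> torch.Tensor:
    # Illustrative schedule (assumed form): full Q-aid for the lowest-return
    # trajectories, no Q-aid for the highest-return ones.
    return (1.0 - traj_return_norm).clamp(0.0, 1.0)

def qcs_loss(policy, q_net, states, actions, returns_to_go, traj_return_norm):
    # Sketch of an adaptively Q-aided RCSL loss for one batch of sub-trajectories.
    #   states:           (B, T, state_dim)
    #   actions:          (B, T, action_dim)  -- dataset actions (supervision targets)
    #   returns_to_go:    (B, T, 1)           -- RCSL conditioning signal
    #   traj_return_norm: (B,)                -- trajectory return normalized to [0, 1]
    pred_actions = policy(states, returns_to_go)                                      # (B, T, action_dim)

    # RCSL term: imitate the dataset actions, conditioned on return-to-go.
    rcsl_loss = F.mse_loss(pred_actions, actions, reduction="none").mean(dim=(1, 2))  # (B,)

    # Q-aid term: push predicted actions toward high Q-values under a given critic.
    q_values = q_net(states, pred_actions)                                            # (B, T, 1)
    q_aid_loss = -q_values.mean(dim=(1, 2))                                           # (B,)

    # Adaptive mixing: sub-optimal trajectories receive more Q-aid.
    w = q_aid_weight(traj_return_norm)                                                # (B,)
    return (rcsl_loss + w * q_aid_loss).mean()

The sketch only shows how the two loss terms are mixed per sub-trajectory; the parameterization and training of the return-conditioned policy and the critic themselves follow the paper.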
Results: Overall Performance
MuJoCo
AntMaze
Bibtex
@inproceedings{kim2024adaptive,
  title={Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning},
  author={Jeonghye Kim and Suyoung Lee and Woojun Kim and Youngchul Sung},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}