Reinforcement Learning from Human Feedback (RLHF) and Multimodal LLM Alignment: A Comprehensive Review
Jyotsna Shastry, Dr. Shweta Agrawal
10.7753/IJCATR1506.1001
Keywords: Reinforcement Learning from Human Feedback, Large Language Models, Multimodal Alignment, Direct Preference Optimization, Safe RLHF, Reward Modeling, AI Safety
Aligning Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with human values and preferences is one of the most important problems in AI research. Reinforcement Learning from Human Feedback (RLHF) is one of the leading approaches to this problem and to building helpful, harmless, and honest AI systems. This paper presents a comprehensive review of RLHF and its multimodal extensions, covering their theoretical foundations, algorithmic developments, and practical applications. The survey traces the development of RLHF from foundational reward modeling approaches to advanced methods such as PPO, DPO, RLHF-V, MM-RLHF, and FACT-RLHF. Special attention is given to the safety-focused variants Safe RLHF and Safe RLHF-V, and to red-teaming and robustness evaluation techniques. We also discuss open challenges such as reward hacking, scalable oversight, and integrating factual accuracy into preference learning. The main objective of this survey is to provide a conceptual map and technical reference for researchers and practitioners working on the alignment of foundation models.
@article{j1562026ijcatr15061001,
Title = "Reinforcement Learning from Human Feedback (RLHF) and Multimodal LLM Alignment: A Comprehensive Review",
Journal = "International Journal of Computer Applications Technology and Research (IJCATR)",
Volume = "15",
Number = "6",
Pages = "1--5",
Year = "2026",
Author = "Jyotsna Shastry and Shweta Agrawal"}