
Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

open access: yes (arXiv.org)
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with reinforcement learning ...
Keertana Chidambaram   +2 more
semanticscholar   +4 more sources
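The abstract above contrasts the standard RLHF pipeline (fit a reward model on preference data, then update the policy with reinforcement learning) with direct preference optimization. As a point of reference only, here is a minimal sketch of the standard binary-preference DPO loss; it does not reproduce the paper's ternary-preference method, and the tensor names, shapes, and the beta value are illustrative assumptions.

```python
# Minimal sketch of the standard (binary-preference) DPO objective.
# Not the paper's ternary-preference variant; names and shapes are assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Prefer the chosen response over the rejected one, measured as
    log-probability margins relative to a frozen reference model, scaled by beta."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(logits), averaged over the batch of preference pairs
    return F.logsigmoid(logits).neg().mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    torch.manual_seed(0)
    lp = lambda: torch.randn(4)
    print(dpo_loss(lp(), lp(), lp(), lp()).item())
```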
