Results 231 to 240 of about 442,274 (281)
Some of the following articles may not be open access.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Neural Information Processing Systems, 2023
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining ...
Rafael Rafailov +5 more
semanticscholar +1 more source
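The DPO abstract above describes reparameterizing the reward so that preference pairs can be optimized directly. A minimal sketch of the resulting per-pair loss, assuming sequence log-probabilities under the policy and a frozen reference model (the function name and the beta default are illustrative, not taken from the listing):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * implicit-reward margin).

    logp_w / logp_l: policy log-probs of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same log-probs under the reference model.
    """
    # Implicit reward of each response is beta * (log pi - log pi_ref);
    # the loss pushes the chosen response's implicit reward above the rejected one's.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; preferring the chosen response relative to the reference drives the loss below that.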
Eureka: Human-Level Reward Design via Coding Large Language Models
International Conference on Learning Representations, 2023
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem.
Y. Ma +8 more
semanticscholar +1 more source
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
arXiv.org, 2023
Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often ...
Jacob Eisenstein +11 more
semanticscholar +1 more source
Journal of the American College of Radiology, 2011
For much of the 20th century, psychologists and economists operated on the assumption that work is devoid of intrinsic rewards, and the only way to get people to work harder is through the use of rewards and punishments. This so-called carrot-and-stick model of workplace motivation, when applied to medical practice, emphasizes the use of financial ...
Richard B. Gunderman, Aaron P. Kamer
openaire +2 more sources
Neuroscience Letters, 2008
We report a highly significant regional increase of the BOLD response in the caudate nucleus in a group of Danish Christians while performing silent religious prayer. The effect was found in a main-effect analysis of high-structured and low-structured religious recitals relative to comparable secular recitals and to a non-narrative baseline.
Uffe Schjødt +3 more
openaire +4 more sources
Trends in Neurosciences, 2003
Advances in neurobiology permit neuroscientists to manipulate specific brain molecules, neurons and systems. This has led to major advances in the neuroscience of reward. Here, it is argued that further advances will require equal sophistication in parsing reward into its specific psychological components: (1) learning (including explicit and implicit ...
Kent C. Berridge, Terry E. Robinson
openaire +2 more sources
SimPO: Simple Preference Optimization with a Reference-Free Reward
Neural Information Processing Systems
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability.
Yu Meng, Mengzhou Xia, Danqi Chen
semanticscholar +1 more source
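The SimPO abstract describes a reference-free reward: the average log-probability of a response under the policy itself, with a target margin between chosen and rejected responses. A minimal sketch under those assumptions (function names, and the beta and gamma defaults, are illustrative, not from the listing):

```python
import math

def simpo_reward(logp, length, beta=2.0):
    # Length-normalized implicit reward: no reference model is needed.
    return (beta / length) * logp

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """Per-pair SimPO-style loss: -log sigmoid(reward margin - target margin gamma)."""
    margin = simpo_reward(logp_w, len_w, beta) - simpo_reward(logp_l, len_l, beta) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The length normalization discourages the policy from inflating reward simply by generating longer responses, and dropping the reference model removes a second forward pass per example.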
ToolRL: Reward is All Tool Learning Needs
arXiv.org
Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios.
Cheng Qian +7 more
semanticscholar +1 more source
2019
Neurons throughout frontal cortex show robust responses to rewards, but a challenge is determining the specific function served by these different reward signals. Most neuropsychiatric disorders involve dysfunction of circuits between frontal cortex and subcortical structures, such as the striatum. There are multiple frontostriatal loops, and different ...
openaire +3 more sources
Behavioral and Brain Sciences, 2020
The costs of and returns from actions are varied and individually concrete dimensions, combined in heterogeneous ways. The many needs of the body also fluctuate. Making action selection efficiently track some ultimate goal, whether fitness or another utility function, itself requires representational abstraction.
openaire +2 more sources

