Results 271 to 280 of about 233,719 (314)
Some of the following articles may not be open access.

Hacking

2023
This chapter aims to present a theoretical foundation on hacking, focusing on the perpetrator's profile, modus operandi, and typologies. First, the phenomenon's key terms are conceptualized and characterized. Next, the chapter addresses the historical evolution of the perception of the phenomenon, from the moment of its ...
Carolina Roque   +3 more
openaire   +2 more sources

Reward Shaping to Mitigate Reward Hacking in RLHF

arXiv.org
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning ...
Jiayi Fu   +5 more
semanticscholar   +1 more source

ODIN: Disentangled Reward Mitigates Hacking in RLHF

International Conference on Machine Learning
In this work, we study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even ...
Lichang Chen   +8 more
semanticscholar   +1 more source

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

arXiv.org
Reward hacking--where agents exploit flaws in imperfect reward functions rather than performing tasks as intended--poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper ...
Mia Taylor   +4 more
semanticscholar   +1 more source

Hacked off

Nursing Standard, 1988
The new book on occupational health for nurses called Nurses At Risk written by [illegible word] Salvage, former Nursing Standard journalist and Rosemary Rogers, currently our Clinical News Editor, received widespread press coverage.
openaire   +2 more sources

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Neural Information Processing Systems
Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge.
Yuchun Miao   +5 more
semanticscholar   +1 more source

Feedback Loops With Language Models Drive In-Context Reward Hacking

International Conference on Machine Learning
Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents.
Alexander Pan   +3 more
semanticscholar   +1 more source

Book review: Christopher Hadnagy, Social Engineering: The Science of Human Hacking

European Journal of Research in Applied Sciences
Christopher Hadnagy, Social Engineering: The Science of Human Hacking (2nd ed.). John Wiley & Sons, Inc, 2018, 320 pp., ₹1,645.
Meghna Chukkath
semanticscholar   +1 more source

Modeling Malicious Hacking Data Breach Risks

North American Actuarial Journal, 2020
Malicious hacking data breaches cause millions of dollars in financial losses each year, and more companies are seeking cyber insurance coverage. The lack of suitable statistical approaches for scoring breach risks is an obstacle in the insurance ...
Hong Sun, Maochao Xu, Peng Zhao
semanticscholar   +1 more source

AI-powered growth hacking: benefits, challenges and pathways

Management Decision
Purpose: This paper aims to (1) unveil how artificial intelligence (AI) can be implemented in growth-hacking strategies; and (2) identify the challenges and enabling factors associated with AI's implementation in these strategies. Design/methodology ...
Gabriele Santoro   +3 more
semanticscholar   +1 more source
