Some of the following articles may not be open access.
Humans or LLMs as the Judge? A Study on Judgement Bias
Conference on Empirical Methods in Natural Language Processing
Adopting humans and large language models (LLMs) as judges (a.k.a. human-as-a-judge and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention.
Guiming Hardy Chen +4 more
semanticscholar +1 more source
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
International Conference on Learning Representations
LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and served as supervised rewards in model training. However, despite their excellence in many domains, potential issues are under-explored, undermining their ...
Jiayi Ye +11 more
semanticscholar +1 more source
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging ...
Tianhao Wu +7 more
semanticscholar +1 more source
The Impact of Artificial Intelligence on the Right to a Fair Trial: Towards a Robot Judge?
Asian Journal of Law and Economics, 2020
This paper seeks to examine the potential influences AI may have on the right to a fair trial when it is used in the courtroom. Essentially, AI systems can assume two roles in the courtroom.
Jasper Ulenaers
semanticscholar +1 more source
Annual Review of Psychology, 2020
Deceptive claims surround us, embedded in fake news, advertisements, political propaganda, and rumors. How do people know what to believe? Truth judgments reflect inferences drawn from three types of information: base rates, feelings, and consistency with information retrieved from memory.
Nadia M. Brashier, Elizabeth J. Marsh
openaire +2 more sources
Tenured public officials such as judges are often thought to be indifferent to the concerns of the electorate and, as a result, potentially lacking in discipline but unlikely to pander to public opinion. We investigate this proposition empirically using data on promotion decisions taken by senior English judges between 1985 and 2005. Throughout this period ...
Jordi Blanes i Vidal, Clare Leaver
openaire +3 more sources
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, these LLM agents introduce unexpected safety risks when operating in interactive environments. Instead of centering ...
Tongxin Yuan +11 more
semanticscholar +1 more source
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
arXiv.org
LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text, typically text that is also generated by an LLM.
Michael Krumdick +4 more
semanticscholar +1 more source
Agent-as-a-Judge: Evaluate Agents with Agents
International Conference on Machine Learning
Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of agentic systems, or require excessive manual labour.
Mingchen Zhuge +12 more
semanticscholar +1 more source
Applying Code Quality Detection in Online Programming Judge
International Conferences on Intelligent Information Technology, 2020
This article presents an enhanced online programming judge system that not only evaluates the correctness of the submitted program code but also detects its code quality.
Xiao Liu, G. Woo
semanticscholar +1 more source

