SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [PDF]
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities.
Carlos E. Jimenez +6 more
semanticscholar +1 more source
Toolformer: Language Models Can Teach Themselves to Use Tools [PDF]
Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much ...
Timo Schick +7 more
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot [PDF]
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy.
Elias Frantar, Dan Alistarh
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [PDF]
Empowering large language models to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-
Miao Xiong +6 more
Large Language Models Can Be Easily Distracted by Irrelevant Context [PDF]
Large language models have achieved impressive performance on various natural language processing tasks. However, so far they have been evaluated primarily on benchmarks where all information in the input context is relevant for solving the task. In this
Freda Shi +7 more
Can Large Language Models Be an Alternative to Human Evaluations? [PDF]
Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering ...
Cheng-Han Chiang, Hung-yi Lee
HellaSwag: Can a Machine Really Finish Your Sentence? [PDF]
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as “A woman sits at a piano,” a machine must select the most likely followup: “She sets her fingers on the keys.” With ...
Rowan Zellers +4 more
Towards VQA Models That Can Read [PDF]
Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today’s VQA models can not read!
Amanpreet Singh +7 more
CSPNet: A New Backbone that can Enhance Learning Capability of CNN [PDF]
Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from ...
Chien-Yao Wang +5 more
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them [PDF]
BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models.
Mirac Suzgun +10 more

