MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI [PDF]
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
Xiang Yue +21 more
semanticscholar +1 more source
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [PDF]
When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT).
Pan Lu +8 more
semanticscholar +1 more source
Improving Factuality and Reasoning in Language Models through Multiagent Debate [PDF]
Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of
Yilun Du +4 more
semanticscholar +1 more source
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [PDF]
This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application ...
Yejin Bang +12 more
semanticscholar +1 more source
PIQA: Reasoning about Physical Commonsense in Natural Language [PDF]
To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems.
Yonatan Bisk +4 more
semanticscholar +1 more source
How the Custom Suppresses the Endowment Effect: Exchange Paradigm in Kanak Country
In this paper, Knetsch's exchange paradigm is analyzed from the perspective of pragmatics and social norms. In this paradigm the participant, at the beginning of the experiment, receives an object from the experimenter and at the end, the same ...
Jean Baratgin +5 more
doaj +1 more source
According to the weak version of linguistic relativity, also called the Sapir-Whorf hypothesis, the features of an individual’s native language influence his worldview and perception.
Jing Shao +4 more
doaj +1 more source
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning [PDF]
Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions.
Ahmed Masry +4 more
semanticscholar +1 more source
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets.
Drew A. Hudson, Christopher D. Manning
semanticscholar +1 more source
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models [PDF]
Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks that require solving problems harder than the exemplars shown in the prompts.
Denny Zhou +9 more
semanticscholar +1 more source

