Benchmark for Assessing Olfactory Perception of Large Language Models
arXiv:2604.00002v1 Announce Type: cross Abstract: Here we introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, primary odor descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representations. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean ≈ +7 points), suggesting current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4% overall accuracy, which highlights both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP benchmark across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with AUROC = 0.86 for the best-performing language-ensemble model. LLMs should be able to handle olfactory information, not just visual or aural information.
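To make the two prompt formats concrete, here is a minimal sketch of how one benchmark-style question might be posed under each representation. The question wording, option set, and `make_prompts` helper are illustrative assumptions, not the paper's actual prompts; vanillin's name and SMILES string are standard chemistry facts.

```python
# A minimal sketch, not the authors' code: building the two prompt variants
# (compound name vs. isomeric SMILES) for one benchmark-style question.
# The question text, options, and helper are illustrative assumptions.

QUESTION = "Which primary odor descriptor best matches this compound? Options: {options}"

def make_prompts(name: str, smiles: str, options: list[str]) -> dict[str, str]:
    """Return the same question phrased with each molecular representation."""
    opts = ", ".join(options)
    return {
        "compound_name": f"Compound: {name}\n" + QUESTION.format(options=opts),
        "isomeric_smiles": f"Compound (SMILES): {smiles}\n" + QUESTION.format(options=opts),
    }

# Vanillin is a standard example: its common name appears far more often in
# text corpora than its SMILES string, one plausible reason name prompts
# outperform SMILES prompts in the reported results.
prompts = make_prompts("vanillin", "COc1cc(C=O)ccc1O", ["vanilla", "sulfurous", "citrus"])
for fmt, prompt in prompts.items():
    print(f"--- {fmt} ---\n{prompt}\n")
```

Holding the question fixed while swapping only the molecular representation is what lets the benchmark attribute the accuracy gap to the representation itself rather than to task difficulty.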
Executive Summary
The article introduces the Olfactory Perception (OP) benchmark, a novel framework to evaluate large language models' (LLMs) ability to reason about olfactory information. Comprising 1,010 questions across eight task categories, the benchmark tests odor classification, descriptor identification, intensity/pleasantness judgments, and more. Evaluations of 21 model configurations reveal that compound-name prompts significantly outperform isomeric SMILES representations (a mean gain of roughly +7 percentage points), indicating reliance on lexical over structural molecular reasoning. The top model achieves 64.4% accuracy, demonstrating emerging capabilities but also substantial gaps. Cross-language aggregation further improves performance (AUROC = 0.86). The study underscores the need for LLMs to integrate multi-sensory data beyond visual and auditory inputs, marking a critical step toward holistic AI cognition.
Key Points
- Introduces the first benchmark (OP) to systematically assess LLMs' olfactory reasoning capabilities across 1,010 questions in eight task categories.
- Demonstrates that compound-name prompts outperform isomeric SMILES by a mean of roughly +7 percentage points, suggesting LLMs rely on lexical associations rather than structural molecular reasoning.
- Cross-language aggregation enhances performance (AUROC = 0.86), indicating multilingual ensembles improve olfactory prediction accuracy; a sketch of this aggregation idea follows the list.
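The language-ensemble result can be pictured with a small sketch: average each question's per-language scores, then evaluate the aggregate with AUROC. The languages, scores, and labels below are made-up placeholders, and scikit-learn's `roc_auc_score` is assumed as the metric implementation; the paper's actual aggregation protocol may differ.

```python
# A minimal sketch of cross-language score aggregation; all data is illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score  # standard scikit-learn metric

# Hypothetical per-language model scores for a binary olfactory question
# (e.g., "is this compound pleasant?"), one entry per question.
scores_by_language = {
    "en": np.array([0.9, 0.2, 0.7, 0.4]),
    "de": np.array([0.8, 0.3, 0.6, 0.5]),
    "ja": np.array([0.7, 0.1, 0.8, 0.3]),
}
labels = np.array([1, 0, 1, 0])  # illustrative ground truth

# Aggregate by averaging scores across languages, then score the ensemble.
ensemble = np.mean(list(scores_by_language.values()), axis=0)
print("ensemble AUROC:", roc_auc_score(labels, ensemble))
```

Averaging across languages acts like a lightweight ensemble: idiosyncratic errors of any single language's phrasing tend to cancel, which is consistent with the reported AUROC improvement.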
Merits
Pioneering Framework
The OP benchmark is the first comprehensive tool to evaluate LLMs' olfactory perception, filling a critical gap in AI reasoning capabilities.
Rigorous Evaluation
The study evaluates 21 model configurations across major families, providing robust comparative insights into their performance.
Multilingual Insights
Cross-language analysis reveals that aggregating predictions improves performance, offering novel perspectives on language-agnostic reasoning.
Demerits
Lexical Bias Limitation
The significant performance gap between compound-name and SMILES prompts suggests LLMs lack true structural molecular reasoning, relying instead on memorized lexical associations.
Modest Accuracy Gains
Despite progress, the top model's 64.4% accuracy highlights substantial room for improvement in olfactory reasoning.
Narrow Task Scope
The benchmark focuses on knowledge-based and classification tasks, potentially overlooking real-world olfactory interaction and adaptation.
Expert Commentary
This study represents a seminal contribution to the field of AI reasoning, particularly in addressing the long-standing neglect of olfactory perception in large language models. The OP benchmark’s design is meticulous, covering a broad spectrum of olfactory tasks that reflect both theoretical and practical challenges in odor perception. The finding that compound-name prompts outperform SMILES representations is particularly telling, as it exposes a fundamental limitation in current LLMs: their reliance on superficial lexical associations rather than deep structural understanding. This mirrors similar biases observed in other domains, such as legal or medical reasoning, where models often prioritize pattern matching over causal or mechanistic insights. The cross-language aggregation results are promising, suggesting that multilingual ensembles can mitigate some of these biases, though the underlying structural reasoning gap remains. For AI to achieve true multimodal cognition, future work must focus on integrating neurosymbolic approaches that combine symbolic knowledge (e.g., odor descriptors) with neural representations of molecular structures. The modest accuracy of the top-performing model underscores the urgency of this task, particularly for applications where olfactory reasoning is critical.
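One way to picture the neurosymbolic direction proposed above is to concatenate symbolic odor-descriptor flags with a structural molecular fingerprint and feed both to a classifier. The sketch below assumes RDKit for Morgan fingerprints and scikit-learn for the classifier; the descriptor vocabulary, toy molecules, and labels are illustrative placeholders, not OP benchmark data.

```python
# A hedged sketch of a neurosymbolic featurization: symbolic descriptor flags
# concatenated with a structural Morgan fingerprint. RDKit and scikit-learn
# are assumed; the tiny dataset and its labels are purely illustrative.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LogisticRegression

DESCRIPTORS = ["sweet", "floral", "sulfurous"]  # assumed symbolic vocabulary

def featurize(smiles: str, descriptors: set[str]) -> np.ndarray:
    """Structural Morgan fingerprint + one-hot symbolic descriptor flags."""
    mol = Chem.MolFromSmiles(smiles)
    fp = np.array(list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=256)), dtype=float)
    sym = np.array([d in descriptors for d in DESCRIPTORS], dtype=float)
    return np.concatenate([fp, sym])

# Toy examples: (SMILES, known descriptors, pleasant?) -- labels are placeholders.
data = [
    ("COc1cc(C=O)ccc1O", {"sweet"}, 1),       # vanillin
    ("CCO", set(), 0),                        # ethanol (placeholder label)
    ("O=Cc1ccccc1", {"sweet", "floral"}, 1),  # benzaldehyde (almond-like)
    ("SCC", {"sulfurous"}, 0),                # ethanethiol (placeholder label)
]
X = np.stack([featurize(s, d) for s, d, _ in data])
y = np.array([label for _, _, label in data])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```

The point of the sketch is the feature split, not the classifier: the symbolic flags carry the lexical knowledge LLMs already exploit, while the fingerprint carries the structural signal the benchmark shows they currently lack.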
Recommendations
- Develop hybrid training methodologies that explicitly combine lexical and structural chemical data to improve LLMs' molecular reasoning capabilities (a minimal data-construction sketch follows this list).
- Expand the OP benchmark to include real-world olfactory interaction tasks, such as dynamic odor adaptation or context-dependent smell perception, to better reflect practical applications.
- Collaborate with neuroscientists and chemists to design biologically inspired olfactory models that can be integrated into LLMs, bridging the gap between human-like reasoning and current AI limitations.
- Establish standardized evaluation protocols for olfactory benchmarks to enable fair comparisons across models and encourage industry-wide adoption.
- Explore reinforcement learning or active learning to fine-tune LLMs on olfactory tasks, leveraging human feedback to improve performance in niche domains such as flavor science and environmental monitoring.
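As a concrete illustration of the first recommendation, a hybrid training example could pair a compound's lexical handle with its structural representation in a single record. The JSONL layout, field names, and `hybrid_record` helper below are assumptions for illustration only, not a format from the paper.

```python
# A hedged sketch of recommendation 1: tie a compound's name, structure, and
# odor label together in one instruction-tuning record. The JSONL format and
# field names are assumptions, not from the paper.
import json

def hybrid_record(name: str, smiles: str, descriptor: str) -> str:
    """One fine-tuning example combining lexical and structural signals."""
    prompt = (
        f"Compound name: {name}\n"
        f"Isomeric SMILES: {smiles}\n"
        "What is this compound's primary odor descriptor?"
    )
    return json.dumps({"prompt": prompt, "completion": descriptor})

print(hybrid_record("vanillin", "COc1cc(C=O)ccc1O", "vanilla"))
```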
Sources
Original: arXiv - cs.AI