Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts
arXiv:2603.22813v1 Announce Type: new Abstract: Humans often juggle multiple, sometimes conflicting objectives and shift their priorities as circumstances change, rather than following a fixed objective function. In contrast, most computational decision-making and multi-objective RL methods assume static preference weights or a known scalar reward. In this work, we study the sequential decision-making problem in which these preference weights are unobserved latent variables that drift with context. Specifically, we propose Dynamic Preference Inference (DPI), a cognitively inspired framework in which an agent maintains a probabilistic belief over preference weights, updates this belief from recent interaction, and conditions its policy on inferred preferences. We instantiate DPI as a variational preference inference module trained jointly with a preference-conditioned actor-critic, using vector-valued returns as evidence about latent trade-offs. In queueing, maze, and multi-objective continuous-control environments with event-driven changes in objectives, DPI adapts its inferred preferences to new regimes and achieves higher post-shift performance than fixed-weight and heuristic envelope baselines.
Executive Summary
This article addresses a critical gap in computational decision-making by proposing Dynamic Preference Inference (DPI), a novel framework that accounts for dynamic, context-dependent preference shifts, a phenomenon that conventional static-reward models do not capture. The authors introduce a cognitively inspired mechanism where an agent maintains probabilistic beliefs about latent preference weights, updates them from interaction data, and integrates the inferred preferences into policy decisions using a variational inference module coupled with a preference-conditioned actor-critic. Empirical evaluations across diverse environments (queueing, maze, and continuous control) demonstrate superior adaptability to regime shifts compared to baseline methods. The work bridges cognitive theory with reinforcement learning, offering a scalable approach for real-world applications where human preferences evolve over time.
Key Points
- DPI introduces a probabilistic belief-updating mechanism for latent preference weights under contextual shifts.
- The framework integrates variational inference with an actor-critic architecture to condition policies on inferred preferences.
- Empirical results show improved post-shift performance relative to fixed-weight and heuristic baselines.
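The core loop described above (maintain a belief over latent preference weights, use vector-valued returns as evidence, renormalize) can be illustrated with a minimal particle-style sketch. This is a hypothetical simplification for intuition only, not the paper's variational module: candidate weight vectors on the simplex stand in for the latent preference space, and a scalarized-return score stands in for the learned likelihood.

```python
import numpy as np

# Hypothetical sketch of DPI-style belief updating (not the authors' code):
# keep a discrete belief over candidate preference-weight vectors on the
# simplex, score each candidate by how well it explains the observed
# vector-valued return, and renormalize.

def update_belief(belief, candidates, vector_return, temperature=1.0):
    """One belief-update step.

    belief        : (N,) probabilities over candidate weight vectors
    candidates    : (N, K) candidate preference weights (rows on the simplex)
    vector_return : (K,) observed per-objective return
    """
    # Likelihood proxy: candidates whose scalarization of the observed
    # return is high are treated as more consistent with that return.
    scores = candidates @ vector_return            # (N,)
    logits = np.log(belief + 1e-12) + scores / temperature
    logits -= logits.max()                         # numerical stability
    posterior = np.exp(logits)
    return posterior / posterior.sum()

def inferred_weights(belief, candidates):
    """Posterior-mean preference weights used to condition the policy."""
    return belief @ candidates                     # (K,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = rng.dirichlet(np.ones(3), size=50)  # 50 hypotheses, 3 objectives
    belief = np.full(50, 1.0 / 50)                   # uniform prior
    # After a contextual shift, objective 0 dominates observed returns.
    for _ in range(20):
        r = np.array([5.0, 0.5, 0.5]) + rng.normal(0.0, 0.1, size=3)
        belief = update_belief(belief, candidates, r)
    w = inferred_weights(belief, candidates)
    print(w)  # belief mass concentrates on candidates favoring objective 0
```

Under repeated evidence favoring one objective, the posterior mean drifts toward weight vectors that emphasize it, which is the qualitative behavior the abstract attributes to DPI after a regime shift.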
Merits
Strength in Cognitive Alignment
DPI’s design aligns with human behavioral patterns of dynamic preference evolution, enhancing ecological validity and applicability in human-in-the-loop systems.
Empirical Validation Across Domains
The framework’s adaptability is substantiated across multiple distinct environments, demonstrating generalizability.
Demerits
Complexity in Implementation
The integration of variational preference inference with actor-critic training may introduce computational overhead and require specialized expertise for deployment.
Limited Generalizability to Non-Latent Preference Systems
DPI assumes latent preference drift; it may not apply directly to systems where preferences are explicitly observable or fixed.
Expert Commentary
The authors make a significant theoretical and empirical contribution by formalizing a mechanism for inferring dynamically changing preferences—a concept that has been historically underrepresented in RL literature. Their use of vector-valued returns as evidence for latent trade-offs is innovative, as it transforms inherently noisy environmental feedback into structured, probabilistic signals about preference evolution. Importantly, DPI avoids the common pitfall of assuming fixed reward functions, which are fundamentally misaligned with human cognition. The cognitive inspiration grounds the methodology in behavioral science, lending credibility to the model’s predictive power. Moreover, the empirical validation across diverse domains—queueing, maze, and continuous-control—suggests robustness beyond niche applications. That said, the reliance on variational inference may limit scalability in real-time applications requiring low-latency decision-making. A future direction could explore lightweight approximations or hybrid architectures that balance accuracy with computational efficiency. Overall, DPI represents a paradigm shift toward more human-aligned decision-making systems.
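To make concrete how a preference-conditioned policy changes behavior as inferred weights shift, here is a toy illustration, again a hypothetical sketch rather than the paper's architecture: per-action vector values are scalarized with the current preference weights, so the greedy action flips when the inferred regime flips.

```python
import numpy as np

# Hypothetical illustration of preference-conditioned action selection:
# each action has a vector value (one entry per objective); the policy
# scalarizes these vectors with the currently inferred weights, so the
# same value table yields different behavior in different regimes.

def select_action(q_vectors, weights):
    """q_vectors: (A, K) per-action vector values; weights: (K,) on the simplex."""
    scalarized = q_vectors @ weights       # (A,)
    return int(np.argmax(scalarized))

if __name__ == "__main__":
    # Two actions, two objectives: action 0 favors throughput, action 1 latency.
    q = np.array([[4.0, 1.0],
                  [1.0, 4.0]])
    print(select_action(q, np.array([0.9, 0.1])))  # 0: throughput-heavy regime
    print(select_action(q, np.array([0.1, 0.9])))  # 1: latency-heavy regime
```

This also makes the commentary's scalability concern tangible: the scalarization itself is cheap; the latency cost in DPI would come from the variational belief update feeding it.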
Recommendations
- Researchers should extend DPI to hybrid architectures combining variational inference with sparse coding or neuro-symbolic systems for improved scalability.
- Industry practitioners should pilot DPI in adaptive user interfaces or customer-service bots where preference drift is empirically observed.
Sources
Original: arXiv - cs.AI